2024 REU Summer Projects
Comparison of Various Feature Selection Techniques for Binary Classification Tasks
REU student: Andriy Tryshnivskyy
Project description: In today's world, data is used in almost every field. Data is collected by hospitals, governments, schools, and even farmers. Various processing methods are used to turn these large quantities of information into results that can be comprehended. One such method is feature selection. For classification problems, feature selection filters the data to remove irrelevant information and can yield higher accuracy for classification models. For this project, several industry-standard feature selection techniques were implemented, including Information Gain, Forward Selection, Lasso Regression, and Chi-Squared with Simulated Annealing. In addition, a published research paper on a variant of Particle Swarm Optimization for feature selection was analyzed, and its strategy was implemented and tested as well. All of these techniques were compared using statistical measures.
Anomaly Detection in the Internet of Medical Things
REU student: Grayson Lottes
Project description: The Internet of Things (IoT) is a rapidly growing field, and the Internet of Medical Things (IoMT) has grown with it. Devices such as wristbands that track heart rate and oxygen levels collect data that can then be used to tailor care to the patient. Unfortunately, the rise of IoT has been accompanied by a rise in cyber-attacks. IoT devices are vulnerable to these attacks because they need a constant connection to the internet and have limited processing power. They are not typically built with the infrastructure to handle cyber-attacks that can alter data, halt device operations, or leak data. Moreover, because IoT devices are becoming increasingly popular, it has proved hard to create and enforce standards for building them. They can differ from hardware to software, which leaves some very susceptible to cyber-attacks. Deploying vulnerable devices is impractical because an attack can stop the device or subvert its intended function. Machine learning (ML) may offer a solution. Advanced ML models can potentially detect anomalies in high-dimensional data from IoT devices, and once an anomaly is detected, they can classify the data into a type of attack, e.g. Denial of Service. For this classification, I will be using the CICIoMT2024 dataset (CIC 2024). This dataset was collected from 40 different IoMT devices (25 real devices and 15 simulated devices) over three connection types: Wi-Fi, MQTT, and Bluetooth.
Within this dataset, the CIC used machine learning models for binary classification (attack data and benign data), 6-class classification (benign, spoofing attacks, Recon attacks, MQTT attacks, DoS attacks, and DDoS attacks), and 19-class classification (benign, ARP spoofing attacks, Recon Ping Sweep attacks, Recon VulScan attacks, Recon OS Scan attacks, Recon Port Scan attacks, MQTT Malformed Data attacks, MQTT DoS Connect Flood attacks, MQTT DDoS Connect Flood attacks, MQTT DoS Publish Flood attacks, MQTT DDoS Publish Flood attacks, DoS TCP attacks, DoS ICMP attacks, DoS SYN attacks, DoS UDP attacks, DDoS TCP attacks, DDoS ICMP attacks, DDoS SYN attacks, and DDoS UDP attacks). For this research, Bluetooth and profiling data are not used.
Diabetic Foot Ulcer Segmentation Using Machine Learning
REU student: Carlos Andrews
Project description: In 2021, diabetes affected over 500 million people and was the leading cause of nontraumatic lower extremity amputations. Diabetic foot ulcers (DFUs) affect 15% of diabetic patients, with 6% hospitalized for complications. In the U.S., 14-24% of patients with foot ulcers face amputation. Early DFU detection can reduce amputation risk, but it is challenging [1][2][3]. Machine learning models, especially image segmentation models, help identify DFUs in images. The 2022 DFU Challenge winner, HarDNet-DFUS, achieved a Dice score of 0.7287 [6]. We have implemented this model, as well as Eff-UNet, which applies a compound scaling method to the UNet architecture.
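The Dice score mentioned above measures the overlap between a predicted segmentation mask and the ground-truth mask: twice the number of shared foreground pixels divided by the total foreground pixels in both masks. The following is a minimal sketch in plain Python, not the challenge's evaluation code; the function name, the tiny example masks, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    # Foreground pixels in the prediction plus foreground pixels in the target.
    total = sum(1 for p in pred_flat if p) + sum(1 for t in target_flat if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 2x3 masks, used only to exercise the function.
predicted = [[0, 1, 1],
             [0, 1, 0]]
ground_truth = [[0, 1, 0],
                [0, 1, 1]]
print(dice_score(predicted, ground_truth))
```

In this toy case the masks share two foreground pixels out of three each, so the score is 2*2/(3+3), roughly 0.67.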
Classification Using Spark-Enabled Swarm Intelligence Algorithms
REU student: Kendrick Dahlin
Project description: Swarm intelligence (SI) describes a collection of models that imitate the behavior of natural swarms such as flocks of birds or colonies of ants. Individual entities in a group all act upon the same principles, responding to one another and to their environment so that the swarm converges on a best solution. This behavior is decentralized and self-organized. The Firefly Algorithm (FA) [2] is a swarm intelligence algorithm modeled after the flashing light emitted by fireflies. A firefly is most attracted to the most intense light it observes. The intensity one firefly perceives from another is inversely proportional to the distance between them and proportional to the brightness of the other firefly.
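The attraction rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's Spark implementation. It assumes the exponential attractiveness form beta0 * exp(-gamma * r^2) that is commonly used in Firefly Algorithm descriptions; the parameter names beta0, gamma, and alpha, the helper names, and the toy brightness function are all assumptions introduced here.

```python
import math
import random


def attractiveness(distance, beta0=1.0, gamma=1.0):
    """Attractiveness decays with distance (assumed exponential form)."""
    return beta0 * math.exp(-gamma * distance ** 2)


def move_toward(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.0, rng=None):
    """Move firefly i toward firefly j, plus optional random jitter scaled by alpha."""
    rng = rng or random
    beta = attractiveness(math.dist(x_i, x_j), beta0, gamma)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(x_i, x_j)]


def firefly_step(positions, brightness, **kwargs):
    """One sweep: each firefly moves toward every firefly brighter than itself.

    Simplification: brightness is evaluated once per sweep, not after each move.
    """
    light = [brightness(p) for p in positions]
    new_positions = [list(p) for p in positions]
    for i in range(len(positions)):
        for j in range(len(positions)):
            if light[j] > light[i]:
                new_positions[i] = move_toward(new_positions[i], positions[j], **kwargs)
    return new_positions


def toy_brightness(p):
    """Hypothetical objective: brighter the closer a firefly is to the origin."""
    return -sum(v * v for v in p)


swarm = [[2.0, 2.0], [0.5, 0.5], [-1.0, 1.0]]
print(firefly_step(swarm, toy_brightness, gamma=0.1))
```

With the jitter term switched off (alpha=0), the brightest firefly stays put and every other firefly takes a step toward the brighter ones, with closer neighbors pulling more strongly.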
Comparative Analysis of Feature Selection Methods with a Focus on Genetic Algorithm for Improved Predictive Modeling
REU student: Guillermo Munoz-Perez
Project description: Within computational science, handling high-dimensional data has become crucial to extracting meaningful insights. Feature selection methods are pivotal in machine learning for enhancing model performance by reducing dimensionality and eliminating irrelevant features. This report introduces several feature selection methods categorized as filter, embedded, and hybrid approaches: Chi-Square Test (CST), Ridge Regression (RR), and Genetic Algorithm (GA). Our research begins with an in-depth analysis of the Genetic Algorithm (GA), a technique inspired by the process of natural selection. GA employs a population of candidate solutions that evolve over generations to optimize a given objective function. We explore how these methods can be used to select optimal features in a classification dataset. Additionally, we compare the performance of GA with traditional feature selection methods such as Chi-Square Test and Ridge Regression. The report concludes with a comprehensive overview of our research findings and discusses future directions in the field of feature selection.
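To make the idea of a population of candidate solutions evolving over generations concrete, here is a minimal sketch of a genetic algorithm over feature masks. It is not the report's implementation: tournament selection, one-point crossover, bit-flip mutation, elitism, and every parameter value are assumptions chosen for illustration, and toy_fitness is a hypothetical stand-in for what would normally be a classifier's validation score on the selected features.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Evolve 0/1 feature masks toward higher fitness (requires n_features >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]

    def tournament():
        # Pick two distinct individuals at random and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        best = max(population, key=fitness)
        next_population = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(next_population) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: toggle each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)


INFORMATIVE = {0, 3, 5}  # hypothetical "useful" feature indices


def toy_fitness(mask):
    """Reward informative features and lightly penalize the subset size."""
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


best_mask = genetic_feature_selection(8, toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

Each individual is a bit mask in which a 1 means the feature is kept. The size penalty in the toy fitness mirrors the usual goal of feature selection, which is to keep predictive power while discarding irrelevant features.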
Evaluating Various Feature Selection Algorithms for Enhanced Predictive Modeling
REU student: Sricharan Kotala
Project description: The rise of machine learning in the late 1980s brought more formalized feature selection methods for preprocessing data in a number of ways. There are hundreds of different techniques, which can typically be classified into three categories: embedded, filter, and wrapper methods. Specifically, this paper tests ANOVA (filter), Elastic Net (embedded), and Binary Particle Swarm Optimization (PSO). Each has its own advantages and disadvantages in how it preprocesses data, and it is crucial to evaluate these pros and cons along with the use cases suited to the computational problem being solved. The best way to distinguish the methods and evaluate their performance is to run extensive tests on a large number of datasets of varying dimensionality. From those tests, each feature selection method can be evaluated and compared against the others.
Segmenting Lesions for Improved Care, Enhanced Management, and Exact Diagnosis
REU student: Chase Guenther
Project description: Multiple Sclerosis (MS) is a debilitating neuroinflammatory disease characterized by demyelination, or damage to the protective sheath surrounding nerve fibers in the brain and spinal cord. Magnetic Resonance Imaging (MRI) plays a vital role in diagnosing and monitoring MS by visualizing lesions, areas of abnormal signal intensity, within the brain. Accurate segmentation of these lesions is crucial for several reasons. It can improve diagnostic accuracy and efficiency: automated segmentation can reduce the time radiologists spend on manual segmentation, allowing them to focus on interpretation and diagnosis. It can increase objectivity and consistency: manual segmentation can be subjective and prone to variability between readers, whereas automated methods provide more consistent results. It can also quantify disease burden: accurate segmentation allows for measurement of lesion volume, a valuable biomarker for tracking disease progression and treatment response.
Classification Using Spark-Enabled Swarm Intelligence Algorithms
REU student: Angeles Marin Batana
Project description: Machine learning is a rapidly advancing discipline in computational science due to the increasing demand for the analysis and interpretation of large amounts of data from various sources. Consequently, many optimization algorithms have been proposed to enhance the performance and accuracy of machine learning models, particularly classification models. This study investigates the performance improvements of two Bat Algorithm implementations parallelized with Apache Spark: one distributes the data across partitions, and the other distributes the particles across partitions. The speedup and scaleup characteristics of these implementations were evaluated across different core configurations and data sizes. The results indicate diminishing returns with increasing processor counts and data sizes for both approaches, highlighting room for improvement through optimized strategies to achieve better scalability.
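The abstract does not define speedup and scaleup, so the sketch below uses definitions commonly applied to parallel systems, stated here as assumptions: speedup is the single-core runtime divided by the runtime on p cores for the same data, and scaleup is the runtime for a base data size on one core divided by the runtime for p times the data on p cores. The timings are made-up placeholders for illustration, not results from the study.

```python
def speedup(t_single, t_parallel):
    """Single-core runtime divided by parallel runtime on the same data."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, cores):
    """Speedup per core; 1.0 would be ideal linear speedup."""
    return speedup(t_single, t_parallel) / cores


def scaleup(t_base, t_scaled):
    """Runtime for base data on 1 core divided by runtime for p-times data on p cores."""
    return t_base / t_scaled


# Placeholder runtimes in seconds, keyed by core count (NOT measured data).
placeholder_runtimes = {1: 100.0, 2: 55.0, 4: 32.0, 8: 21.0}

for cores, runtime in placeholder_runtimes.items():
    s = speedup(placeholder_runtimes[1], runtime)
    e = efficiency(placeholder_runtimes[1], runtime, cores)
    print(f"{cores} cores: speedup={s:.2f}, efficiency={e:.2f}")
```

Diminishing returns appear as efficiency falling below 1.0 as cores are added, and as scaleup dropping below 1.0 when data size and core count grow together.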
Tracking Stratocumulus Cloud Breakup with Image Segmentation
REU student: Sam Spieker
Project description: Stratocumulus clouds play an important role in mitigating climate change. These low-level (1,000-6,500 ft) clouds are thick and cover large areas of the earth. They reflect solar radiation, which helps cool the earth’s surface. However, stratocumulus cloud cover is believed to be decreasing. Increasing greenhouse gas emissions trap more heat and break up stratocumulus cloud decks. This breakup reduces cloud coverage, which in turn promotes global warming. It is a positive feedback loop that furthers both global warming and stratocumulus cloud breakup. We currently have neither an accurate percentage of cloud cover nor an understanding of how that percentage has changed over time.
Using Explainable AI on a Cyberattack Classification Model
REU student: William Marsan
Project description: When creating ML models, developers measure performance with metrics such as accuracy, which tends to hide bias against minority classes, or F1 score, which many find unintuitive. Explainable AI (XAI) [1] addresses this problem by creating intuitive explanations of models that reveal the features a model uses to make decisions. Xplique is an XAI toolkit with attribution methods, attribution metrics, concept extraction capabilities, and feature visualization capabilities.