4th Annual Data Science Day 2025
April 8th, 2025 | 1:30PM - 7:00PM
Snell Hall 1108
Schedule Overview
1:30PM - 2:30PM
Presenter: Prasanna Balaprakash, Director of AI Programs and Distinguished R&D Staff Scientist, Oak Ridge National Laboratory. For more information, please click the following link: https://www.linkedin.com/in/prasannaprakash/.
Title: Overview of Oak Ridge National Laboratory’s AI Initiative: Advancing Secure, Trustworthy, and Energy-Efficient AI at Scale for Scientific Discovery
Abstract: We will present an overview of the Oak Ridge National Laboratory's Artificial Intelligence Initiative, which aims to advance the domains of science, energy, and national security. At the core of this initiative are two fundamental thrusts: transformative science applications and cross-cutting assurance. The application thrust focuses on developing AI methods to accelerate scientific discoveries, while the cross-cutting assurance thrust ensures that AI systems are secure, trustworthy, and energy-efficient. Secure approaches include alignment, privacy preservation, and robustness testing for AI models. Trustworthiness is achieved through validation and verification processes coupled with advanced techniques in uncertainty quantification and causal reasoning. Meanwhile, energy efficiency is prioritized by developing scalable solutions, integrating edge computing technologies, and adopting a holistic co-design approach that optimizes the synergy between software and hardware resources.
2:30PM - 6:00PM
For a more detailed overview of the General Session, please scroll below.
6:00PM - 7:00PM
Presenter: Jesse Spencer-Smith, Associate Dean for Partnerships and Innovation, Chief Data Scientist & Interim Director of the Data Science Institute, Vanderbilt University. For more information, please click the following link: https://www.linkedin.com/in/jesse-spencer-smith/.
Title: AI and the New Frontiers of Data Science
Abstract: AI offers the possibility of tackling problems that previously seemed unapproachable, while at the same time extending the reach and ability to make decisions and act. Which group of professionals is best suited to guide companies and organizations as these new possibilities open up? Data Scientists! Data Scientists have the critical skills and training to best design and assess AI projects. Data Scientists are also among the groups that can benefit most from the new capabilities afforded by these new technologies. We’ll explore the latest developments and how to make the best use of them.
Detailed General Session
Location: Snell Hall 1102
Presenters: Jeffrey Sumner, Data Scientist, Bath & Body Works, Western Kentucky University alumnus
Abstract: Join Jeffrey Sumner, Data Scientist at Bath & Body Works, as he shares his journey—from analyzing underwear sales and bringing data science to a natural gas pipeline, to unlocking the power of fragrance. Jeffrey will emphasize why deeply understanding the business problem and context is essential to delivering impactful data science solutions. Learn how clarity, asking the right questions, and translating data into meaningful insights can accelerate your career and enhance business outcomes.
Presenters: Jenna Wells, Student Researcher, Department of Analytics & Information Systems, Undergraduate Student in Data Science Grant Recipient, Western Kentucky University
Abstract: This research develops a predictive model to forecast college student dropout rates using RapidMiner™, with the goal of identifying at-risk students early to facilitate timely interventions. Using a Kaggle dataset that includes 4,424 student records and 36 academic, demographic, and socioeconomic features, the study applies machine learning techniques to classify students as graduates, dropouts, or currently enrolled. The model reveals critical factors influencing dropout rates, such as tuition fees, scholarship status, and second-semester grades. The Logistic Regression model, chosen based on accuracy and performance metrics, shows promising results with a precision of 82.64% and recall of 79.55% for predicting dropouts.
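For readers unfamiliar with the two metrics quoted in the abstract above, the following is a minimal sketch of how precision and recall are computed for a single class such as "dropout". The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data, and the study itself used RapidMiner rather than Python.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for one class from confusion-matrix counts."""
    # Precision: of all students flagged as dropouts, the share who really dropped out.
    precision = true_pos / (true_pos + false_pos)
    # Recall: of all students who really dropped out, the share the model flagged.
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts for the "dropout" class (illustration only).
precision, recall = precision_recall(true_pos=80, false_pos=20, false_neg=20)
print(f"precision={precision:.2%}, recall={recall:.2%}")
```

A model tuned for early intervention would usually favor recall, since a missed at-risk student tends to be costlier than an unnecessary check-in.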
Presenters: Michael Seavers, Graduate Student Researcher, The School of Engineering and Applied Sciences, Western Kentucky University
Abstract: Maximizing efficiency and accuracy is one of the primary goals in deep learning models; however, altering one typically impacts the other negatively. The present study aims to characterize the relationship between ALU-bound (GPU) and I/O-bound (CPU) operations during the training of convolutional neural networks on a CPU-GPU system. We experimented with varying batch sizes, data loader workers, persistent workers, prefetch factors, and pin memory settings on two different computers. The first computer has 32GB of CPU memory and 6GB of GPU memory, while the second has 64GB of CPU memory and 8GB of GPU memory. We employed profiling tools to measure the distribution between I/O and ALU instructions, identifying bottlenecks and inefficiencies in data loading versus computation. Our results demonstrate that a balanced utilization of CPU I/O and GPU ALU operations enhances model efficiency without compromising accuracy. The results also show that increasing data loader workers enhances GPU utilization, while excessively large batch sizes significantly extend training time. Building upon the insights gained from single GPU systems, we plan to address the unique challenges of ALU and I/O instruction optimization in multi-GPU distributed training scenarios.
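The settings named in the abstract above match keyword arguments of PyTorch's `torch.utils.data.DataLoader` (`batch_size`, `num_workers`, `persistent_workers`, `prefetch_factor`, `pin_memory`). Assuming PyTorch was the framework, which the abstract does not state, a sweep over those settings might be enumerated as in the sketch below. The value grid and the helper names `valid` and `configs` are hypothetical.

```python
from itertools import product

# Hypothetical value grid; the abstract does not list the values that were tested.
GRID = {
    "batch_size": [32, 128],
    "num_workers": [0, 4],
    "persistent_workers": [False, True],
    "prefetch_factor": [None, 2],
    "pin_memory": [False, True],
}


def valid(cfg: dict) -> bool:
    """Worker-only options apply only when worker processes exist."""
    if cfg["num_workers"] == 0:
        # With no workers, loading happens in the main process, so
        # persistent workers and prefetching must stay switched off.
        return not cfg["persistent_workers"] and cfg["prefetch_factor"] is None
    return True


def configs(grid: dict):
    """Yield every valid combination of settings as a keyword dictionary."""
    keys = list(grid)
    for values in product(*grid.values()):
        cfg = dict(zip(keys, values))
        if valid(cfg):
            yield cfg


all_configs = list(configs(GRID))
print(len(all_configs), "configurations to benchmark")
```

Each dictionary could then be passed to a loader as `DataLoader(dataset, **cfg)` and timed under a profiler, one run per configuration.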
Presenters: Dr. Belinda J. Petri, Post-doctoral Fellow, Kentucky IDeA Networks of Biomedical Research Excellence Bioinformatics Core, Department of Biochemistry and Molecular Genetics, University of Louisville School of Medicine
Abstract: RNA modifications including N6-methyladenosine (m6A) play crucial roles in the post-transcriptional regulation of gene expression and have been implicated in cancer progression and liver disease. We hypothesized that m6A epitranscriptomic alterations are associated with pathways in endocrine therapy (ET) resistance and in Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD). Direct-RNA sequencing technology (nanopore) was used to detect and map m6A modifications at single-nucleotide resolution to comprehensively profile m6A modifications in ET-resistant LCC9 and ET-sensitive MCF-7 breast cancer cell lines and in AML12 mouse hepatocyte cells treated with or without polychlorinated biphenyls (PCBs). We incorporated statistical analysis by integrating m6Anet, an existing machine-learning algorithm designed to call m6A modified bases, with a generalized linear model following a binomial distribution analysis to identify significant differential m6A modification ratios (DMR). We identified 61 transcripts with DMR between vehicle-treated LCC9 compared to MCF-7 cells, including many genes with multiple m6A modifications. Additionally, in our mouse hepatocyte model, we found m6A modifications at sites in the Apob transcript commensurate with a previous m6ARIP-seq study in the livers of mice exposed to PCBs. Our findings reveal distinct m6A modification patterns in ET-resistant LCC9 breast cancer cells and PCB-exposed mouse hepatocytes, suggesting a potential role for epitranscriptomic alterations in the development of ET resistance and toxicant-induced liver disease.
Presenters: Samir Pahari, Graduate Student Researcher, Department of Earth, Environmental, and Atmospheric Sciences, Western Kentucky University
Abstract: Flooding is among the most devastating natural disasters, frequently disrupting lives and livelihoods. In Eastern Kentucky, located within the Appalachian Mountains, flooding risk is heightened by steep terrain and narrow valleys. In recent decades, increased settlement in low-lying riverine areas has heightened flood risks due to inadequate disaster management and a lack of sustainable planning. This study evaluates flood susceptibility in the North Fork Kentucky sub-basin by combining deep learning (U-net architecture) with geomorphic data from remote sensing and LiDAR-based DEMs. Sentinel-2-based LULC and NDVI, soil, and rainfall data from PRISM were analyzed to generate a detailed flood susceptibility map. Using ArcGIS Pro, 11 raster layers were created, encompassing slope, aspect, elevation, river proximity, topographic wetness index, stream power index, curvature, LULC, soil type, rainfall, and NDVI. Flooded and non-flooded areas were identified using flood inventory mapping through Google Earth Engine and historical flood records. The U-net deep learning model performed classification to distinguish flooded from non-flooded areas, demonstrating its effectiveness in automated image analysis. The results highlight the model’s superiority in mapping flood susceptibility and identifying vulnerable areas. This study aims to mitigate the problem of data scarcity in flood studies and support improved disaster preparedness planning in flood-prone areas of Eastern Kentucky.
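Three of the raster layers listed in the abstract above are derived indices. The sketch below shows their common textbook definitions for a single cell: NDVI from near-infrared and red reflectance, and the topographic wetness index (TWI) and stream power index (SPI) from specific catchment area and slope. The study built its layers in ArcGIS Pro, which may use different variants, so the function names, the zero-denominator handling, and the sample values here are assumptions for illustration.

```python
import math


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    # Returning 0.0 for an empty cell is a choice made for this sketch.
    return 0.0 if total == 0 else (nir - red) / total


def twi(catchment_area: float, slope_rad: float) -> float:
    """Topographic wetness index: ln(a / tan(beta)), for slope beta > 0."""
    return math.log(catchment_area / math.tan(slope_rad))


def spi(catchment_area: float, slope_rad: float) -> float:
    """Stream power index: a * tan(beta)."""
    return catchment_area * math.tan(slope_rad)


# Hypothetical cell: reflectances 0.5 (NIR) and 0.1 (red),
# specific catchment area 100, slope of 45 degrees.
print(round(ndvi(0.5, 0.1), 3))
print(round(twi(100.0, math.radians(45)), 3))
print(round(spi(100.0, math.radians(45)), 3))
```

Flat, high-accumulation cells give a large TWI, while steep, high-accumulation cells give a large SPI, which is why both are common inputs to flood susceptibility models.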
Presenters: Victor Aham, Graduate Student Researcher, Department of Earth, Environmental, and Atmospheric Sciences, Western Kentucky University
Abstract: Proper delineation of geobodies from 3D seismic data is crucial for subsurface investigations and modeling. Traditionally, experts manually analyze subsurface structures in seismic images by utilizing different seismic attributes. This method, which is both labor-intensive and subjective, poses significant challenges when it comes to defining salt bodies. Advances in machine and deep learning enable automatic geobody delineation, improving accuracy and efficiency. This study examines deep learning for geobody delineation in the Northern Gulf of Mexico Basin with complex salt formations.
Presenters: Dr. Alex Lebedinsky, Associate Dean, Gordon Ford College of Business, Western Kentucky University
Abstract: We will discuss the new interdisciplinary data science programs at WKU followed by a Q&A session for any questions that students might have.
Presenters: Kelly Miller, Student Researcher, The School of Engineering and Applied Sciences, Undergraduate Student in Data Science Grant Recipient, Western Kentucky University; Sarah Thompson, Student Researcher, The Department of Art & Design, Western Kentucky University
Abstract: The TotSpot AR Game project aims to develop an interactive augmented reality (AR) game for toddlers to enhance their cognitive skills through object identification. By overlaying visuals on the real world via a projection onto a car window, the game encourages players to touch the window to identify passing objects outside by highlighting them with colorful projected borders. Its simple user interface of audio and visual cues caters to a young audience to provide an engaging learning experience during car trips. The physical hardware to run the TotSpot will be installed within a 3D-printed container designed to mount to the back of a car seat, allowing guardians to set up the TotSpot easily in their car. The game itself is developed using Unity. Its object detection features are implemented using a plugin that integrates the Open Computer Vision Library with the game engine. The audio was created using Audacity. By integrating AR via object highlighting into the identification game, the TotSpot provides a layer of immersion to the experience. This project strives to convert the often monotonous environment of a car to something fun and interactive for children, connecting them to the outside world while also being educational.
Presenters: Vedant Garg, Student Researcher, The Gatton Academy of Mathematics and Science
Abstract: Identifying risk factors for COVID-19 lethality is crucial for ensuring timely and personalized treatment. In this study, we developed two COVID-19 prediction models based on patient symptoms to forecast severity and diagnosis using a publicly available Kaggle dataset on COVID-19 cases in Mexico (April 14, 2020). We employed the SHapley Additive exPlanations (SHAP) feature selection method to remove four less important features. Additionally, we applied MinMaxScaler to normalize the ‘age’ and ‘medical unit’ features to a range of 0 to 1. The dataset was divided into two case studies: infection status (Case Study 1) and severity (Case Study 2). We used Random Under-Sampling (RUS) for Case Study 1 and a combination of SMOTE and RUS for Case Study 2. Our Random Forest (RF) model was then trained and evaluated using accuracy metrics, achieving 65% accuracy in predicting Case Study 1 and 94% in predicting Case Study 2. SHAP analysis further identified ‘patient type’, ‘medical unit’, ‘age’, ‘pneumonia’, and ‘obesity’ as key COVID-19 severity and infection status predictors.
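The normalization step mentioned in the abstract above rescales a feature so that its smallest value maps to 0 and its largest to 1. The study used scikit-learn's `MinMaxScaler`; the sketch below reproduces only the underlying formula in plain Python so that it runs without dependencies. The function name `min_max_scale` and the sample ages are hypothetical.

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; mapping it to 0.0 is a
        # choice made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical ages (not taken from the dataset).
ages = [20, 35, 50, 80]
scaled = min_max_scale(ages)
print(scaled)
```

In practice the minimum and maximum should be taken from the training split only and then reused on the test split, so that no information leaks from the evaluation data.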
Presenters: Peter Agaba, Senior Analyst, Mass General Brigham, Western Kentucky University Alumnus
Abstract: Hypertension, a major risk factor for cardiovascular disease, often remains undiagnosed due to its asymptomatic nature. In the United States, over 120 million adults aged 18–85 live with hypertension, representing more than 48% of the adult population. This condition substantially increases the risk of heart disease—the leading cause of death nationwide. Effective management is critical to reducing complications and improving patient outcomes. Machine learning (ML) offers powerful tools for early and accurate diagnosis by uncovering complex patterns in clinical data. By classifying patients into high-risk and low-risk groups, ML models can support personalized treatment strategies, enhance blood pressure control, and advance precision medicine. Additionally, the proposed system incorporates remote monitoring to reduce hospital readmissions and optimize healthcare resource utilization. This approach has the potential to transform hypertension care, improve health outcomes, and reduce overall healthcare costs.
Presenters: Mahamad Sayab Miya, Graduate Student Researcher, Department of Biology, Western Kentucky University
Abstract: The many unique species found in caves exhibit unusual morphological adaptations such as eye loss and restricted dispersal abilities. Despite their ecological significance in these habitats, studies on the phylogeography of cave beetles in the United States remain limited. This research addresses this issue by focusing on the Kentucky endemic cave beetle, Neaphaenops tellkampfii, a polytypic species hypothesized to comprise four subspecies. Based on allozyme data, previous studies have noted high genetic diversity within and high similarity among local populations of N. tellkampfii, suggesting complex evolutionary dynamics. Building upon existing knowledge, this study aims to elucidate the evolutionary history and diversification patterns of Neaphaenops while reassessing the validity of the proposed subspecies. From one to four individuals per cave were collected from 64 caves, including some newly discovered populations. DNA was extracted, and the CO1 gene was amplified to hypothesize their evolutionary history. Trees were built using Bayesian, maximum likelihood, and parsimony analyses. Based on these analyses, the species showed two separate populations with less than 3% genetic difference. The Barren River was found to be the physical barrier that separates these populations. These results may have broad application to the speciation of many other cave organisms in this important karst region. Further, the presence of gene flow among populations of Neaphaenops indicates previously unrecognized intercave links, helpful to both geologists and subterranean biologists in the identification of cave systems critical for the conservation of unique ecosystems.
Presenters: Kahlil Garmon, Founder of Moneybot
Abstract: In this talk, I will discuss how I got started on Moneybot and the role data analytics plays in how we think about the future of education technology. I will also talk about how the Innovation Campus and the Regional Technology Council played a large role in helping me start and grow my own business. Finally, I will share how students can get involved with the Innovation Campus and the Regional Technology Council.
Presenters: Andrew Toussaint, Graduate Student Researcher, The School of Engineering and Applied Sciences, Western Kentucky University
Abstract: Cognitive diagnosis is the task of determining a student's proficiency in fine-grained areas of knowledge based on their responses to an assessment. Recent advances in machine learning technology have facilitated new techniques to perform cognitive diagnosis quickly and accurately on large data sets. Three existing cognitive diagnosis models, as well as a novel reduction model, were trained on student response data from four sources and evaluated to characterize the performance of the models in five different scenarios. Training scenarios included a baseline, a sampled scenario that included 60% fewer training examples, and three under-sampling scenarios where class ratios were adjusted to be equivalent, favor correct responses, and favor incorrect responses, respectively. Results from these experiments reaffirm the increased performance of neural-network-based models at the cost of longer training time. Among the neural network models, those that used labeled knowledge concepts were dependent on the granularity and accuracy of the labeling for good performance. Additionally, sampling was shown to be an effective way to reduce training time with minimal losses in performance.
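The under-sampling scenarios described in the abstract above adjust the ratio of correct to incorrect responses by discarding examples from one class. The sketch below shows one way this could be done. The function name `undersample`, the `(response_id, is_correct)` tuple layout, and the target shares are hypothetical, since the abstract does not give the ratios or the sampling procedure that were used.

```python
import random


def undersample(examples, correct_share: float, seed: int = 0):
    """Drop examples from one class so that `correct_share` of the rest are correct.

    `examples` is a list of (response_id, is_correct) tuples.
    """
    if not 0 < correct_share < 1:
        raise ValueError("correct_share must be strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    correct = [e for e in examples if e[1]]
    incorrect = [e for e in examples if not e[1]]

    # Correct responses needed if every incorrect response is kept.
    want_correct = len(incorrect) * correct_share / (1 - correct_share)
    if want_correct <= len(correct):
        keep_correct, keep_incorrect = round(want_correct), len(incorrect)
    else:
        # Too few correct responses: keep them all and trim the incorrect ones.
        keep_correct = len(correct)
        keep_incorrect = round(len(correct) * (1 - correct_share) / correct_share)

    return rng.sample(correct, keep_correct) + rng.sample(incorrect, keep_incorrect)


# Hypothetical data: 70 correct and 30 incorrect responses.
data = [(i, i < 70) for i in range(100)]
balanced = undersample(data, correct_share=0.5)
favor_incorrect = undersample(data, correct_share=0.25)
print(len(balanced), len(favor_incorrect))
```

Because under-sampling never adds examples, every scenario also shrinks the training set, so its effect is partly confounded with the effect of training on less data.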
Presenters: Cole Meador, Student Researcher, Department of Biology, Undergraduate Student in Data Science Grant Recipient, Western Kentucky University
Abstract: In mammals, spinal cord injuries generally result in permanent neurological deficits. In contrast, larval sea lampreys (Petromyzon marinus) demonstrate remarkable regenerative capacity after a spinal cord injury. To elucidate the molecular underpinnings of this recovery, we used spatial transcriptomics to identify trends in gene expression before and after injury in the lamprey spinal cord. Our analysis compared transcriptional profiles in rostral and caudal tissues relative to the transection site at 1 and 3 weeks post-injury against uninjured controls. Significant genes were ranked by the product of significance and the sign of log2 fold change for functional enrichment analysis using fgsea, a popular R package that tests ranked expression data for enrichment against gene sets. Using standard gene ontology (GO) annotations as a functional information source, differences across three nested levels of biological processes were found in all groups. Our analysis highlights key biological functions active during the regeneration process that could become potential targets for future therapeutics aimed at treating spinal cord injuries in humans.
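The ranking described in the abstract above, the product of significance and the sign of the log2 fold change, can be sketched as follows. This sketch assumes that "significance" means -log10 of the p-value, a common convention that the abstract does not confirm. The gene names and values are hypothetical, and the study performed the enrichment step in R with fgsea, not in Python.

```python
import math


def rank_score(p_value: float, log2_fold_change: float) -> float:
    """Signed significance: -log10(p) times the sign of the log2 fold change."""
    # sign is +1 for up-regulated, -1 for down-regulated, 0 for no change.
    sign = (log2_fold_change > 0) - (log2_fold_change < 0)
    return -math.log10(p_value) * sign


# Hypothetical differential-expression results: gene -> (p-value, log2 fold change).
genes = {
    "geneA": (0.001, 2.0),
    "geneB": (0.01, -1.5),
    "geneC": (0.5, 0.3),
}

# Strongly up-regulated genes rank first and strongly down-regulated genes last.
ranked = sorted(genes, key=lambda g: rank_score(*genes[g]), reverse=True)
print(ranked)
```

A ranked list of this kind is the usual input to pre-ranked gene set enrichment, where gene sets concentrated at either end of the list score as enriched.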