Intern Notes

This write-up outlines the intern’s work on a project analyzing maternal morbidity in Indiana. The intern will take part in several stages of data analysis and machine learning development, contributing to ongoing efforts to better understand maternal and child health outcomes over the past decade.

Project Overview:

The intern will collaborate on analyzing a dataset covering maternal morbidity in Indiana, which spans from 2010 to 2019. The dataset contains approximately 820,000 records, including information on mothers and their children. This dataset has already been explored in part through a working paper that can be accessed here: Exploratory Data Analysis for Maternal and Child Health in Indiana: A Decadal Perspective (2010-2020). The intern will contribute by assisting with both data analysis and machine learning model development.

Responsibilities and Work Plan:

  1. Data Exploration and Analysis: The intern will begin by exploring the maternal morbidity dataset further to identify patterns and meaningful relationships. This involves understanding the data’s structure, quantifying missing values, and identifying correlations between variables that may influence maternal and child health outcomes (a short exploration sketch follows this list).
  2. Scaling a Kubernetes Cluster for the Machine Learning Pipeline: A key component of this project will be scaling a Kubernetes cluster to accommodate the machine learning pipeline. The intern will assist in configuring a multi-node cluster, with one master node and multiple worker nodes, which will run Apache Spark jobs and distribute processing of the large dataset in parallel. Specifically, the cluster will include:
  • Master Node: Responsible for coordinating tasks, managing cluster state, and scheduling work onto the worker nodes.
  • Worker Nodes: These nodes will execute the Spark tasks assigned by the master node, allowing for distributed data processing.
  The intern will be involved in setting up Helm charts for deployment, configuring CPU and memory resource limits, and ensuring that the Spark applications scale with the workload. The Spark cluster will run several regression analyses, such as linear and logistic regressions, to identify significant relationships within the dataset. The intern will also optimize these jobs to minimize run time and resource usage, using techniques such as caching intermediate results and tuning Spark configurations (a configuration sketch follows this list).
  3. Developing an API for Data Access and Analysis: The intern will assist in developing an API for streamlined data access and analysis (an endpoint sketch follows this list). The API will include:
  • SQL Endpoint: This endpoint will enable users to query the dataset directly using SQL-like commands, facilitating efficient extraction and aggregation of data.
  • Python Endpoint: This endpoint will provide a programmatic interface to perform data transformations and run machine learning models using Python. It will integrate with the machine learning pipeline to support rapid experimentation and deployment of analytical models.
  4. Developing Predictive Models: After deriving insights from the dataset, the intern will contribute to building predictive models to forecast maternal health outcomes, such as risk factors for complications during childbirth. The intern will work on model training using GPU-enabled nodes in the Kubernetes cluster to accelerate the process. The models will be validated on a test set held out from the dataset and optimized using hyperparameter tuning techniques (a training-and-tuning sketch follows this list). Once trained, the models will be packaged in the GGUF model file format so that they can be used for inferencing. The goal is to publish these lightweight models as open source for broader research purposes, with minimal fine-tuning required for deployment in new environments.
  5. Open Source Contributions and Documentation: The intern will assist in documenting the machine learning pipeline, including the setup of Kubernetes, Apache Spark configurations, API development, preprocessing steps, and model development processes. This documentation will ensure the reproducibility of results and share valuable knowledge with the broader research community. The final models and the supporting codebase will be published as part of an open-source initiative, making them accessible for further development, fine-tuning, and application by other researchers.
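To make step 1 concrete, here is a minimal PySpark exploration sketch. The file path and the column names (maternal_age, birth_weight) are hypothetical placeholders, since the dataset’s actual schema is not described in these notes.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mmh-exploration").getOrCreate()

# Load the dataset; the path and column names below are placeholders.
df = spark.read.csv("data/indiana_maternal_morbidity.csv",
                    header=True, inferSchema=True)

# Structure: schema and row count (roughly 820,000 records expected).
df.printSchema()
print("rows:", df.count())

# Count missing values per column.
df.select([F.sum(F.col(c).isNull().cast("int")).alias(c)
           for c in df.columns]).show()

# Pairwise correlation between two hypothetical numeric columns.
print("corr:", df.stat.corr("maternal_age", "birth_weight"))
```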
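For step 2, the sketch below shows one way a PySpark session could be pointed at the Kubernetes cluster with per-executor CPU and memory limits, plus caching of an intermediate result as one of the optimization techniques mentioned above. The master URL, container image, resource values, and data path are all assumptions, not the project’s actual configuration.

```python
from pyspark.sql import SparkSession

# Spark-on-Kubernetes session; master URL and image are placeholders
# for the cluster the intern will help configure.
spark = (
    SparkSession.builder
    .appName("mmh-regressions")
    .master("k8s://https://kubernetes.default.svc:443")  # hypothetical API server
    .config("spark.kubernetes.container.image", "example/spark-mmh:latest")
    .config("spark.executor.instances", "4")    # worker pods
    .config("spark.executor.cores", "2")        # CPU limit per executor
    .config("spark.executor.memory", "4g")      # memory limit per executor
    .config("spark.sql.shuffle.partitions", "64")  # one tunable among many
    .getOrCreate()
)

df = spark.read.parquet("data/maternal_morbidity.parquet")  # placeholder path

# Cache the cleaned intermediate result so the repeated regression
# passes do not recompute the preprocessing each time.
cleaned = df.dropna(subset=["maternal_age", "complication"]).cache()
cleaned.count()  # materialize the cache
```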
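For the API in step 3, here is a minimal sketch of the two endpoints, assuming FastAPI as the web framework (the project may choose another) with Spark SQL behind the SQL endpoint. Endpoint names, request shapes, and the view name are hypothetical.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from pyspark.sql import SparkSession

app = FastAPI()
spark = SparkSession.builder.appName("mmh-api").getOrCreate()

# Register the dataset as a temporary view for the SQL endpoint.
spark.read.parquet("data/maternal_morbidity.parquet") \
     .createOrReplaceTempView("maternal_morbidity")

class SQLQuery(BaseModel):
    query: str  # e.g. "SELECT county, COUNT(*) FROM maternal_morbidity GROUP BY county"

@app.post("/sql")
def run_sql(body: SQLQuery):
    # SQL endpoint: run the query against the registered view.
    # No validation is shown here; a real API would restrict queries.
    rows = spark.sql(body.query).limit(1000).collect()
    return {"rows": [r.asDict() for r in rows]}

class TransformRequest(BaseModel):
    column: str
    aggregate: str  # e.g. "avg", "max"

@app.post("/python")
def run_transform(body: TransformRequest):
    # Python endpoint: a programmatic transformation; a real version
    # would expose richer hooks into the ML pipeline.
    df = spark.table("maternal_morbidity")
    value = df.agg({body.column: body.aggregate}).first()[0]
    return {body.aggregate: value}
```

As a usage example, a POST to /sql with the body {"query": "SELECT COUNT(*) FROM maternal_morbidity"} would return the aggregate as JSON.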
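For step 4, here is a sketch of training and tuning a logistic regression with a held-out test set and a small hyperparameter grid, using Spark MLlib. The feature and label column names are hypothetical, and GPU scheduling (e.g., Kubernetes node selectors for the GPU-enabled nodes) is omitted for brevity.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("mmh-train").getOrCreate()

# Hypothetical feature columns and a 0/1 outcome column "complication".
features = ["maternal_age", "prenatal_visits", "prior_conditions"]
df = (spark.read.parquet("data/maternal_morbidity.parquet")  # placeholder path
      .dropna(subset=features + ["complication"]))

data = (VectorAssembler(inputCols=features, outputCol="features")
        .transform(df)
        .withColumnRenamed("complication", "label"))

# Hold out a test set for validation.
train, test = data.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

# 3-fold cross-validation over the small hyperparameter grid.
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)
model = cv.fit(train)

# Evaluate the best model on the held-out test set.
auc = BinaryClassificationEvaluator().evaluate(model.transform(test))
print("test AUC:", auc)
```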

Final Output:

The final output of this internship will be a finalized draft of the working paper, together with its predictive models and the complete codebase. This output will showcase the research, model development, and technical contributions made during the project.