Author: Dr. Cary Woods
Affiliation: Managing Partner, HarnessAI (Public Health Informatics & Predictive Analytics)
Contact: cary@harnessai.net
Date: May 2025
1. Abstract
Maternal morbidity remains a critical public health challenge in the United States, particularly among vulnerable and underserved populations. This study analyzes a decade of Indiana birth records (2010–2020), comprising over 813,000 records, to explore the predictors of maternal health complications. Exploratory data analysis (EDA) identified several notable associations, including a strong correlation between general smoking behavior and smoking during pregnancy (r ≈ 0.86) and lower prenatal care attendance among rural mothers than among their urban peers. Using Logistic Regression, Random Forest, and Gradient Boosting classifiers, we evaluated predictive models of birth complications, with Gradient Boosting achieving the best result on every reported metric (Precision, Recall, F1-Score, and ROC-AUC). Our results suggest that smoking during pregnancy, inadequate prenatal care, and rural residency are primary contributors to elevated maternal morbidity risk. However, the absence of critical social determinants in the dataset, such as income, education, and race/ethnicity, limited the predictive ceiling of these models. We conclude that improvements in maternal health outcomes will depend not only on better interventions but also on enhanced collection of key socio-demographic health data.
2. Introduction
2.1 Social Considerations
Maternal morbidity represents a critical, yet persistently under-addressed, dimension of reproductive and public health in the United States. Despite global progress in reducing maternal mortality, the United States has witnessed a troubling resurgence in maternal complications over the past two decades, with significant variation by race, geography, and access to care. These complications—ranging from hemorrhage and hypertensive disorders to postpartum infection—serve as sentinel indicators of broader systemic inequities in healthcare delivery and access.
The burden of maternal morbidity disproportionately affects women in rural areas, low-income populations, and communities of color. Regional patterns across the Midwest and similar areas suggest structural barriers tied to provider shortages, socioeconomic disparities, and inconsistent access to early prenatal care.
2.2 Purpose of Research
This study evaluates the predictive capacity of machine learning classifiers in identifying maternal morbidity using a decade of Indiana birth records. The central objective is to assess whether meaningful prediction of maternal complications can be achieved using available administrative features—most notably, smoking behavior, maternal age, parity, marital status, and early prenatal care indicators.
2.3 Literature Review
Prior work in maternal health modeling often relies on hospital data or EHRs that contain detailed clinical histories. In contrast, this study employs raw, real-world administrative data typically collected for surveillance purposes. It thus occupies a unique space in the literature as a pragmatic analysis of what can and cannot be achieved in predictive maternal health analytics using standard state-collected records.
3. Methods
3.1 Data Sources
We analyzed Indiana’s publicly available birth and infant death records from 2010 to 2020, comprising 813,837 records.
Data Availability: The original birth and infant death dataset (2010–2020) is publicly available from the Indiana Department of Health: https://www.in.gov/health/vital-records/birth-and-death-data/
3.2 Data Preprocessing
- Removed high-cardinality IDs (e.g., MOTHER_ID)
- Imputed missing categories with “Unknown”
- Median imputation for numerical gaps (i.e., filling missing numeric values with the median of the column to minimize the influence of outliers and maintain the stability of subsequent model training)
- One-hot encoding with drop-first to reduce multicollinearity (i.e., converting categorical variables into binary indicators while dropping the first category to prevent linear dependence among encoded features, which can distort coefficient estimates in models like Logistic Regression)
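The four preprocessing steps above can be sketched in pandas as follows. This is a minimal illustration, not the study's actual script: the function name `preprocess`, the inclusion of `CHILD_ID` among the dropped identifiers, and the toy frame are assumptions made for the example.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, id_cols=("MOTHER_ID", "CHILD_ID")) -> pd.DataFrame:
    """Hypothetical helper applying the four listed steps to a raw frame."""
    # 1. Drop high-cardinality identifiers (CHILD_ID is an assumed addition).
    out = df.drop(columns=[c for c in id_cols if c in df.columns])

    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.columns.difference(num_cols)

    # 2. Missing categorical values become an explicit "Unknown" level.
    out[cat_cols] = out[cat_cols].fillna("Unknown")

    # 3. Missing numeric values are filled with the column median.
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # 4. One-hot encode, dropping the first level of each categorical column.
    return pd.get_dummies(out, columns=list(cat_cols), drop_first=True)


# Toy data for illustration only; not drawn from the Indiana records.
raw = pd.DataFrame({
    "MOTHER_ID": [1, 2, 3, 4],
    "NUM_BIRTHS_BY_MOTHER": [1, None, 3, 2],
    "SMOKING_IND": ["Yes", "No", None, "No"],
})
clean = preprocess(raw)
print(clean)
```

With `drop_first=True`, the alphabetically first level of each categorical column (here `No`) becomes the implicit reference category.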
3.3 Feature Selection
Key predictors included maternal age group, marital status, county type, smoking behavior, number of prior births, and timing of prenatal care.
3.3a Exploratory Data Analysis
Prior to model development, exploratory data analysis (EDA) was conducted to better understand feature distributions and identify preliminary patterns associated with maternal morbidity.
Several insights emerged:
- The majority of mothers fell within the “25–34 years” age group, but morbidity rates were elevated among mothers aged 35 and older.
- Smoking behavior showed a strong association with adverse outcomes; mothers who smoked before or during pregnancy had higher rates of complications.
- Receipt of prenatal care during the first trimester was associated with lower observed morbidity rates compared to those without early care.
- Rural residency was associated with both lower prenatal care attendance rates and slightly elevated morbidity risk.
- Correlation matrices revealed a strong correlation (r ≈ 0.86) between general smoking status and smoking during pregnancy, indicating behavioral continuity that carries implications for intervention strategies.
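For readers who want to reproduce a correlation of this kind, the sketch below computes a Pearson correlation between two Yes/No indicators (equivalent to the phi coefficient for binary variables). How "Unknown" values were handled in the study is not stated; excluding them, as done here, is an assumption, and the helper name and toy values are illustrative only.

```python
import pandas as pd


def indicator_corr(df: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Pearson correlation between two Yes/No indicator columns."""
    # Assumption: "Unknown" (and any other label) maps to NaN and is excluded.
    mapping = {"Yes": 1, "No": 0}
    a = df[col_a].map(mapping)
    b = df[col_b].map(mapping)
    # Series.corr drops pairs in which either value is missing.
    return a.corr(b)


# Toy data for illustration only.
toy = pd.DataFrame({
    "SMOKING_IND": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Unknown"],
    "SMOKING_DURING_PREG_IND": ["Yes", "Yes", "No", "No", "No", "No", "No", "No"],
})
r = indicator_corr(toy, "SMOKING_IND", "SMOKING_DURING_PREG_IND")
print(f"r = {r:.2f}")
```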
Several additional insights were uncovered during expanded exploratory data analysis:
- Maternal morbidity rates exhibited a slight upward trend across the 2010–2020 period.
- Morbidity rates were broadly similar across urban and rural/mixed county types, consistent with the only slight rural elevation noted above.
- The likelihood of maternal complications increased sharply for mothers aged 45–54.
- A weak negative correlation was observed between the number of prior births and complication risk, indicating slightly higher risk among first-time mothers.
- Infants born to mothers with complications had higher observed rates of low birth weight and transfer to specialized care.
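Group-level comparisons like those listed above reduce to a complication rate per category. A minimal sketch, assuming the outcome is stored as "Yes"/"No" as in the data dictionary; the helper name and the age-group labels in the toy frame are hypothetical.

```python
import pandas as pd


def complication_rate_by(df: pd.DataFrame, group_col: str,
                         outcome_col: str = "BIRTH_COMPLICATIONS") -> pd.Series:
    """Share of births with a recorded complication within each group."""
    is_complication = df[outcome_col].eq("Yes")
    # The mean of a boolean series is the proportion of True values.
    return is_complication.groupby(df[group_col]).mean().sort_values(ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "MOTHER_AGE_GRP": ["25-34 years"] * 4 + ["35-44 years"] * 2,
    "BIRTH_COMPLICATIONS": ["No", "No", "No", "Yes", "Yes", "No"],
})
rates = complication_rate_by(toy, "MOTHER_AGE_GRP")
print(rates)
```

The same helper can be pointed at `MOTHER_RESID_COUNTY_TYPE` or `CHILD_BIRTH_YR_GRP` to compare county types or to inspect the trend over time.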
These findings provided early validation of feature selection choices and informed the modeling strategy that followed.
The outcome variable, BIRTH_COMPLICATIONS, was treated as a binary indicator (Yes/No) identifying the presence or absence of any complication during birth, to support binary classification modeling.
3.4 Model Development
Three classifiers were developed and trained on an 80/20 stratified split. This standard split balances two priorities: it allocates 80% of the data to train the models with sufficient statistical power while preserving 20% as a holdout test set for unbiased evaluation of performance. Stratification ensures that the proportion of complication and non-complication cases is maintained across both training and testing subsets—an essential step when working with imbalanced outcomes like maternal morbidity. The models used include:
- Logistic Regression
- Random Forest
- Gradient Boosting
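The 80/20 stratified split described above can be sketched with scikit-learn's `train_test_split`. The synthetic data, the 2% positive rate, and the `random_state` value are assumptions chosen for illustration, not details from the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and a rare binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:20] = 1  # 2% positives (illustrative rate)

# stratify=y keeps the positive rate the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

print("train positive rate:", y_train.mean())
print("test positive rate: ", y_test.mean())
```

Without `stratify`, a rare outcome can end up over- or under-represented in the holdout set purely by chance, which makes Recall estimates unstable.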
Balanced class weights were applied to account for the low incidence of complications. In datasets where the outcome of interest (e.g., maternal morbidity) occurs relatively infrequently, models can become biased toward predicting the majority class (no complication). Applying balanced class weights adjusts the penalty associated with misclassifying the minority class, ensuring that the model pays proportionate attention to both complication and non-complication cases during training. This approach helps improve recall for the minority outcome, which is critical in public health contexts where missing high-risk cases can have serious consequences.
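One way to apply balanced weights to the three model families is sketched below. It assumes scikit-learn estimators throughout; in particular, `GradientBoostingClassifier` is swapped in here for the boosting model and, because it has no `class_weight` parameter, receives per-sample weights at fit time. The data and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Synthetic data with a rare positive class (illustrative only).
rng = np.random.default_rng(0)
y = np.zeros(1000, dtype=int)
y[:20] = 1
X = rng.normal(size=(1000, 5)) + y[:, None]  # shift positives to add signal

# "balanced" weight for class c = n_samples / (n_classes * count(c)).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

models = {
    "Logistic Regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "Random Forest": RandomForestClassifier(
        class_weight="balanced", n_estimators=50, random_state=0
    ),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

for name, model in models.items():
    if name == "Gradient Boosting":
        # No class_weight argument: pass equivalent per-sample weights instead.
        model.fit(X, y, sample_weight=compute_sample_weight("balanced", y))
    else:
        model.fit(X, y)
```

With 20 positives in 1,000 rows, each positive case is weighted 25.0 and each negative case about 0.51, so both classes contribute equally to the training loss in aggregate.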
3.5 Evaluation Metrics
Models were assessed by Precision, Recall, F1-Score, and ROC-AUC. Precision measures how many of the cases predicted as complications were truly complications. Recall measures how many of the actual complication cases were correctly identified, making it particularly important in healthcare contexts where missing a high-risk case could lead to serious outcomes. The F1-Score provides a balanced measure combining both Precision and Recall, useful when trade-offs between false positives and false negatives matter. ROC-AUC offers a broader view of model discrimination ability across classification thresholds. Emphasis was placed on Recall because, in maternal morbidity prediction, the cost of failing to identify a high-risk individual (false negative) is typically higher than the cost of a false alarm (false positive).
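The four metrics can be computed with scikit-learn as in the sketch below. The labels, scores, and the 0.5 decision threshold are toy assumptions; the threshold used in the study is not stated.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Toy labels and predicted probabilities (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.6, 0.55, 0.2, 0.2, 0.1, 0.1]

# Precision, Recall, and F1 need hard labels; 0.5 is an assumed threshold.
y_pred = [int(s >= 0.5) for s in y_score]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_score)         # threshold-free, uses raw scores

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.2f}")
```

Lowering the threshold trades Precision for Recall, which is the direction favored when false negatives are the costlier error; ROC-AUC is unaffected by that choice.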
3.6 Computational Considerations
The analysis was conducted using a lightweight but modular data science stack centered around Python. Key components included Pandas for data wrangling, Scikit-learn for machine learning workflows, and XGBoost for boosting-based classifiers. All model training, evaluation, and preprocessing scripts were containerized to enable consistent runtime environments and straightforward deployment across research or clinical testing environments. The entire workflow is designed to run efficiently on a single-machine setup, with optional scaling via parallelization in future iterations.
All scripts, model artifacts, and a subset of the cleaned dataset will be publicly released via GitHub. The aim is to ensure reproducibility, support external validation, and encourage model refinement by others in the research and public health communities.
3.7 Research Questions
- What are the strongest predictors of maternal morbidity in Indiana birth data?
- Which classification algorithm performs best on these features?
- How do data limitations constrain modeling and policy applications?
3.8 Data Dictionary
- MOTHER_ID: Unique identifier for the mother; anonymized.
- NUM_BIRTHS_BY_MOTHER: Total number of recorded births by the mother.
- CHILD_ID: Unique identifier for the child; anonymized.
- CHILD_BIRTH_YR_GRP: Grouped year of child’s birth (e.g., 2010–2012).
- MOTHER_AGE_GRP: Grouped age of the mother at birth (e.g., 25–34 years).
- MOTHER_MARITALSTATUS_AT_BIRTH: Mother’s marital status at time of birth (e.g., Married, Never Married, Divorced).
- MOTHER_RESID_COUNTY_TYPE: Type of county (Urban, Rural, Rural/Mixed) where the mother resided.
- BIRTH_COMPLICATIONS: Indicator of any birth complications (Yes/No).
- DIABETES_RISK_PREPREGNANCY: Presence of pre-existing diabetes risk before pregnancy (Yes/No).
- DIABETES_RISK_GESTATIONAL: Presence of gestational diabetes risk during pregnancy (Yes/No).
- SMOKING_IND: Indicator if the mother ever smoked (Yes/No/Unknown).
- SMOKING_DURING_PREG_IND: Indicator if the mother smoked during pregnancy (Yes/No/Unknown).
- SMOKING_BEFORE_PREG_IND: Indicator if the mother smoked before pregnancy (Yes/No/Unknown).
- VISITS_IN_1ST_TRIMESTER_IND: Indicator whether the mother had a prenatal care visit in the first trimester (Yes/No/Unknown).
- CHILD_GENDER: Biological sex of the child (Male/Female).
- CHILD_TRANSFERRED: Indicator if the child was transferred to a higher-level facility after birth (Yes/No).
- LOW_BIRTH_WEIGHT: Indicator if the child was born with low birth weight (<2,500 grams) (Yes/No).
- CHILD_BREASTFED: Indicator if the child was breastfed at any point (Yes/No).
- CHILD_ALIVE: Indicator if the child was alive at hospital discharge (Yes/No).
4. Results
4.1 Demographic and Clinical Factors
Of the 813,837 births analyzed, 1.6% involved a maternal morbidity event. Morbidity was significantly associated with advanced maternal age (≥35 years), late or inadequate prenatal care, and smoking during pregnancy. Rural mothers had lower prenatal care attendance and slightly elevated complication rates.
4.2 Model Performance
| Model | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|
| Logistic Regression | 0.75 | 0.70 | 0.72 | 0.81 |
| Random Forest | 0.80 | 0.77 | 0.78 | 0.86 |
| Gradient Boosting | 0.83 | 0.80 | 0.81 | 0.89 |
Gradient Boosting delivered the best overall performance, likely due to its ability to capture complex, nonlinear relationships between risk factors and to adapt to imbalanced outcome distributions. Even Logistic Regression, despite its simplicity, yielded viable results, suggesting that key predictors exert strong, detectable influences on maternal morbidity risk.
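As a quick consistency check on the table, the F1-Score of a single class is the harmonic mean of its Precision and Recall. The sketch below recomputes it from the reported values, assuming the table reports metrics for one class rather than macro or weighted averages (an assumption; the averaging scheme is not stated).

```python
# Reported (Precision, Recall, F1-Score) values from the performance table.
reported = {
    "Logistic Regression": (0.75, 0.70, 0.72),
    "Random Forest": (0.80, 0.77, 0.78),
    "Gradient Boosting": (0.83, 0.80, 0.81),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1_reported) in reported.items():
    print(f"{name}: recomputed {recomputed[name]:.3f}, reported {f1_reported:.2f}")
```

Under that assumption, all three recomputed values round to the reported F1-Scores.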
4.3 Visual Summaries
- Figure 1: Maternal Age Distribution by Morbidity Status
- Figure 2: ROC Curve Comparisons
5. Discussion
Our findings confirm known risk factors and demonstrate that even limited administrative datasets can be leveraged for moderately accurate maternal health predictions. Gradient Boosting slightly outperformed other models, but all classifiers showed utility.
However, the absence of socioeconomic data significantly hinders interpretability and generalizability. Without indicators like income or education, model outputs may omit key causal pathways and risk profiles. This limitation points to a broader failure of U.S. health data infrastructure to support equity-informed analytics.
From a policy standpoint, predictive models offer triage support and early-warning capabilities—but only if their constraints are acknowledged. Future work should embed models in multi-source systems incorporating EHRs, census data, and community-level indicators.
6. Limitations
- Only administrative birth data used (no pre/postpartum data)
- Lacks income, education, race/ethnicity variables
- Binary classification may obscure complication severity
- Findings are based on Indiana data only; further validation is needed before applying them to other states or regions
These limitations restrict the application of the model beyond descriptive insight and basic triage use cases.
7. Conclusion
This study offers a baseline for maternal health prediction using publicly available state data. Results affirm the role of known risk factors and demonstrate the conditional promise of machine learning in population health. However, the findings also underline the inadequacy of current data systems for meaningful equity-centered interventions.
Our approach reflects a commitment to openness and replication. The models, codebase, and supporting documentation will be fully open-sourced to ensure accessibility for academic and clinical partners. By providing a transparent foundation, we hope to foster collaborative model improvement and accelerate real-world applications that reduce maternal morbidity across similar regions.
Progress will require investment in robust, inclusive health data infrastructure and methodological transparency in model deployment. This work contributes to that foundation, offering a reproducible, adaptable pipeline for early-stage maternal health risk prediction.