AUTHOR=Sania Ayesha , Pini Nicolò , Nelson Morgan E. , Myers Michael M. , Shuffrey Lauren C. , Lucchini Maristella , Elliott Amy J. , Odendaal Hein J. , Fifer William P.
TITLE=K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data
JOURNAL=Advances in Drug and Alcohol Research
VOLUME=4
YEAR=2025
URL=https://www.frontierspartnerships.org/journals/advances-in-drug-and-alcohol-research/articles/10.3389/adar.2024.13449
DOI=10.3389/adar.2024.13449
ISSN=2674-0001
ABSTRACT=Aims: The objective of this study is to illustrate the application of a machine learning algorithm, k-nearest neighbor (k-NN), to impute missing alcohol data in a prospective study among pregnant women.
Methods: We used data from the Safe Passage study (n = 11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method, which generated a variable amount of missing data per participant. Of the 3.2 million person-days of observation, data were missing for 0.36 million (11.4%). Using k-NN, imputed values were weighted by distance and matched for day of the week. Since participants with no missing days were not comparable to those with missing data, segments of non-missing data from all participants were included as a reference. Validation was done after randomly deleting data for 5–15 consecutive days from the first trimester.
Results: We found that data from 5 nearest neighbors (i.e., K = 5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from the first trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within ±1 drink/day of the actual values. Imputation accuracy varied by study site because of differences in the magnitude of drinking and the proportion of missing data.
Conclusion: k-NN can be used to impute missing data from longitudinal studies of alcohol during pregnancy with high accuracy.
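
The imputation step described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `aligned_windows` and `knn_impute_segment`, the Euclidean distance, and the inverse-distance weighting formula are all assumptions made for illustration. Only the use of K = 5 neighbors, distance weighting, day-of-week matching, and fully observed reference segments come from the abstract.

```python
import math


def aligned_windows(series, first_weekday, length, weekday):
    """Yield fully observed windows of `length` days that begin on `weekday`.

    series: daily drink counts for one participant (None = missing).
    first_weekday: weekday (0-6) of series[0]; hypothetical encoding.
    """
    for start in range(len(series) - length + 1):
        if (first_weekday + start) % 7 != weekday:
            continue  # keep only windows matched on day of the week
        window = series[start:start + length]
        if all(v is not None for v in window):
            yield window


def knn_impute_segment(target, donors, k=5):
    """Fill None entries of `target` from the k nearest donor windows.

    target: one window of daily drink counts (None = missing).
    donors: fully observed windows of the same length, already matched
            on day of the week (e.g. produced by aligned_windows).
    """
    observed = [i for i, v in enumerate(target) if v is not None]
    missing = [i for i, v in enumerate(target) if v is None]
    if not observed or not donors:
        raise ValueError("need observed days and at least one donor window")

    # Assumed metric: Euclidean distance over the days the target reported.
    scored = []
    for seg in donors:
        d = math.sqrt(sum((seg[i] - target[i]) ** 2 for i in observed))
        scored.append((d, seg))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]

    # Assumed weighting: inverse distance; the constant avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    total = sum(weights)

    filled = list(target)
    for i in missing:
        filled[i] = sum(w * seg[i] for w, (_, seg) in zip(weights, nearest)) / total
    return filled
```

In this sketch, closer donor windows contribute more to each imputed day, and restricting donors to windows that start on the same weekday keeps weekend and weekday drinking patterns aligned. The imputed values are weighted averages and may be fractional; whether and how to round them is left open here.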