Health care providers, governments, and parents want children to grow up to be healthy, productive citizens. As a result, billions of dollars have been invested in helping families from birth through adulthood. The first step in this process is to ensure that children are born healthy. In 1986, Baystate Medical Center in Springfield, MA collected data on the birth weights of 189 children. The goal here is to use these data to build a few models that identify which factors determine whether a child is born at a healthy weight. These models can then be used to give prospective mothers the best possible information for keeping their children healthy.

This dataset is available as birthwt in the MASS package in R. Some of the variables include the mother’s age and weight and whether the mother smoked during the pregnancy.

The 189 mothers ranged in age from 14 to 45 and had pre-pregnancy weights between 80 and 250 lbs. Of these women, 96 were white, 26 were black, and 67 were of another ethnicity. During the pregnancy, 74 smoked. In addition, 30 had a history of premature labor, 12 had high blood pressure, and 28 had uterine irritability (UI). Another 100 did not visit a physician during the first trimester. A birth weight under 2,500 g (2.5 kg) is classified as low, and 59 of the 189 children had a low birth weight. The numerical birth weights appear to be approximately normally distributed.
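The counts above can be reproduced directly from the data frame. The following is a minimal sketch, assuming the standard coding of `birthwt` in MASS (`race` coded 1/2/3, indicator columns coded 0/1); the object names `age.range`, `lwt.range`, `race.counts`, and `counts` are chosen here for illustration only.

```r
library(MASS)  # provides the birthwt data frame

# Ranges of the mother's age (years) and pre-pregnancy weight (lbs)
age.range <- range(birthwt$age)
lwt.range <- range(birthwt$lwt)

# Counts by race (1 = white, 2 = black, 3 = other)
race.counts <- table(birthwt$race)

# Counts for each characteristic mentioned in the text
counts <- c(
  smokers             = sum(birthwt$smoke == 1),
  premature.labor     = sum(birthwt$ptl > 0),
  high.blood.pressure = sum(birthwt$ht == 1),
  uterine.irritab     = sum(birthwt$ui == 1),
  no.first.tri.visit  = sum(birthwt$ftv == 0),
  low.birth.weight    = sum(birthwt$low == 1)
)

age.range
lwt.range
race.counts
counts

# A histogram gives a quick visual check of how close bwt is to normal
hist(birthwt$bwt, main = "Birth weight", xlab = "Grams")
```

A normal Q-Q plot (`qqnorm(birthwt$bwt)`) would be a reasonable complement to the histogram when judging normality.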

library(MASS)
# Count missing values across every row and column of birthwt
missing.data <- 0
for (i in 1:nrow(birthwt)) {
  for (j in 1:ncol(birthwt)) {
    if (is.na(birthwt[i, j])) {
      missing.data <- missing.data + 1
    }
  }
}
missing.data

[1] 0

Modeling

There are two options for creating a model: predict the numerical birth weight, or predict which children will have a low birth weight. Before either model can be fit, a few categorical variables must be converted to factors so that R treats them as dummy variables. A random sample of 126 observations was used to build the models, while the remaining 63 were set aside for testing.

# Recode the categorical variables as labeled factors and set each reference level
race1 <- factor(birthwt$race, labels = c("white", "black", "other"))
race1 <- relevel(race1, ref = "white")
low1 <- factor(birthwt$low, labels = c("no", "yes"))
low1 <- relevel(low1, ref = "no")
smoke1 <- factor(birthwt$smoke, labels = c("no", "yes"))
smoke1 <- relevel(smoke1, ref = "no")
ht1 <- factor(birthwt$ht, labels = c("no", "yes"))
ht1 <- relevel(ht1, ref = "no")
ui1 <- factor(birthwt$ui, labels = c("no", "yes"))
ui1 <- relevel(ui1, ref = "no")
birthwt1 <- cbind(birthwt, low1, race1, smoke1, ht1, ui1)

# The test set is 63, which plays nice with 189
set.seed(1)
train <- sample(1:nrow(birthwt1), 126)

Linear Regression

To predict the numerical birth weights, an ordinary least squares (OLS) model was fit. The mother’s weight, race, smoking status, and UI were significant, so a reduced model was built with only these variables.
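The fitting code is not shown in the original, so the following is only a sketch of how the two models might be built. It assumes the reduced model uses `lwt`, `race1`, `smoke1`, and `ui1` (which matches the five degrees of freedom in the test output that follows) and that the full model uses every other predictor. `OLS.full` is a hypothetical name; `OLS.reduced` is the name that appears in the output.

```r
library(MASS)

# Rebuild the factor-coded data; the first label is each factor's reference level
birthwt1 <- transform(birthwt,
  race1  = factor(race,  labels = c("white", "black", "other")),
  smoke1 = factor(smoke, labels = c("no", "yes")),
  ht1    = factor(ht,    labels = c("no", "yes")),
  ui1    = factor(ui,    labels = c("no", "yes"))
)
set.seed(1)
train <- sample(1:nrow(birthwt1), 126)

# Full model (hypothetical name): every predictor except the low indicator
OLS.full <- lm(bwt ~ age + lwt + race1 + smoke1 + ptl + ht1 + ui1 + ftv,
               data = birthwt1, subset = train)

# Reduced model: only the predictors reported as significant
OLS.reduced <- lm(bwt ~ lwt + race1 + smoke1 + ui1,
                  data = birthwt1, subset = train)

summary(OLS.reduced)          # coefficients, standard errors, R-squared
anova(OLS.reduced, OLS.full)  # F-test comparing the nested models
```

Because `sample()` changed its default algorithm in R 3.6.0, the rows drawn with `set.seed(1)` may differ from the original split depending on the R version.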

studentized Breusch-Pagan test
data: OLS.reduced
BP = 2.7106, df = 5, p-value = 0.7445

plot(OLS.reduced, which=1)

Based on the residual plot and the insignificant Breusch-Pagan test, heteroscedasticity does not appear to be a concern. The variance inflation factors indicate that multicollinearity is not a concern either.
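The test output above has the form printed by `bptest()` in the lmtest package. To make the calculation explicit, the sketch below computes the same studentized (Koenker) statistic by hand in base R: regress the squared residuals on the model's predictors, then compare n times the auxiliary R-squared to a chi-squared distribution. The model formula is an assumption, and the names `aux`, `bp.stat`, `bp.df`, and `bp.p` are illustrative.

```r
library(MASS)

birthwt1 <- transform(birthwt,
  race1  = factor(race,  labels = c("white", "black", "other")),
  smoke1 = factor(smoke, labels = c("no", "yes")),
  ui1    = factor(ui,    labels = c("no", "yes"))
)
set.seed(1)
train <- sample(1:nrow(birthwt1), 126)
train.data <- birthwt1[train, ]

# Assumed form of the reduced OLS model
OLS.reduced <- lm(bwt ~ lwt + race1 + smoke1 + ui1, data = train.data)

# Auxiliary regression: squared residuals on the same predictors
train.data$res.sq <- residuals(OLS.reduced)^2
aux <- lm(res.sq ~ lwt + race1 + smoke1 + ui1, data = train.data)

# Studentized statistic: n * R-squared of the auxiliary regression,
# chi-squared with one degree of freedom per non-intercept coefficient
bp.stat <- nrow(train.data) * summary(aux)$r.squared
bp.df   <- length(coef(aux)) - 1
bp.p    <- pchisq(bp.stat, df = bp.df, lower.tail = FALSE)

c(BP = bp.stat, df = bp.df, p.value = bp.p)
```

A large p-value means the predictors explain little of the variation in the squared residuals, which is what constant variance would look like.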

Interestingly, non-whites, smokers, and those with a history of UI have smaller children than their counterparts. For each additional pound a mother weighs, the child is predicted to be 44.5 grams heavier. While the OLS model with all of the variables had more predictive power according to the ANOVA test and R-squared, the reduced model was selected because it uses fewer variables. Neither model does a decent job of predicting the birth weight. Additional models, including models with interaction terms, did not add any valuable information.

To show how ineffective this model is at prediction, it was applied to the test data to see how well it can predict whether a child has a low birth weight.

The error rate of 38% is acceptable, but the sensitivity is only 16%, meaning the model does a poor job of predicting which children have a low birth weight.
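One way these rates could be computed is to flag any predicted weight under 2,500 g as low and cross-tabulate against the observed indicator. This is a sketch under that assumption, since the original does not show how predictions were converted to classes; the model formula and the names `pred.low`, `actual.low`, and `conf` are illustrative.

```r
library(MASS)

birthwt1 <- transform(birthwt,
  race1  = factor(race,  labels = c("white", "black", "other")),
  smoke1 = factor(smoke, labels = c("no", "yes")),
  ui1    = factor(ui,    labels = c("no", "yes"))
)
set.seed(1)
train <- sample(1:nrow(birthwt1), 126)
test.data <- birthwt1[-train, ]

# Assumed form of the reduced OLS model
OLS.reduced <- lm(bwt ~ lwt + race1 + smoke1 + ui1,
                  data = birthwt1, subset = train)

# Predict weights for the 63 held-out births and flag those under 2500 g
pred.bwt   <- predict(OLS.reduced, newdata = test.data)
pred.low   <- factor(pred.bwt < 2500, levels = c(FALSE, TRUE),
                     labels = c("no", "yes"))
actual.low <- factor(test.data$low, levels = c(0, 1), labels = c("no", "yes"))

# Classification matrix: rows are predictions, columns are observed classes
conf <- table(predicted = pred.low, actual = actual.low)

error.rate  <- 1 - sum(diag(conf)) / sum(conf)
sensitivity <- conf["yes", "yes"] / sum(conf[, "yes"])  # low births caught
specificity <- conf["no", "no"] / sum(conf[, "no"])     # normal births caught

conf
c(error = error.rate, sensitivity = sensitivity, specificity = specificity)
```

Sensitivity is the share of truly low-weight births that the model flags, which is the quantity that matters most for identifying at-risk pregnancies.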

Binomial Logistic Regression

A more direct attempt at creating a model used binomial logistic regression, where the goal is to predict whether a child is born with a low birth weight. As in the OLS model, race, weight, smoking, and UI were significant.

Non-whites, smokers, and those with a history of UI are more likely to have children with a low birth weight than their respective counterparts, while heavier women are not. To further evaluate how well this model predicts, a classification matrix was created.
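The logistic model and its classification matrix might look like the sketch below. The formula, the 0.5 probability cutoff, and the name `logit.reduced` are assumptions made for illustration; the original does not state them.

```r
library(MASS)

birthwt1 <- transform(birthwt,
  low1   = factor(low,   labels = c("no", "yes")),
  race1  = factor(race,  labels = c("white", "black", "other")),
  smoke1 = factor(smoke, labels = c("no", "yes")),
  ui1    = factor(ui,    labels = c("no", "yes"))
)
set.seed(1)
train <- sample(1:nrow(birthwt1), 126)
test.data <- birthwt1[-train, ]

# Logistic regression on the predictors reported as significant
logit.reduced <- glm(low1 ~ lwt + race1 + smoke1 + ui1,
                     data = birthwt1, subset = train, family = binomial)

# Odds ratios: values above 1 raise the odds of a low birth weight
exp(coef(logit.reduced))

# Predicted probabilities on the test set, classified at an assumed 0.5 cutoff
prob <- predict(logit.reduced, newdata = test.data, type = "response")
pred.low <- factor(prob > 0.5, levels = c(FALSE, TRUE), labels = c("no", "yes"))

conf <- table(predicted = pred.low, actual = test.data$low1)

error.rate  <- 1 - sum(diag(conf)) / sum(conf)
sensitivity <- conf["yes", "yes"] / sum(conf[, "yes"])
specificity <- conf["no", "no"] / sum(conf[, "no"])

conf
c(error = error.rate, sensitivity = sensitivity, specificity = specificity)
```

Lowering the cutoff below 0.5 would trade specificity for sensitivity, which may be worth considering when missing a low-weight birth is the costlier mistake.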

Compared to the OLS model, this model appears to do a worse job of prediction overall. The error rate is higher, while both the sensitivity and specificity are lower.

Classification Tree

A classification tree was also used to predict whether a child had a low birth weight. The full tree contained an entire branch where the result was “no” regardless of the path taken. As a result, the tree was pruned.

In contrast to the previous models, the most important variable appears to be the number of premature labors, followed by pre-pregnancy weight, then age and a history of UI. Interestingly, race and smoking don’t appear to have an effect. The pruned tree was then used to predict whether a child has a low birth weight.
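The original does not say which tree package or pruning rule was used. The sketch below uses rpart, which ships with R, and prunes to the complexity parameter with the lowest cross-validated error; both choices are assumptions, as are the names `tree.full`, `tree.pruned`, and `best.cp`.

```r
library(MASS)
library(rpart)

birthwt1 <- transform(birthwt,
  low1   = factor(low,   labels = c("no", "yes")),
  race1  = factor(race,  labels = c("white", "black", "other")),
  smoke1 = factor(smoke, labels = c("no", "yes")),
  ht1    = factor(ht,    labels = c("no", "yes")),
  ui1    = factor(ui,    labels = c("no", "yes"))
)
set.seed(1)
train <- sample(1:nrow(birthwt1), 126)

# Grow the full classification tree on the training rows
tree.full <- rpart(low1 ~ age + lwt + race1 + smoke1 + ptl + ht1 + ui1 + ftv,
                   data = birthwt1, subset = train, method = "class")

# Pick the complexity parameter with the lowest cross-validated error
cp.table <- tree.full$cptable
best.cp  <- cp.table[which.min(cp.table[, "xerror"]), "CP"]
tree.pruned <- prune(tree.full, cp = best.cp)

# Classify the held-out births and tabulate against the observed classes
pred <- predict(tree.pruned, newdata = birthwt1[-train, ], type = "class")
conf <- table(predicted = pred, actual = birthwt1$low1[-train])

conf
tree.pruned$variable.importance  # NULL if pruning leaves only the root node
```

With only 126 training rows, cross-validated pruning can be unstable and may collapse the tree to a single node, in which case every birth is predicted as the majority class.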

While the predictions are very accurate overall and the tree does a great job of predicting which children won’t have a low birth weight, it does a poor job of identifying which children do. Boosted and bagged trees were also attempted, but led to similar results.

Conclusion

Based on most of the models, race, smoking, and the presence of UI are useful predictors of whether a child is born with a low birth weight. As initially anticipated, there are other factors that can affect a child’s weight that were not present in the data. As bad as smoking is, alcohol and illegal drugs are more harmful to fetal development. Another interesting variable would have been household income, to determine whether there is a disparity between poorer and wealthier households. It is also unknown how many of these women were married or single.

One surprise was that age was not significant in most of the models. Typically, younger women have to be more vigilant about prenatal care because their bodies are less able to handle a pregnancy. There doesn’t appear to be anything unusual about the data on this variable.

The data, and as a result the models, are restricted to Baystate Medical Center alone. Further study of other hospitals, both in the area and outside it, is needed before coming to any major conclusions. However, this analysis is useful as an initial insight into a major health concern.