Low Birth Weights

Low Birth Weights

Introduction

Health care providers, governments, and parents want children to grow up to be healthy, productive citizens. As a result, billions of dollars have been plunged into helping families from birth through adulthood. The first step in this process is to ensure that children are born healthy. In 1986, Baystate Medical Center in Springfield, MA collected data on the birth weights of 189 children. The goal is to use this data to create a few models to determine what factors go into determining whether a child is born at a healthy weight or not. These models can then be used to provide potential mothers with the best information possible for keeping their children healthy.

This dataset is in the MASS package in R. Some of the variables include the mother’s age and weight and whether the mother smoked during the pregnancy.

library(MASS)
data(birthwt)
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

Of these 189 women, the ages ranged from 14 to 45 and the mothers had a pre-pregnancy weight between 80 and 250 lbs. 96 where white, 26 where black, and 67 had another ethnicity. 74 smoked during the pregnancy while 30 had a history of premature labors. 12 had high blood pressure. 28 had a presence of uterine irritability. 100 did not visit a physician during the first trimester. Low birth weight is determined if a child’s birth weight is under 2500 or 2.5 kg. There were 59 children out of 189 which had a low birth weight. It appears that the numerical birth weight data is somewhat normal.

table(birthwt$low)
## 
##   0   1 
## 130  59
barplot(table(birthwt$low), col="green", main="Birth Weight Barplot")

hist(birthwt$bwt, col="red", main="Birth Weight Histogram", xlab="Birth Weight (g)")

qqnorm(birthwt$bwt)
qqline(birthwt$bwt, col="red")

There are two fortunate things about this dataset. None of the predictor variables are highly correlated and there are no missing data points.

cor(birthwt, use="complete.obs")
##               low         age         lwt         race       smoke
## low    1.00000000 -0.11893928 -0.16962694  0.137792751  0.16140431
## age   -0.11893928  1.00000000  0.18007315 -0.172817953 -0.04434618
## lwt   -0.16962694  0.18007315  1.00000000 -0.165048544 -0.04417908
## race   0.13779275 -0.17281795 -0.16504854  1.000000000 -0.33903074
## smoke  0.16140431 -0.04434618 -0.04417908 -0.339030745  1.00000000
## ptl    0.19608727  0.07160639 -0.14002900  0.007951293  0.18755706
## ht     0.15237025 -0.01583700  0.23636040  0.019929917  0.01340704
## ui     0.16904283 -0.07515558 -0.15276317  0.053602088  0.06215900
## ftv   -0.06296026  0.21539394  0.14052746 -0.098336254 -0.02801314
## bwt   -0.78480516  0.09031781  0.18573328 -0.194713487 -0.19044806
##                ptl          ht          ui         ftv         bwt
## low    0.196087267  0.15237025  0.16904283 -0.06296026 -0.78480516
## age    0.071606386 -0.01583700 -0.07515558  0.21539394  0.09031781
## lwt   -0.140029003  0.23636040 -0.15276317  0.14052746  0.18573328
## race   0.007951293  0.01992992  0.05360209 -0.09833625 -0.19471349
## smoke  0.187557063  0.01340704  0.06215900 -0.02801314 -0.19044806
## ptl    1.000000000 -0.01539958  0.22758534 -0.04442966 -0.15465339
## ht    -0.015399579  1.00000000 -0.10858506 -0.07237255 -0.14598189
## ui     0.227585340 -0.10858506  1.00000000 -0.05952341 -0.28392741
## ftv   -0.044429660 -0.07237255 -0.05952341  1.00000000  0.05831777
## bwt   -0.154653390 -0.14598189 -0.28392741  0.05831777  1.00000000
missing.data = 0
for (i in 1:length(birthwt))
{
  for (j in 1:length(birthwt[i,]))
  {
    if (is.na(birthwt[i,j])==TRUE){missing.data=missing.data+1}
  }
}
missing.data
## [1] 0

Modeling

There are two options for creating a model: attempt to predict the numerical birth weights or attempt to predict which children would have a low birth weight. However, there are a few categorical variables that require a transformation into dummy variables. A sample of 126 points were used to build the models while 63 were set aside.

race1 = factor(birthwt$race, labels = c("white", "black", "other"))
race1=relevel(race1, ref="white")
low1 = factor(birthwt$low, labels = c("no", "yes"))
low1=relevel(low1, ref="no")
smoke1=factor(birthwt$smoke, labels = c("no", "yes"))
smoke1=relevel(smoke1, ref="no")
ht1= factor(birthwt$ht, labels = c("no", "yes"))
ht1=relevel(ht1, ref="no")
ui1= factor(birthwt$ui, labels = c("no", "yes"))
ui1=relevel(ui1, ref="no")
birthwt1 = cbind(birthwt, low1, race1, smoke1, ht1, ui1)
set.seed(1)
train = sample(1:nrow(birthwt1), 126) #The test set is 63 which plays nice with 189

Linear Regression

To attempt to predict the birth weights, an OLS model was used. The mother’s weight, race, smoking, and UI showed to be significant, so a model was built with these variables.

library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
OLS.full = lm(bwt~age+lwt+race1+smoke1+ptl+ht1+ui1+ftv, data=birthwt1, subset=train)
summary(OLS.full)
## 
## Call:
## lm(formula = bwt ~ age + lwt + race1 + smoke1 + ptl + ht1 + ui1 + 
##     ftv, data = birthwt1, subset = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1834.37  -365.44    23.04   432.74  1591.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3137.167    381.706   8.219 3.32e-13 ***
## age           -5.684     11.259  -0.505 0.614590    
## lwt            4.463      2.049   2.178 0.031461 *  
## race1black  -645.771    172.322  -3.747 0.000280 ***
## race1other  -444.428    141.484  -3.141 0.002135 ** 
## smoke1yes   -517.407    127.758  -4.050 9.29e-05 ***
## ptl           77.048    124.773   0.618 0.538114    
## ht1yes      -364.046    281.332  -1.294 0.198233    
## ui1yes      -633.979    168.463  -3.763 0.000265 ***
## ftv          -31.554     53.139  -0.594 0.553805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 640.3 on 116 degrees of freedom
## Multiple R-squared:  0.3115, Adjusted R-squared:  0.2581 
## F-statistic: 5.831 on 9 and 116 DF,  p-value: 1.117e-06
library(leaps)
## Warning: package 'leaps' was built under R version 3.5.2
OLS.subset=regsubsets(bwt~age+lwt+race1+smoke1+ptl+ht1+ui1+ftv, data=birthwt1, subset=train)
summary(OLS.subset)
## Subset selection object
## Call: regsubsets.formula(bwt ~ age + lwt + race1 + smoke1 + ptl + ht1 + 
##     ui1 + ftv, data = birthwt1, subset = train)
## 9 Variables  (and intercept)
##            Forced in Forced out
## age            FALSE      FALSE
## lwt            FALSE      FALSE
## race1black     FALSE      FALSE
## race1other     FALSE      FALSE
## smoke1yes      FALSE      FALSE
## ptl            FALSE      FALSE
## ht1yes         FALSE      FALSE
## ui1yes         FALSE      FALSE
## ftv            FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          age lwt race1black race1other smoke1yes ptl ht1yes ui1yes ftv
## 1  ( 1 ) " " " " " "        " "        " "       " " " "    "*"    " "
## 2  ( 1 ) " " " " " "        " "        "*"       " " " "    "*"    " "
## 3  ( 1 ) " " " " " "        "*"        "*"       " " " "    "*"    " "
## 4  ( 1 ) " " " " "*"        "*"        "*"       " " " "    "*"    " "
## 5  ( 1 ) " " "*" "*"        "*"        "*"       " " " "    "*"    " "
## 6  ( 1 ) " " "*" "*"        "*"        "*"       " " "*"    "*"    " "
## 7  ( 1 ) " " "*" "*"        "*"        "*"       " " "*"    "*"    "*"
## 8  ( 1 ) " " "*" "*"        "*"        "*"       "*" "*"    "*"    "*"
OLS.reduced=lm(bwt~race1+smoke1+ui1+lwt, data=birthwt1, subset=train)
summary(OLS.reduced)
## 
## Call:
## lm(formula = bwt ~ race1 + smoke1 + ui1 + lwt, data = birthwt1, 
##     subset = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1924.14  -360.51    49.64   435.04  1475.55 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3077.390    284.859  10.803  < 2e-16 ***
## race1black  -635.440    166.914  -3.807 0.000223 ***
## race1other  -439.324    137.835  -3.187 0.001831 ** 
## smoke1yes   -501.802    124.007  -4.047 9.24e-05 ***
## ui1yes      -594.193    161.488  -3.679 0.000351 ***
## lwt            3.553      1.950   1.822 0.070979 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 636.3 on 120 degrees of freedom
## Multiple R-squared:  0.2967, Adjusted R-squared:  0.2674 
## F-statistic: 10.12 on 5 and 120 DF,  p-value: 4.166e-08
anova(OLS.full, OLS.reduced)
## Analysis of Variance Table
## 
## Model 1: bwt ~ age + lwt + race1 + smoke1 + ptl + ht1 + ui1 + ftv
## Model 2: bwt ~ race1 + smoke1 + ui1 + lwt
##   Res.Df      RSS Df Sum of Sq     F Pr(>F)
## 1    116 47565392                          
## 2    120 48588924 -4  -1023532 0.624 0.6463
vif(OLS.reduced)
##            GVIF Df GVIF^(1/(2*Df))
## race1  1.246028  2        1.056530
## smoke1 1.145375  1        1.070222
## ui1    1.039183  1        1.019403
## lwt    1.134997  1        1.065362
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.5.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.5.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(OLS.reduced, data=birthwt1)
## 
##  studentized Breusch-Pagan test
## 
## data:  OLS.reduced
## BP = 2.1088, df = 5, p-value = 0.8339
plot(OLS.reduced, which=1)

Based on the resiudal plot and the Breusch-Pagan Test being insignificant, heteroscedasticity does not appear to be a concern. The variance inflation factors indicate that multicollinearity is not a concern either.

Interestingly, non-whites, smokers, and those with a history of UI have smaller children than their counterparts. For each additional pound a mother weighs, the child is 44.5 grams heavier. While the OLS model with all of the variables had more predictive power according to the ANOVA test, and R-squared, the reduced model was selected due to less variables being used. Neither of these models do a decent job of predicting the birth weight. Additional models, including interactive models, did not add any valuable information.

To show how ineffective this model is at prediction, it was used against the test data to show how well it can predict if a child has a low birth weight.

reg.pred = predict(OLS.reduced, birthwt1[-train,])
reg.low=ifelse(reg.pred>2500, 0, 1)
reg.table=table(reg.low, birthwt1$low[-train])
reg.table
##        
## reg.low  0  1
##       0 33 19
##       1  5  6
true.negatives.reg = reg.table[1,1]
true.positives.reg = reg.table[2,2]

false.positives.reg = reg.table[2,1]
false.negatives.reg = reg.table[1,2]

all.positives.reg = true.positives.reg+false.negatives.reg
all.negatives.reg = true.negatives.reg+false.positives.reg 

all.obs.reg = all.positives.reg+all.negatives.reg
error.rate.reg = (false.positives.reg+false.negatives.reg)/all.obs.reg

sensitivity.reg=true.positives.reg/all.positives.reg
specificity.reg=true.negatives.reg/all.negatives.reg

results.reg = c(error.rate.reg, sensitivity.reg, specificity.reg)
names(results.reg) = c("Error Rate", "Sensitivity", "Specificity")
results.reg
##  Error Rate Sensitivity Specificity 
##   0.3809524   0.2400000   0.8684211

There is an acceptable error rate of 38%, but the sensitivity rate is only 16%, meaning it doesn’t do a good job of prediciting which children have a low birth weight.

Binomial Logistic Regression

A better attempt at creating a model was done using binomial logistic regression. The goal is to predict if a child is born with a low birth weight. Similar to the OLS model, Race, weight, smoking, and UI showed to be significant.

bi.log = glm(low1~lwt+race1+smoke1+ui1, data=birthwt1, subset=train, 
             family=binomial(link="logit"))
summary(bi.log)
## 
## Call:
## glm(formula = low1 ~ lwt + race1 + smoke1 + ui1, family = binomial(link = "logit"), 
##     data = birthwt1, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5312  -0.7353  -0.4968   0.7137   2.2064  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -0.58035    1.18819  -0.488  0.62525   
## lwt         -0.01602    0.00865  -1.852  0.06406 . 
## race1black   1.77052    0.65778   2.692  0.00711 **
## race1other   1.06012    0.59395   1.785  0.07428 . 
## smoke1yes    1.54959    0.52750   2.938  0.00331 **
## ui1yes       1.25122    0.56594   2.211  0.02705 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 146.94  on 125  degrees of freedom
## Residual deviance: 122.28  on 120  degrees of freedom
## AIC: 134.28
## 
## Number of Fisher Scoring iterations: 5
log.odds = coef(bi.log)
odds <- exp(log.odds)
prob = odds/(1+odds)
cbind(log.odds, odds, prob)
##                log.odds      odds      prob
## (Intercept) -0.58034936 0.5597028 0.3588522
## lwt         -0.01601724 0.9841103 0.4959958
## race1black   1.77052381 5.8739294 0.8545228
## race1other   1.06011723 2.8867094 0.7427129
## smoke1yes    1.54959202 4.7095484 0.8248548
## ui1yes       1.25121618 3.4945904 0.7775103

Non-whites, smokers, or those with a history of UI are more likely to have children with a low birth weight than their respective counterparts while heavier women are not. To further evaluate how this model does at prediction, a classification matrix was creatred.

log.pred = predict(bi.log, birthwt1[-train,], type="response")
pred.log.low = ifelse(log.pred>0.5, 1, 0)
log.table=table(pred.log.low, birthwt1$low[-train])
true.negatives = log.table[1,1]
true.positives = log.table[2,2]

false.positives = log.table[2,1]
false.negatives = log.table[1,2]

all.positives = true.positives+false.negatives 
all.negatives = true.negatives+false.positives 

all.obs = all.positives+all.negatives


error.rate.log = (false.positives+false.negatives)/all.obs

sensitivity.log=true.positives/all.positives
specificity.log=true.negatives/all.negatives

results.log = c(error.rate.log, sensitivity.log, specificity.log)
names(results.log) = c("Error Rate", "Sensitivity", "Specificity")
results.log
##  Error Rate Sensitivity Specificity 
##   0.4126984   0.2000000   0.8421053
rbind(results.reg, results.log)
##             Error Rate Sensitivity Specificity
## results.reg  0.3809524        0.24   0.8684211
## results.log  0.4126984        0.20   0.8421053

In comparison to the OLS model, this appears to be a worse job of prediction overall. The error rate is higher while both the sensitivity and specificity rates are lower.

Classification Tree

Another attempt involving a classification tree was used to determine if a child had a low birth weight. The full tree indicated an entire node where the results were “no” regardless of the path. As a result, the tree was pruned.

library(tree)
## Warning: package 'tree' was built under R version 3.5.2
class.tree = tree(low1~age+lwt+race1+smoke1+ptl+ht1+ui1+ftv, data=birthwt1, subset=train)
plot(class.tree)
text(class.tree,pretty=0)

prune.classtree.fit= prune.misclass(class.tree,best=5)
plot(prune.classtree.fit)
text(prune.classtree.fit, pretty=0)

Compared to the previous models, the most important variable appears to be the number of premature labors followed by pre-pregnancy weight then age and a history of UI. Interestingly, race and smoking doesn’t appear to have an effect. Using the pruned tree, an attempt was made to predict if a child has a low birth weight.

prune.predict=predict(prune.classtree.fit, birthwt1[-train,], type="class")
prune.predict.table=table(prune.predict, birthwt1$low[-train])
prune.predict.table
##              
## prune.predict  0  1
##           no  37 17
##           yes  1  8
true.negatives.prune.predict = prune.predict.table[1,1]
true.positives.prune.predict = prune.predict.table[2,2]

false.positives.prune.predict = prune.predict.table[2,1]
false.negatives.prune.predict = prune.predict.table[1,2]

all.positives.prune.predict = true.negatives.prune.predict+false.negatives.prune.predict 
all.negatives.prune.predict = true.negatives.prune.predict+false.positives.prune.predict 

all.obs.prune.predict = all.positives.prune.predict+all.negatives.prune.predict


error.rate.prune.predict = (false.positives.prune.predict+false.negatives.prune.predict)/all.obs.prune.predict

sensitivity.prune.predict=true.positives.prune.predict/all.positives.prune.predict
specificity.prune.predict=true.negatives.prune.predict/all.negatives.prune.predict

results.prune.predict = c(error.rate.prune.predict, sensitivity.prune.predict, specificity.prune.predict)
names(results.prune.predict) = c("Error Rate", "Sensitivity", "Specificity")
results.prune.predict
##  Error Rate Sensitivity Specificity 
##   0.1956522   0.1481481   0.9736842
rbind(results.reg, results.log, results.prune.predict)
##                       Error Rate Sensitivity Specificity
## results.reg            0.3809524   0.2400000   0.8684211
## results.log            0.4126984   0.2000000   0.8421053
## results.prune.predict  0.1956522   0.1481481   0.9736842

While the predictions overall are very accurate and it does a great job of predicting which children won’t have a low birth weight, it does a poor job of indicating which children do have a low birth weight. Boosting and bagging trees were also attempted, but lead to similar results.

Conclusion

Based on most of the models, race, smoking, and UI presence are useful predictors for if a child is born with a low birth rate. As initially anticipated, there are other factors which can determine a child’s weight that was not presented in the data. For as bad as smoking is, alcohol and illegal drugs are more harmful to fetal development. Another interesting variable would have been household income to determine if there is a disparity between poorer and wealthier households. It is unknown how many of these women were married or single.

hist(birthwt$age, col="cyan", main="Age Histogram", xlab="Mother's Age")

One somewhat shocking surprise was that age was not significant in most of the models. Typically, younger women have to be more viligant on prenatal car because their bodies are less likely to handle a child. There doesn’t appear to be anything unusual about the data on this variable.

The data, and as a result the model, is restricted to only the Baystate Medical Center. A further study of other hospitals in the area and outside the area is needed to come to any major conclusions. However, it is useful as an initial insight into a major health concern.

Avatar
Robert Underwood
Budding Data Analyst