Low Birth Weights

data analysis

predictive analytics

Author

Robert Underwood

Published

March 12, 2017

Introduction

Health care providers, governments, and parents want children to grow up to be healthy, productive citizens. As a result, billions of dollars have been plunged into helping families from birth through adulthood. The first step in this process is to ensure that children are born healthy. In 1986, Baystate Medical Center in Springfield, MA collected data on the birth weights of 189 children. The goal is to use this data to create a few models to determine what factors go into determining whether a child is born at a healthy weight or not. These models can then be used to provide potential mothers with the best information possible for keeping their children healthy.

This dataset is in the MASS package in R. Some of the variables include the mother’s age and weight and whether the mother smoked during the pregnancy.

library(MASS)
data(birthwt)
head(birthwt)

   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622

Of these 189 women, the ages ranged from 14 to 45 and the mothers had a pre-pregnancy weight between 80 and 250 lbs. 96 where white, 26 where black, and 67 had another ethnicity. 74 smoked during the pregnancy while 30 had a history of premature labors. 12 had high blood pressure. 28 had a presence of uterine irritability. 100 did not visit a physician during the first trimester. Low birth weight is determined if a child’s birth weight is under 2500 or 2.5 kg. There were 59 children out of 189 which had a low birth weight. It appears that the numerical birth weight data is somewhat normal.

table(birthwt$low)


  0   1 
130  59

barplot(table(birthwt$low), col="green", main="Birth Weight Barplot")

hist(birthwt$bwt, col="red", main="Birth Weight Histogram", xlab="Birth Weight (g)")

qqnorm(birthwt$bwt)
qqline(birthwt$bwt, col="red")

There are two fortunate things about this dataset. None of the predictor variables are highly correlated and there are no missing data points.

cor(birthwt, use="complete.obs")

              low         age         lwt         race       smoke          ptl
low    1.00000000 -0.11893928 -0.16962694  0.137792751  0.16140431  0.196087267
age   -0.11893928  1.00000000  0.18007315 -0.172817953 -0.04434618  0.071606386
lwt   -0.16962694  0.18007315  1.00000000 -0.165048544 -0.04417908 -0.140029003
race   0.13779275 -0.17281795 -0.16504854  1.000000000 -0.33903074  0.007951293
smoke  0.16140431 -0.04434618 -0.04417908 -0.339030745  1.00000000  0.187557063
ptl    0.19608727  0.07160639 -0.14002900  0.007951293  0.18755706  1.000000000
ht     0.15237025 -0.01583700  0.23636040  0.019929917  0.01340704 -0.015399579
ui     0.16904283 -0.07515558 -0.15276317  0.053602088  0.06215900  0.227585340
ftv   -0.06296026  0.21539394  0.14052746 -0.098336254 -0.02801314 -0.044429660
bwt   -0.78480516  0.09031781  0.18573328 -0.194713487 -0.19044806 -0.154653390
               ht          ui         ftv         bwt
low    0.15237025  0.16904283 -0.06296026 -0.78480516
age   -0.01583700 -0.07515558  0.21539394  0.09031781
lwt    0.23636040 -0.15276317  0.14052746  0.18573328
race   0.01992992  0.05360209 -0.09833625 -0.19471349
smoke  0.01340704  0.06215900 -0.02801314 -0.19044806
ptl   -0.01539958  0.22758534 -0.04442966 -0.15465339
ht     1.00000000 -0.10858506 -0.07237255 -0.14598189
ui    -0.10858506  1.00000000 -0.05952341 -0.28392741
ftv   -0.07237255 -0.05952341  1.00000000  0.05831777
bwt   -0.14598189 -0.28392741  0.05831777  1.00000000

missing.data <- 0
for (i in 1:length(birthwt))
{
  for (j in 1:length(birthwt[i,]))
  {
    if (is.na(birthwt[i,j])==TRUE){missing.data=missing.data+1}
  }
}
missing.data

[1] 0

Modeling

There are two options for creating a model: attempt to predict the numerical birth weights or attempt to predict which children would have a low birth weight. However, there are a few categorical variables that require a transformation into dummy variables. A sample of 126 points were used to build the models while 63 were set aside.

race1 = factor(birthwt$race, labels = c("white", "black", "other"))
race1=relevel(race1, ref="white")
low1 = factor(birthwt$low, labels = c("no", "yes"))
low1=relevel(low1, ref="no")
smoke1=factor(birthwt$smoke, labels = c("no", "yes"))
smoke1=relevel(smoke1, ref="no")
ht1= factor(birthwt$ht, labels = c("no", "yes"))
ht1=relevel(ht1, ref="no")
ui1= factor(birthwt$ui, labels = c("no", "yes"))
ui1=relevel(ui1, ref="no")
birthwt1 = cbind(birthwt, low1, race1, smoke1, ht1, ui1)
set.seed(1)
train = sample(1:nrow(birthwt1), 126) #The test set is 63 which plays nice with 189

Linear Regression

To attempt to predict the birth weights, an OLS model was used. The mother’s weight, race, smoking, and UI showed to be significant, so a model was built with these variables.

OLS.full = lm(bwt~age+lwt+race1+smoke1+ptl+ht1+ui1+ftv, data=birthwt1, subset=train)
summary(OLS.full)


Call:
lm(formula = bwt ~ age + lwt + race1 + smoke1 + ptl + ht1 + ui1 + 
    ftv, data = birthwt1, subset = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1827.43  -481.96    71.94   455.52  1669.00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2982.620    388.492   7.677 5.61e-12 ***
age           -2.600     11.999  -0.217  0.82886    
lwt            3.643      2.016   1.807  0.07335 .  
race1black  -475.292    174.895  -2.718  0.00758 ** 
race1other  -394.626    151.534  -2.604  0.01041 *  
smoke1yes   -424.881    140.230  -3.030  0.00302 ** 
ptl          -19.328    139.049  -0.139  0.88969    
ht1yes      -461.350    224.715  -2.053  0.04232 *  
ui1yes      -546.909    176.002  -3.107  0.00237 ** 
ftv            7.241     58.408   0.124  0.90156    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 678.6 on 116 degrees of freedom
Multiple R-squared:  0.2562,    Adjusted R-squared:  0.1985 
F-statistic: 4.441 on 9 and 116 DF,  p-value: 5.181e-05

OLS.subset=regsubsets(bwt~age+lwt+race1+smoke1+ptl+ht1+ui1+ftv, data=birthwt1, subset=train)
summary(OLS.subset)

Subset selection object
Call: regsubsets.formula(bwt ~ age + lwt + race1 + smoke1 + ptl + ht1 + 
    ui1 + ftv, data = birthwt1, subset = train)
9 Variables  (and intercept)
           Forced in Forced out
age            FALSE      FALSE
lwt            FALSE      FALSE
race1black     FALSE      FALSE
race1other     FALSE      FALSE
smoke1yes      FALSE      FALSE
ptl            FALSE      FALSE
ht1yes         FALSE      FALSE
ui1yes         FALSE      FALSE
ftv            FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         age lwt race1black race1other smoke1yes ptl ht1yes ui1yes ftv
1  ( 1 ) " " " " " "        " "        " "       " " " "    "*"    " "
2  ( 1 ) " " " " " "        " "        "*"       " " " "    "*"    " "
3  ( 1 ) " " " " " "        "*"        "*"       " " " "    "*"    " "
4  ( 1 ) " " " " "*"        "*"        "*"       " " " "    "*"    " "
5  ( 1 ) " " " " "*"        "*"        "*"       " " "*"    "*"    " "
6  ( 1 ) " " "*" "*"        "*"        "*"       " " "*"    "*"    " "
7  ( 1 ) "*" "*" "*"        "*"        "*"       " " "*"    "*"    " "
8  ( 1 ) "*" "*" "*"        "*"        "*"       "*" "*"    "*"    " "

OLS.reduced=lm(bwt~race1+smoke1+ui1+lwt, data=birthwt1, subset=train)
summary(OLS.reduced)


Call:
lm(formula = bwt ~ race1 + smoke1 + ui1 + lwt, data = birthwt1, 
    subset = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1833.59  -469.16    53.27   474.21  1637.89 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3014.559    284.207  10.607  < 2e-16 ***
race1black  -486.211    168.628  -2.883  0.00466 ** 
race1other  -425.129    147.035  -2.891  0.00456 ** 
smoke1yes   -444.405    134.768  -3.298  0.00128 ** 
ui1yes      -516.727    172.780  -2.991  0.00338 ** 
lwt            2.744      1.898   1.446  0.15071    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 679.8 on 120 degrees of freedom
Multiple R-squared:  0.2279,    Adjusted R-squared:  0.1958 
F-statistic: 7.085 on 5 and 120 DF,  p-value: 7.74e-06

anova(OLS.full, OLS.reduced)

Analysis of Variance Table

Model 1: bwt ~ age + lwt + race1 + smoke1 + ptl + ht1 + ui1 + ftv
Model 2: bwt ~ race1 + smoke1 + ui1 + lwt
  Res.Df      RSS Df Sum of Sq     F Pr(>F)
1    116 53418660                          
2    120 55452333 -4  -2033673 1.104 0.3581

vif(OLS.reduced)

           GVIF Df GVIF^(1/(2*Df))
race1  1.218502  2        1.050646
smoke1 1.113303  1        1.055132
ui1    1.042351  1        1.020956
lwt    1.145461  1        1.070262

bptest(OLS.reduced, data=birthwt1)


    studentized Breusch-Pagan test

data:  OLS.reduced
BP = 2.7106, df = 5, p-value = 0.7445

plot(OLS.reduced, which=1)

Based on the resiudal plot and the Breusch-Pagan Test being insignificant, heteroscedasticity does not appear to be a concern. The variance inflation factors indicate that multicollinearity is not a concern either.

Interestingly, non-whites, smokers, and those with a history of UI have smaller children than their counterparts. For each additional pound a mother weighs, the child is 44.5 grams heavier. While the OLS model with all of the variables had more predictive power according to the ANOVA test, and R-squared, the reduced model was selected due to less variables being used. Neither of these models do a decent job of predicting the birth weight. Additional models, including interactive models, did not add any valuable information.

To show how ineffective this model is at prediction, it was used against the test data to show how well it can predict if a child has a low birth weight.

reg.pred = predict(OLS.reduced, birthwt1[-train,])
reg.low=ifelse(reg.pred>2500, 0, 1)
reg.table=table(reg.low, birthwt1$low[-train])
reg.table

       
reg.low  0  1
      0 44  9
      1  5  5

true.negatives.reg = reg.table[1,1]
true.positives.reg = reg.table[2,2]

false.positives.reg = reg.table[2,1]
false.negatives.reg = reg.table[1,2]

all.positives.reg = true.positives.reg+false.negatives.reg
all.negatives.reg = true.negatives.reg+false.positives.reg 

all.obs.reg = all.positives.reg+all.negatives.reg
error.rate.reg = (false.positives.reg+false.negatives.reg)/all.obs.reg

sensitivity.reg=true.positives.reg/all.positives.reg
specificity.reg=true.negatives.reg/all.negatives.reg

results.reg = c(error.rate.reg, sensitivity.reg, specificity.reg)
names(results.reg) = c("Error Rate", "Sensitivity", "Specificity")
results.reg

 Error Rate Sensitivity Specificity 
  0.2222222   0.3571429   0.8979592

There is an acceptable error rate of 38%, but the sensitivity rate is only 16%, meaning it doesn’t do a good job of prediciting which children have a low birth weight.

##Binomial Logistic Regression A better attempt at creating a model was done using binomial logistic regression. The goal is to predict if a child is born with a low birth weight. Similar to the OLS model, Race, weight, smoking, and UI showed to be significant. ::: {.cell}

bi.log = glm(low1~lwt+race1+smoke1+ui1, data=birthwt1, subset=train, 
             family=binomial(link="logit"))
summary(bi.log)


Call:
glm(formula = low1 ~ lwt + race1 + smoke1 + ui1, family = binomial(link = "logit"), 
    data = birthwt1, subset = train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.489227   0.977293  -0.501  0.61666   
lwt         -0.011066   0.006699  -1.652  0.09856 . 
race1black   1.205696   0.575797   2.094  0.03626 * 
race1other   1.203193   0.523947   2.296  0.02165 * 
smoke1yes    1.432182   0.475847   3.010  0.00261 **
ui1yes       0.899092   0.549993   1.635  0.10211   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 164.24  on 125  degrees of freedom
Residual deviance: 142.33  on 120  degrees of freedom
AIC: 154.33

Number of Fisher Scoring iterations: 4

log.odds = coef(bi.log)
odds <- exp(log.odds)
prob = odds/(1+odds)
cbind(log.odds, odds, prob)

               log.odds      odds      prob
(Intercept) -0.48922734 0.6130999 0.3800756
lwt         -0.01106614 0.9889949 0.4972335
race1black   1.20569636 3.3390835 0.7695366
race1other   1.20319302 3.3307351 0.7690923
smoke1yes    1.43218211 4.1878275 0.8072411
ui1yes       0.89909172 2.4573701 0.7107628

::: Non-whites, smokers, or those with a history of UI are more likely to have children with a low birth weight than their respective counterparts while heavier women are not. To further evaluate how this model does at prediction, a classification matrix was creatred.

log.pred = predict(bi.log, birthwt1[-train,], type="response")
pred.log.low = ifelse(log.pred>0.5, 1, 0)
log.table=table(pred.log.low, birthwt1$low[-train])
true.negatives = log.table[1,1]
true.positives = log.table[2,2]

false.positives = log.table[2,1]
false.negatives = log.table[1,2]

all.positives = true.positives+false.negatives 
all.negatives = true.negatives+false.positives 

all.obs = all.positives+all.negatives


error.rate.log = (false.positives+false.negatives)/all.obs

sensitivity.log=true.positives/all.positives
specificity.log=true.negatives/all.negatives

results.log = c(error.rate.log, sensitivity.log, specificity.log)
names(results.log) = c("Error Rate", "Sensitivity", "Specificity")
results.log

 Error Rate Sensitivity Specificity 
  0.2698413   0.3571429   0.8367347

rbind(results.reg, results.log)

            Error Rate Sensitivity Specificity
results.reg  0.2222222   0.3571429   0.8979592
results.log  0.2698413   0.3571429   0.8367347

In comparison to the OLS model, this appears to be a worse job of prediction overall. The error rate is higher while both the sensitivity and specificity rates are lower.

##Classification Tree Another attempt involving a classification tree was used to determine if a child had a low birth weight. The full tree indicated an entire node where the results were “no” regardless of the path. As a result, the tree was pruned. ::: {.cell}

library(tree)
class.tree = tree(low1~age+lwt+race1+smoke1+ptl+ht1+ui1+ftv, data=birthwt1, subset=train)
plot(class.tree)
text(class.tree,pretty=0)

prune.classtree.fit= prune.misclass(class.tree,best=5)
plot(prune.classtree.fit)
text(prune.classtree.fit, pretty=0)

:::

Compared to the previous models, the most important variable appears to be the number of premature labors followed by pre-pregnancy weight then age and a history of UI. Interestingly, race and smoking doesn’t appear to have an effect. Using the pruned tree, an attempt was made to predict if a child has a low birth weight.

prune.predict=predict(prune.classtree.fit, birthwt1[-train,], type="class")
prune.predict.table=table(prune.predict, birthwt1$low[-train])
prune.predict.table

             
prune.predict  0  1
          no  40 10
          yes  9  4

true.negatives.prune.predict = prune.predict.table[1,1]
true.positives.prune.predict = prune.predict.table[2,2]

false.positives.prune.predict = prune.predict.table[2,1]
false.negatives.prune.predict = prune.predict.table[1,2]

all.positives.prune.predict = true.negatives.prune.predict+false.negatives.prune.predict 
all.negatives.prune.predict = true.negatives.prune.predict+false.positives.prune.predict 

all.obs.prune.predict = all.positives.prune.predict+all.negatives.prune.predict


error.rate.prune.predict = (false.positives.prune.predict+false.negatives.prune.predict)/all.obs.prune.predict

sensitivity.prune.predict=true.positives.prune.predict/all.positives.prune.predict
specificity.prune.predict=true.negatives.prune.predict/all.negatives.prune.predict

results.prune.predict = c(error.rate.prune.predict, sensitivity.prune.predict, specificity.prune.predict)
names(results.prune.predict) = c("Error Rate", "Sensitivity", "Specificity")
results.prune.predict

 Error Rate Sensitivity Specificity 
  0.1919192   0.0800000   0.8163265

rbind(results.reg, results.log, results.prune.predict)

                      Error Rate Sensitivity Specificity
results.reg            0.2222222   0.3571429   0.8979592
results.log            0.2698413   0.3571429   0.8367347
results.prune.predict  0.1919192   0.0800000   0.8163265

While the predictions overall are very accurate and it does a great job of predicting which children won’t have a low birth weight, it does a poor job of indicating which children do have a low birth weight. Boosting and bagging trees were also attempted, but lead to similar results.

Conclusion

Based on most of the models, race, smoking, and UI presence are useful predictors for if a child is born with a low birth rate. As initially anticipated, there are other factors which can determine a child’s weight that was not presented in the data. For as bad as smoking is, alcohol and illegal drugs are more harmful to fetal development. Another interesting variable would have been household income to determine if there is a disparity between poorer and wealthier households. It is unknown how many of these women were married or single.

hist(birthwt$age, col="cyan", main="Age Histogram", xlab="Mother's Age")

One somewhat shocking surprise was that age was not significant in most of the models. Typically, younger women have to be more viligant on prenatal car because their bodies are less likely to handle a child. There doesn’t appear to be anything unusual about the data on this variable.

The data, and as a result the model, is restricted to only the Baystate Medical Center. A further study of other hospitals in the area and outside the area is needed to come to any major conclusions. However, it is useful as an initial insight into a major health concern.