Stratification within PSUs

The National Health Interview Survey (NHIS) 1987-1994 used a multistage sampling design with stratification within PSUs. Data are available for 1992 from NCHS and for 1991 from the Survey Documentation and Analysis site at Berkeley. NCHS gives guidelines on how to analyze these data using SUDAAN. We will do the same analyses in R. We load the "survey" package and the data (downloaded from SDA). Since there are some second-stage strata with only one unit sampled we set options(survey.lonely.psu) to permit this.

> library(survey)
> options(survey.lonely.psu="remove")
> nhis<-read.dta("nhis.dta")

The analysis recommended for packages other than SUDAAN uses pseudo-stratum and pseudo-psu variables that we must create before creating the design object.

> nhis$cstratum<-nhis$psupseud %/% 10
> nhis$cpsu<-nhis$psupseud %% 10
>
> dnhis1<-svydesign(id = ~cpsu, strata = ~cstratum, weight = ~wtfa,
+     data = nhis,nest=TRUE)

For the full multistage design object we need to create variables that describe the finite population corrections. Data are not provided on population sizes, so we pretend that the non-certainty PSUs are sampled with replacement. We could also use a small number such as 0.01 rather than 0. Since we are specifying the sampling weights separately this would not have much impact. Of course, if we were computing the sampling weights from the population sizes it would be critical to get them right.

> nhis$selfrep<-ifelse(nhis$psutype =="MSA- self-representing",1,0)
> nhis$zero<-rep(0,nrow(nhis))
> dnhis2<-svydesign(id = ~psupseud + segnum, strata = ~stratum + subtype,
+                   weight = ~wtfa, data = nhis,
+                   fpc = ~selfrep+zero)
Warning message:
FPC implies population larger than ten billion. in: as.fpc(fpc, strata, ids)

We can now compute means for a few variables to compare the two sets of estimates. The multistage design gives slightly smaller standard errors. It is also distinctly slower to run (7.5s vs 1.6s, still not bad for a national-scale survey).

> svymean(~height+weight+nacute+ncond, dnhis1, na.rm=TRUE)
             mean     SE
height  67.306288 0.0253
weight 161.112444 0.2017
nacute   0.063934 0.0015
ncond    0.604453 0.0083
>
> svymean(~height+weight+nacute+ncond, dnhis2, na.rm=TRUE)
             mean     SE
height  67.306288 0.0235
weight 161.112444 0.1879
nacute   0.063934 0.0014
ncond    0.604453 0.0069

Thomas Lumley

Last modified: Tue May 10 15:40:27 PDT 2005