Stratification within PSUs
The National Health Interview Survey (NHIS) 1987-1994 used a
multistage sampling design with stratification within PSUs. Data are
available for 1992 from NCHS
and for 1991 from the Survey Documentation
and Analysis site at Berkeley.
NCHS gives guidelines
on how to analyze these data using SUDAAN. We will do the same
analyses in R. We load the "survey" package and the data (downloaded
from SDA). Since there are some second-stage strata with only one unit
sampled we set options(survey.lonely.psu) to permit this.
> library(survey)
> options(survey.lonely.psu="remove")
> nhis<-read.dta("nhis.dta")
The analysis recommended for packages other than SUDAAN uses pseudo-stratum and pseudo-psu variables that we must create before creating the design object.
> nhis$cstratum<-nhis$psupseud %/% 10
> nhis$cpsu<-nhis$psupseud %% 10
>
> dnhis1<-svydesign(id = ~cpsu, strata = ~cstratum, weight = ~wtfa,
+ data = nhis,nest=TRUE)
For the full multistage design object we need to create variables that
describe the finite population corrections. Data are not provided on
population sizes, so we pretend that the non-certainty PSUs are
sampled with replacement. We could also use a small number such as
0.01 rather than 0. Since we are specifying the sampling weights
separately this would not have much impact. Of course, if we were
computing the sampling weights from the population sizes it would be
critical to get them right.
> nhis$selfrep<-ifelse(nhis$psutype =="MSA- self-representing",1,0)
> nhis$zero<-rep(0,nrow(nhis))
> dnhis2<-svydesign(id = ~psupseud + segnum, strata = ~stratum + subtype,
+ weight = ~wtfa, data = nhis,
+ fpc = ~selfrep+zero)
Warning message:
FPC implies population larger than ten billion. in: as.fpc(fpc, strata, ids)
We can now compute means for a few variables to compare the two sets
of estimates. The multistage design gives slightly smaller standard
errors. It is also distinctly slower to run (7.5s vs 1.6s, still not
bad for a national-scale survey).
> svymean(~height+weight+nacute+ncond, dnhis1, na.rm=TRUE)
mean SE
height 67.306288 0.0253
weight 161.112444 0.2017
nacute 0.063934 0.0015
ncond 0.604453 0.0083
>
> svymean(~height+weight+nacute+ncond, dnhis2, na.rm=TRUE)
mean SE
height 67.306288 0.0235
weight 161.112444 0.1879
nacute 0.063934 0.0014
ncond 0.604453 0.0069
Thomas Lumley
Last modified: Tue May 10 15:40:27 PDT 2005