Specifying a survey design

Survey designs are specified using the svydesign function. The main arguments to the function are id to specify sampling units (PSUs and optionally later stages), strata to specify strata, weights to specify sampling weights, and fpc to specify finite population corrections. These arguments should be given as formulas, referring to columns in a data frame given as the data argument.

The resulting survey design object contains all the data and meta-data needed for analysis, and will be supplied as an argument to analysis functions.

The survey package contains several subsamples from the California Academic Performance Index, in the api data set. First, we load the package and these data:
  library(survey)
  data(api)
The apistrat data frame is a stratified independent sample:
dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
The sample is stratified on stype, with sampling weights pw. The fpc variable contains the population size for the stratum. As the schools are sampled independently, each record in the data frame is a separate PSU. This is indicated by id=~1. Since the sampling weights could have been determined from the population size, an equivalent declaration would be:
dstrat <- svydesign(id=~1,strata=~stype, data=apistrat, fpc=~fpc)
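
As a quick check of the equivalence between the two stratified declarations, the following sketch builds both and compares the weights stored in each design, then passes one design to an analysis function. The names dstrat_w, dstrat_fpc and same_weights are illustrative only, and the comparison uses a loose tolerance on the assumption that pw may be stored in rounded form.

```r
library(survey)
data(api)

# Declaration with explicit sampling weights
dstrat_w <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
# Declaration that leaves svydesign to derive the weights from the
# stratum population sizes in fpc
dstrat_fpc <- svydesign(id=~1, strata=~stype, data=apistrat, fpc=~fpc)

# Compare the weights held by the two design objects
same_weights <- isTRUE(all.equal(as.numeric(weights(dstrat_w)),
                                 as.numeric(weights(dstrat_fpc)),
                                 tolerance=1e-4))
print(same_weights)

# The design object is what analysis functions take as an argument
print(svymean(~api00, dstrat_fpc))
```

If same_weights is TRUE, either declaration can be used interchangeably in later analyses.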

The apiclus1 data frame is a cluster sample: all schools in a random sample of school districts.

dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
There is no strata argument as the sampling was not stratified. The variable dnum identifies school districts (PSUs) and is specified as the id argument. Again, the weights argument is optional, as the sampling weights can be computed from the population size. To specify sampling with replacement, simply omit the fpc argument:
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1)
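
To see what omitting fpc changes in practice, this sketch estimates the same mean under both cluster-sample declarations. The object names (dclus1_fpc, dclus1_wr, m_fpc, m_wr) are illustrative. Since both designs use the same weights, the point estimates should agree, and only the standard error should differ, with the with-replacement version being the larger of the two.

```r
library(survey)
data(api)

# Cluster sample with a finite population correction
dclus1_fpc <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
# Same sample treated as if districts were sampled with replacement
dclus1_wr <- svydesign(id=~dnum, weights=~pw, data=apiclus1)

m_fpc <- svymean(~api00, dclus1_fpc)
m_wr <- svymean(~api00, dclus1_wr)

# Print estimates and standard errors side by side
print(m_fpc)
print(m_wr)
```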

A design may have both strata and clusters. In that case svydesign assumes that the clusters are numbered uniquely across the entire sample, rather than just within a stratum. This enables some kinds of data errors to be detected. If your clusters are only numbered uniquely within a stratum, use the option nest=TRUE to specify this and disable the checking.
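
The effect of nest=TRUE can be illustrated with a small made-up data frame. Everything in toy (the column names stratum, psu, w, y and their values) is hypothetical and is not part of the api data; the PSU labels 1 and 2 are deliberately reused in both strata.

```r
library(survey)

# Hypothetical toy data: PSU labels 1 and 2 are reused in each stratum
toy <- data.frame(
  stratum = rep(c("a", "b"), each=4),
  psu = rep(c(1, 1, 2, 2), times=2),
  w = 10,
  y = c(3, 5, 4, 6, 8, 7, 9, 10)
)

# Without nest=TRUE the reused labels should be caught by the check;
# the error message is captured as a string rather than stopping the script
unnested <- tryCatch(
  svydesign(id=~psu, strata=~stratum, weights=~w, data=toy),
  error=function(e) conditionMessage(e)
)
print(unnested)

# With nest=TRUE, PSU labels are interpreted within each stratum
dnest <- svydesign(id=~psu, strata=~stratum, weights=~w, data=toy, nest=TRUE)
m <- svymean(~y, dnest)
print(m)
```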

The apiclus2 data set contains a two-stage cluster sample. First, school districts were sampled. Then, within each sampled district, all schools were taken if there were fewer than five; otherwise a random sample of five was taken.

dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
The multistage nature of the sampling is clear in the id and fpc arguments. At the first stage the sampling units are identified by dnum and the population size by fpc1. At the second stage, units within each school district are identified by snum and the number of units within the district by fpc2. When no finite population correction is given, sampling is treated as with replacement, and only the first stage of the design is needed. The following two declarations are equivalent for treating the two-stage cluster design as if the first stage were with replacement:
dclus2wr <- svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
dclus2wr2 <- svydesign(id=~dnum, weights=~pw, data=apiclus2)
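
One way to confirm that these two declarations behave the same is to compare an estimate and its standard error under each. This is a minimal sketch; the helper names m1, m2, same_estimate and same_se are illustrative.

```r
library(survey)
data(api)

# Two-stage ids without fpc, and first-stage ids only
dclus2wr <- svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
dclus2wr2 <- svydesign(id=~dnum, weights=~pw, data=apiclus2)

m1 <- svymean(~api00, dclus2wr)
m2 <- svymean(~api00, dclus2wr2)

# Compare point estimates and standard errors from the two designs
same_estimate <- isTRUE(all.equal(as.numeric(coef(m1)), as.numeric(coef(m2))))
same_se <- isTRUE(all.equal(as.numeric(SE(m1)), as.numeric(SE(m2))))
print(c(estimate=same_estimate, se=same_se))
```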

Thomas Lumley
Last modified: Thu May 19 09:08:46 PDT 2005