Post-stratification is done with the postStratify function, for survey designs with or without replicate weights. This example of post-stratification is based on one at UCLA Academic Technology Services, analysing data from Sampling of Populations by Levy and Lemeshow.
First a survey design object is created> dpets <- svydesign(id = ~1, weight = ~weight, data = pets, fpc = ~n)Post-stratification requires data on the population distribution of the stratifying variable. This can be supplied in a data frame or a table, here a data frame is used. The other arguments to postStratify are the survey design object and a model formula specifying the stratifying variables.
> pspets <- postStratify(dpets, ~type, data.frame(type = c(1, 2), Freq = c(850, 450)))The poststratified survey design object can be used in the same way as any other survey design object.For example, to show the precision gained by post-stratification:
> svytotal(~totexp,dpets) total SE totexp 51643 2888.3 > svytotal(~totexp, pspets) total SE totexp 52150 1512.5Raking is a way to approximate post-stratification on a set of variables when only their marginal population distributions are known. Here we work with a one-stage cluster sample from the California Academic Performance Index, with the survey design object rclus1 created in an earlier example. First define data frames giving the population totals for schools by type (elementary, middle, high) and by whether they met their school-wide growth target.
pop.types <- data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018)) pop.schwide <- data.frame(sch.wide=c("No","Yes"), Freq=c(1072,5122))Now apply these to rclus1. The first argument to rake is the survey design object. The second is a list of formulas specifying the variables to rake. The third argument is a list of data frames or tables giving the population distributions of each variable.
rclus1r <- rake(rclus1, list(~stype,~sch.wide), list(pop.types, pop.schwide))Raking can also be used when the joint distribution of some variables is known. Suppose we know the joint population distribution of school type and the "comparative improvement" target
pop.imptype<-with(apipop, table(comp.imp, stype))We can then call rake specifying two margins: sch.wide and the combination of comp.imp and stype
dclus1r <- rake(dclus1, list(~comp.imp+stype,~sch.wide), list(pop.imptype, pop.schwide))Calibration, or GREG estimation, allows continuous as well as discrete auxiliary variables. The method is motivated by regression estimation of a total but can be computed simply by reweighting, using the calibrate() function. The method is described for estimation of a total in Särndal et al Model Assisted Survey Ssampling; we use the same approach to estimate the total of estimating functions.
This example uses the same dclus1 survey design as above. First we calibrate to the school type variable. We supply a formula ~stype and must specify population totals for the design matrix produced by this formula.
pop.totals<-c(`(Intercept)`=6194, stypeH=755, stypeM=1018) dclus1g<-calibrate(dclus1, ~stype, pop.totals)We could also have given the formula as ~stype-1, with no intercept, in which case the population totals would have been simply the three stratum sizes.
Calibration to a single factor variable is equivalent to post-stratification, as we see from these examples:
> svymean(~api00, dclus1g) mean SE api00 642.31 23.921 > svytotal(~enroll, dclus1g) total SE enroll 3680893 406293 > svytotal(~stype, dclus1g) total SE stypeE 4421 2.308e-14 stypeH 755 5.985e-15 stypeM 1018 5.616e-14We can calibrate using the two variables stype and sch.wide, as we did with raking above.
> dclus1g2 <- calibrate(dclus1, ~stype+sch.wide, c(pop.totals, sch.wideYes=5122)) > svymean(~api00, dclus1g2) mean SE api00 641 23.829 > svytotal(~enroll, dclus1g2) total SE enroll 3654414 403074 > svytotal(~stype, dclus1g2) total SE stypeE 4421 4.704e-14 stypeH 755 2.921e-15 stypeM 1018 3.821e-14The results of calibration and raking are similar, but not identical.
Finally, we can look at a calibration example with a continuous variable. Imagine that the 1999 API is known for all schools but the 2000 API is known only for the sample. Since there is a high correlation between 1999 and 2000 API (see the graphics example for a scatterplot), calibrating to 1999 API reduces the uncertainty in mean 2000 API considerably.
> svymean(~api00, dclus1g3) mean SE api00 665.31 3.4418 > svytotal(~enroll, dclus1g3) total SE enroll 3638487 385524 > svytotal(~stype, dclus1g3) total SE stypeE 4421 4.376e-14 stypeH 755 2.369e-14 stypeM 1018 2.140e-14The reduction in uncertainty is less impressive for a subpopulation, as knowing the population total does not give complete information about the subpopulation total:
> svymean(~api00, subset(dclus1g3, comp.imp=="Yes")) mean SE api00 672.05 6.5182
These examples have used a calibration working model with constant variance. Using a model with variance proportional to the covariate it is possible to express the standard ratio estimator of a total as a calibration estimator
> dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, + data = apistrat, fpc = ~fpc) > ratio <- svyratio(~api.stu, ~enroll, dstrat, separate = FALSE) > predict(ratio, total=3811472) $total enroll api.stu 3190038 $se enroll api.stu 29565.98 > dstratg <- calibrate(dstrat, ~enroll-1, pop=3811472, lambda=1) > svytotal(~api.stu, dstratg) total SE api.stu 3190038 29566The parameter lambda specifies a linear combination of the predictors that is proportional to the variance. With a single predictor any non-NULL value is equivalent.
Two refinements of regression calibration are the ability to impose bounds ont he calibration weights and to require that the weights are constant within sampling units.
In the example above, the changes in weights needed for calibration were substantial
> range(weights(dclus1g3)/weights(dclus1)) [1] 0.4185925 1.8332949The original sampling weights were decreased by as much as 60% or increased by over 80%. We can attempt a bounded regression calibration, restricting the calibration weights to be positive and less than 1.5
> (dclus1g3b <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069), bounds=c(0,1.5))) 1 - level Cluster Sampling design With (15) clusters. calibrate(dclus1, ~stype + api99, c(pop.totals, api99 = 3914069), bounds = c(0, 1.5)) > range(weights(dclus1g3b)/weights(dclus1)) [1] 0.4970466 1.5000000The bounds cannot be specified arbitrarily. Attempting a lower bound of 0.5 instead of 0 fails.
> (dclus1g3b <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069), bounds=c(0.5,1.5))) Error in calibrate.survey.design2(dclus1, ~stype + api99, c(pop.totals, : Failed to achieve bounds in 11 iterations
Another useful restriction is to make weights constant within sampling units. For example, in a survey that samples households and collects data on all individuals in a household the true probability of being sampled must be identical for all individuals in the household and it is desirable for the sampling weight to also be identical.
In the one-stage cluster sample we have been using, all schools in a random sample of school districts are used, so the weights should be equal within school districts. Specifying aggregage.stage=1
requests that weights are constant within sampling units at the first stage of sampling.
> (dclus1agg<-calibrate(dclus1, ~stype, pop.totals, aggregate.stage=1)) 1 - level Cluster Sampling design With (15) clusters. calibrate(dclus1, ~stype, pop.totals, aggregate.stage = 1)The extra restrictions on weights will tend to mean that larger weights are needed to achieve calibration
> range(weights(dclus1agg)/weights(dclus1)) [1] 0.3234162 2.2828519 > range(weights(dclus1g)/weights(dclus1)) [1] 0.907064 1.593304For designs specified by replicate weights the sampling units may be unknown. Rather than specifying
aggregate.stage
, specify a
formula or vector of unit identifiers as the
aggregate.index
argument.
> (rclus1agg3 <- calibrate(rclus1, ~stype+api99, c(pop.totals,api99=3914069), aggregate.index=~dnum)) Call: calibrate(rclus1, ~stype + api99, c(pop.totals, api99 = 3914069), aggregate.index = ~dnum) Unstratified cluster jacknife (JK1) with 15 replicates.Generalized raking
calibrate has an argument calfun to specify the calibration function: linear, raking ratio, logit. An number of European agencies prefer logit calibration for producing fewer extreme weights
For example, we can perform logit calibration of the one-stage cluster sample:
> dclus1g3d <- calibrate(dclus1, ~stype + api99, c(pop.totals, + api99 = 3914069), calfun = "logit", bounds = c(0.5, 2.5)) > summary(weights(dclus1g3d)/weights(dclus1)) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.5944 0.7487 0.9473 1.0000 1.1740 1.9360