calibrate.Rd
Calibration, generalized raking, or GREG estimators generalise post-stratification and
raking by calibrating a sample to the marginal totals of
variables in a linear regression model. This function reweights the
survey design and adds additional information that is used by
svyrecvar
to reduce the estimated standard errors.
calibrate(design,...)
# S3 method for survey.design2
calibrate(design, formula, population,
aggregate.stage=NULL, stage=0, variance=NULL,
bounds=c(-Inf,Inf), calfun=c("linear","raking","logit"),
maxit=50,epsilon=1e-7,verbose=FALSE,force=FALSE,trim=NULL,
bounds.const=FALSE, sparse=FALSE,...)
# S3 method for svyrep.design
calibrate(design, formula, population,compress=NA,
aggregate.index=NULL, variance=NULL, bounds=c(-Inf,Inf),
calfun=c("linear","raking","logit"),
maxit=50, epsilon=1e-7, verbose=FALSE,force=FALSE,trim=NULL,
bounds.const=FALSE, sparse=FALSE,...)
# S3 method for twophase
calibrate(design, phase=2,formula, population,
calfun=c("linear","raking","logit","rrz"),...)
grake(mm,ww,calfun,eta=rep(0,NCOL(mm)),bounds,population,epsilon,
verbose,maxit,variance=NULL)
cal_names(formula,design,...)
Survey design object
Model formula for calibration model, or list of formulas for each margin
Vectors of population column totals for the model matrix in the calibration model, or list of such vectors for each cluster, or list of tables for each margin. Required except for two-phase designs
compress the resulting replicate weights if
TRUE
or if NA
and weights were previously compressed
See Details below
Coefficients for variance in calibration model (heteroskedasticity parameters) (see Details below)
An integer. If not NULL
, make calibration weights
constant within sampling units at this stage.
A vector or one-sided formula. If not NULL
, make calibration weights
constant within levels of this variable
Bounds for the calibration weights, optional
except for calfun="logit"
Should be TRUE
if bounds
have been spcified as constant values rather than multiplicative values
Weights outside this range will be trimmed to these bounds.
Options for other methods
Calibration function: see below
Number of iterations
Tolerance in matching population total. Either a single
number or a vector of the same length as population
Print lots of uninteresting information
Return an answer even if the specified accuracy was not achieved
Phase of a two-phase design to calibrate (only
phase=2
currently implemented.)
Model matrix
Vector of weights
Starting values for iteration
Use sparse matrices for faster computation
The formula
argument specifies a model matrix, and the
population
argument is the population column sums of this
matrix. The function cal_names
shows what the column names of
this model matrix will be.
For the important special case where the calibration totals are (possibly
overlapping) marginal tables of factor variables, as in classical
raking, the formula
and population
arguments may be
lists in the same format as the input to rake
.
If the population
argument has a names attribute it will be
checked against the names produced by model.matrix(formula)
and
reordered if necessary. This protects against situations where the
(locale-dependent) ordering of factor levels is not what you expected.
Numerical instabilities may result if the sampling weights in the
design
object are wrong by multiple orders of magnitude. The
code now attempts to rescale the weights first, but it is better for
the user to ensure that the scale is reasonable.
The calibrate
function implements linear, bounded linear,
raking, bounded raking, and logit calibration functions. All except
unbounded linear calibration use the Newton-Raphson algorithm
described by Deville et al (1993). This algorithm is exposed for other
uses in the grake
function. Unbounded linear calibration uses
an algorithm that is less sensitive to collinearity. The calibration
function may be specified as a string naming one of the three built-in
functions or as an object of class calfun
, allowing
user-defined functions. See make.calfun
for details.
The bounds
argument can be specified as global upper and lower bounds e.g
bounds=c(0.5, 2)
or as a list with lower and upper vectors e.g.
bounds=list(lower=lower, upper=upper)
. This allows for individual
boundary constraints for each unit. The lower and upper vectors must be
the same length as the input data. The bounds can be specified as multiplicative
values or constant values. If constant, bounds.const
must be set to TRUE
.
Calibration with bounds, or on highly collinear data, may fail. If
force=TRUE
the approximately calibrated design object will
still be returned (useful for examining why it failed). A failure in
calibrating a set of replicate weights when the sampling weights were
successfully calibrated will give only a warning, not an error.
When calibration to the desired set of bounds is not possible, another option is
to trim weights. To do this set bounds
to a looser set of bounds
for which calibration is achievable and set trim
to the tighter
bounds. Weights outside the bounds will be trimmed to the bounds, and
the excess weight distributed over other observations in proportion to
their sampling weight (and so this may put some other observations
slightly over the trimming bounds). The projection matrix used in computing
standard errors is based on the feasible bounds specified by the
bounds
argument. See also trimWeights
,
which trims the final weights in a design object rather than the
calibration adjustments.
For two-phase designs calfun="rrz"
estimates the sampling
probabilities using logistic regression as described by Robins et al
(1994). estWeights
will do the same thing.
Calibration may result in observations within the last-stage sampling
units having unequal weight even though they necessarily are sampled
together. Specifying aggegrate.stage
ensures that the
calibration weight adjustments are constant within sampling units at
the specified stage; if the original sampling weights were equal the
final weights will also be equal. The algorithm is as described by
Vanderhoeft (2001, section III.D). Specifying aggregate.index
does the same thing for replicate weight designs; a warning will be
given if the original weights are not constant within levels of
aggregate.index
.
In a model with two-stage sampling, population totals may be available
for the PSUs actually sampled, but not for the whole population. In
this situation, calibrating within each PSU reduces with second-stage
contribution to variance. This generalizes to multistage sampling.
The stage
argument specifies which stage of sampling the totals
refer to. Stage 0 is full population totals, stage 1 is totals for
PSUs, and so on. The default, stage=NULL
is interpreted as
stage 0 when a single population vector is supplied and stage 1 when a
list is supplied. Calibrating to PSU totals will fail (with a message
about an exactly singular matrix) for PSUs that have fewer
observations than the number of calibration variables.
The variance in the calibration model may depend on covariates. If variance=NULL
the
calibration model has constant variance. If variance
is not NULL
it specifies a linear combination of the columns of the model matrix
and the calibration variance is proportional to that linear combination.
Alternatively variance
can be specified as a vector of values the
same length as the input data specifying a heteroskedasticity parameter
for each unit.
The design matrix specified by formula (after any aggregation) must be of full rank, with one exception. If the population total for a column is zero and all the observations are zero the column will be ignored. This allows the use of factors where the population happens to have no observations at some level.
In a two-phase design, population
may be omitted when
phase=2
, to specify calibration to the phase-one sample. If the
two-phase design object was constructed using the more memory-efficient
method="approx"
argument to twophase
, calibration of the first
phase of sampling to the population is not supported.
A survey design object.
Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M. Using the whole cohort in the analysis of case-cohort data. Am J Epidemiol. 2009;169(11):1398-1405. doi:10.1093/aje/kwp055
Deville J-C, Sarndal C-E, Sautory O (1993) Generalized Raking Procedures in Survey Sampling. JASA 88:1013-1020
Kalton G, Flores-Cervantes I (2003) "Weighting methods" J Official Stat 19(2) 81-97
Lumley T, Shaw PA, Dai JY (2011) "Connections between survey calibration estimators and semiparametric models for incomplete data" International Statistical Review. 79:200-220. (with discussion 79:221-232)
Sarndal C-E, Swensson B, Wretman J. "Model Assisted Survey Sampling". Springer. 1991.
Rao JNK, Yung W, Hidiroglou MA (2002) Estimating equations for the analysis of survey data using poststratification information. Sankhya 64 Series A Part 2, 364-378.
Robins JM, Rotnitzky A, Zhao LP. (1994) Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.
Vanderhoeft C (2001) Generalized Calibration at Statistics Belgium. Statistics Belgium Working Paper No 3.
postStratify
, rake
for other ways
to use auxiliary information
twophase
and vignette("epi")
for an example of calibration in two-phase designs
survey/tests/kalton.R
for examples replicating those in Kalton & Flores-Cervantes (2003)
make.calfun
for user-defined calibration distances.
trimWeights
to trim final weights rather than calibration adjustments.
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
cal_names(~stype, dclus1)
#> [1] "(Intercept)" "stypeH" "stypeM"
pop.totals<-c(`(Intercept)`=6194, stypeH=755, stypeM=1018)
## For a single factor variable this is equivalent to
## postStratify
(dclus1g<-calibrate(dclus1, ~stype, pop.totals))
#> 1 - level Cluster Sampling design
#> With (15) clusters.
#> calibrate(dclus1, ~stype, pop.totals)
svymean(~api00, dclus1g)
#> mean SE
#> api00 642.31 23.92
svytotal(~enroll, dclus1g)
#> total SE
#> enroll 3680893 406293
svytotal(~stype, dclus1g)
#> total SE
#> stypeE 4421 0
#> stypeH 755 0
#> stypeM 1018 0
## Make weights constant within school district
(dclus1agg<-calibrate(dclus1, ~stype, pop.totals, aggregate=1))
#> 1 - level Cluster Sampling design
#> With (15) clusters.
#> calibrate(dclus1, ~stype, pop.totals, aggregate = 1)
svymean(~api00, dclus1agg)
#> mean SE
#> api00 640.31 25.71
svytotal(~enroll, dclus1agg)
#> total SE
#> enroll 3313035 268167
svytotal(~stype, dclus1agg)
#> total SE
#> stypeE 4421 0
#> stypeH 755 0
#> stypeM 1018 0
## Now add sch.wide
cal_names(~stype+sch.wide, dclus1)
#> [1] "(Intercept)" "stypeH" "stypeM" "sch.wideYes"
(dclus1g2 <- calibrate(dclus1, ~stype+sch.wide, c(pop.totals, sch.wideYes=5122)))
#> 1 - level Cluster Sampling design
#> With (15) clusters.
#> calibrate(dclus1, ~stype + sch.wide, c(pop.totals, sch.wideYes = 5122))
svymean(~api00, dclus1g2)
#> mean SE
#> api00 641 23.83
svytotal(~enroll, dclus1g2)
#> total SE
#> enroll 3654414 403074
svytotal(~stype, dclus1g2)
#> total SE
#> stypeE 4421 0
#> stypeH 755 0
#> stypeM 1018 0
## Finally, calibrate on 1999 API and school type
cal_names(~stype+api99, dclus1)
#> [1] "(Intercept)" "stypeH" "stypeM" "api99"
(dclus1g3 <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069)))
#> 1 - level Cluster Sampling design
#> With (15) clusters.
#> calibrate(dclus1, ~stype + api99, c(pop.totals, api99 = 3914069))
svymean(~api00, dclus1g3)
#> mean SE
#> api00 665.31 3.4418
svytotal(~enroll, dclus1g3)
#> total SE
#> enroll 3638487 385524
svytotal(~stype, dclus1g3)
#> total SE
#> stypeE 4421 0
#> stypeH 755 0
#> stypeM 1018 0
## Same syntax with replicate weights
rclus1<-as.svrepdesign(dclus1)
(rclus1g3 <- calibrate(rclus1, ~stype+api99, c(pop.totals, api99=3914069)))
#> Call: calibrate(rclus1, ~stype + api99, c(pop.totals, api99 = 3914069))
#> Unstratified cluster jacknife (JK1) with 15 replicates.
svymean(~api00, rclus1g3)
#> mean SE
#> api00 665.31 3.9429
svytotal(~enroll, rclus1g3)
#> total SE
#> enroll 3638487 478749
svytotal(~stype, rclus1g3)
#> total SE
#> stypeE 4421 0
#> stypeH 755 0
#> stypeM 1018 0
(rclus1agg3 <- calibrate(rclus1, ~stype+api99, c(pop.totals,api99=3914069), aggregate.index=~dnum))
#> Call: calibrate(rclus1, ~stype + api99, c(pop.totals, api99 = 3914069),
#> aggregate.index = ~dnum)
#> Unstratified cluster jacknife (JK1) with 15 replicates.
svymean(~api00, rclus1agg3)
#> mean SE
#> api00 666.83 6.5308
svytotal(~enroll, rclus1agg3)
#> total SE
#> enroll 3238699 438251
svytotal(~stype, rclus1agg3)
#> total SE
#> stypeE 4421 0
#> stypeH 755 0
#> stypeM 1018 0
###
## Bounded weights
range(weights(dclus1g3)/weights(dclus1))
#> [1] 0.4185925 1.8332949
dclus1g3b <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069),bounds=c(0.6,1.6))
range(weights(dclus1g3b)/weights(dclus1))
#> [1] 0.6 1.6
svymean(~api00, dclus1g3b)
#> mean SE
#> api00 665.48 3.4184
svytotal(~enroll, dclus1g3b)
#> total SE
#> enroll 3662213 378691
svytotal(~stype, dclus1g3b)
#> total SE
#> stypeE 4421 0
#> stypeH 755 0
#> stypeM 1018 0
## Individual boundary constraints as constant values
# the first weight will be bounded at 40, the rest free to move
bnds <- list(
lower = rep(-Inf, nrow(apiclus1)),
upper = c(40, rep(Inf, nrow(apiclus1)-1)))
head(weights(dclus1g3))
#> 1 2 3 4 5 6
#> 51.63274 27.03836 25.26785 34.86947 34.25660 35.14186
dclus1g3b1 <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069),
bounds=bnds, bounds.const=TRUE)
head(weights(dclus1g3b1))
#> 1 2 3 4 5 6
#> 40.00000 27.04817 25.28241 34.85830 34.24707 35.12996
svytotal(~api.stu, dclus1g3b1)
#> total SE
#> api.stu 3083274 329433
## trimming
dclus1tr <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069),
bounds=c(0.5,2), trim=c(2/3,3/2))
#> 37 weights were trimmed
svymean(~api00+api99+enroll, dclus1tr)
#> mean SE
#> api00 662.73 3.4388
#> api99 628.63 0.0000
#> enroll 583.20 57.5477
svytotal(~stype,dclus1tr)
#> total SE
#> stypeE 4496.71 0
#> stypeH 684.89 0
#> stypeM 1012.40 0
range(weights(dclus1tr)/weights(dclus1))
#> [1] 0.6666667 1.5000000
rclus1tr <- calibrate(rclus1, ~stype+api99, c(pop.totals, api99=3914069),
bounds=c(0.5,2), trim=c(2/3,3/2))
#> 37 weights were trimmed
#> Warning: Failed to converge: eps=0.0392952671757451 in 51 iterations
svymean(~api00+api99+enroll, rclus1tr)
#> mean SE
#> api00 662.73 7.2853
#> api99 628.63 5.5528
#> enroll 583.20 63.0674
svytotal(~stype,rclus1tr)
#> total SE
#> stypeE 4496.71 165.351
#> stypeH 684.89 168.518
#> stypeM 1012.40 56.543
## Input in the same format as rake() for classical raking
pop.table <- xtabs(~stype+sch.wide,apipop)
pop.table2 <- xtabs(~stype+comp.imp,apipop)
dclus1r<-rake(dclus1, list(~stype+sch.wide, ~stype+comp.imp),
list(pop.table, pop.table2))
gclus1r<-calibrate(dclus1, formula=list(~stype+sch.wide, ~stype+comp.imp),
population=list(pop.table, pop.table2),calfun="raking")
svymean(~api00+stype, dclus1r)
#> mean SE
#> api00 642.87442 22.249
#> stypeE 0.71376 0.000
#> stypeH 0.12189 0.000
#> stypeM 0.16435 0.000
svymean(~api00+stype, gclus1r)
#> mean SE
#> api00 642.86932 22.244
#> stypeE 0.71376 0.000
#> stypeH 0.12189 0.000
#> stypeM 0.16435 0.000
## generalised raking
dclus1g3c <- calibrate(dclus1, ~stype+api99, c(pop.totals,
api99=3914069), calfun="raking")
range(weights(dclus1g3c)/weights(dclus1))
#> [1] 0.5342314 1.9947612
(dclus1g3d <- calibrate(dclus1, ~stype+api99, c(pop.totals,
api99=3914069), calfun=cal.logit, bounds=c(0.5,2.5)))
#> 1 - level Cluster Sampling design
#> With (15) clusters.
#> calibrate(dclus1, ~stype + api99, c(pop.totals, api99 = 3914069),
#> calfun = cal.logit, bounds = c(0.5, 2.5))
range(weights(dclus1g3d)/weights(dclus1))
#> [1] 0.5943692 1.9358791
## Ratio estimators are calibration estimators
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
svytotal(~api.stu,dstrat)
#> total SE
#> api.stu 3086009 99477
common<-svyratio(~api.stu, ~enroll, dstrat, separate=FALSE)
predict(common, total=3811472)
#> $total
#> enroll
#> api.stu 3190038
#>
#> $se
#> enroll
#> api.stu 29565.98
#>
pop<-3811472
## equivalent to (common) ratio estimator
dstratg1<-calibrate(dstrat,~enroll-1, pop, variance=1)
svytotal(~api.stu, dstratg1)
#> total SE
#> api.stu 3190038 29566
# Alternatively specifying the heteroskedasticity parameters directly
dstratgh <- calibrate(dstrat,~enroll-1, pop, variance=apistrat$enroll)
svytotal(~api.stu, dstratgh)
#> total SE
#> api.stu 3190038 29566