twophase {survey}  R Documentation 
In a two-phase design a sample is taken from a population and a subsample taken from the sample, typically stratified by variables not known for the whole population. The second phase can use any design supported for single-phase sampling. The first phase must currently be one-stage element or cluster sampling.
twophase(id, strata = NULL, probs = NULL, weights = NULL, fpc = NULL,
         subset, data, method = c("full", "approx", "simple"))
twophasevar(x, design)
twophase2var(x, design)
id 
list of two formulas for sampling unit identifiers 
strata 
list of two formulas (or NULLs) for stratum identifiers 
probs 
list of two formulas (or NULLs) for sampling probabilities 
weights 
Only for method="approx", list of two formulas (or NULLs) for sampling weights 
fpc 
list of two formulas (or NULLs) for finite population corrections 
subset 
formula specifying which observations are selected in phase 2 
data 
Data frame with all data for phase 1 and 2 
method 
"full" requires (much) more memory, but gives unbiased
variance estimates for general multistage designs at both phases.
"simple" or "approx" uses the standard error calculation from
version 3.14 and earlier, which uses much less memory and is correct for designs with simple
random sampling at phase one and stratified random sampling at phase two.

x 
probability-weighted estimating functions 
design 
two-phase design 
The population for the second phase is the first-phase sample. If the
second-phase sample uses stratified (multistage cluster) sampling
without replacement and all the stratum and sampling unit identifier
variables are available for the whole first-phase sample, it is
possible to estimate the sampling probabilities/weights and the
finite population correction. These would then be specified as NULL.
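As an illustration of leaving the phase-2 probabilities as NULL, here is a minimal sketch on made-up data. The data frame `toy` and its columns `id`, `grp`, `y` and `in2` are hypothetical and not part of the package. Assuming the estimated phase-2 probabilities are the within-stratum sampling fractions, the estimated mean should agree with a stratified mean computed by hand.

```r
library(survey)

## Hypothetical toy data (not from the package): 200 phase-1 units with a
## stratum label known for all of them. For simplicity y is kept for every
## unit; in a real study it would be observed only in the subsample.
set.seed(1)
toy <- data.frame(id  = 1:200,
                  grp = factor(rep(c("a", "b"), c(150, 50))),
                  y   = rnorm(200, mean = 10))

## Phase 2: simple random samples of 30 units within each stratum
toy$in2 <- FALSE
toy$in2[c(sample(which(toy$grp == "a"), 30),
          sample(which(toy$grp == "b"), 30))] <- TRUE

## probs and fpc are left at their NULL defaults, so the phase-2 sampling
## probabilities are estimated from the stratum and unit identifiers.
## (A message about equal phase-1 probabilities may be printed.)
d <- twophase(id = list(~id, ~id), strata = list(NULL, ~grp),
              subset = ~in2, data = toy)
est <- svymean(~y, d)

## Manual check: stratum means weighted by the phase-1 stratum sizes
sub <- toy[toy$in2, ]
manual <- sum(tapply(sub$y, sub$grp, mean) * c(150, 50)) / 200
c(survey = as.numeric(coef(est)), manual = manual)
```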
Two-phase case-control and case-cohort studies in biostatistics will typically have simple random sampling with replacement as the first stage. Variances given here may differ slightly from those in the biostatistics literature where a model-based estimator of the first-stage variance would typically be used.
Variance computations are based on the conditioning argument in
Section 9.3 of Sarndal et al. Method "full" corresponds exactly
to the formulas in that reference. Method "simple" or "approx"
(the two are the same) uses less time and memory but
is exact only for some special cases. The most important special case
is the two-phase epidemiologic design in which phase 1 is simple random
sampling from an infinite population and phase 2 is stratified random
sampling. See the tests directory for a worked example. The
only disadvantage of method="simple" in these cases is that
standardization of margins (marginpred) is not available.
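The two methods can be compared side by side on a design of the kind described above (no phase-1 fpc, stratified random sampling at phase 2). This is a rough sketch on hypothetical data; the data frame `toy2` and its columns are made up for illustration. The point estimates should be identical because both methods use the same weights, and the standard errors are printed for comparison rather than asserted.

```r
library(survey)

## Hypothetical toy data: 300 phase-1 units, two strata known at phase 1
set.seed(2)
toy2 <- data.frame(id  = 1:300,
                   grp = factor(rep(c("case", "control"), c(60, 240))),
                   y   = rnorm(300))

## Phase 2: all 60 "case" units plus a random 60 of the "control" units
toy2$in2 <- toy2$grp == "case"
toy2$in2[sample(which(toy2$grp == "control"), 60)] <- TRUE

## Same design specification, differing only in 'method'
d_full   <- twophase(id = list(~id, ~id), strata = list(NULL, ~grp),
                     subset = ~in2, data = toy2, method = "full")
d_simple <- twophase(id = list(~id, ~id), strata = list(NULL, ~grp),
                     subset = ~in2, data = toy2, method = "simple")

m_full   <- svymean(~y, d_full)
m_simple <- svymean(~y, d_simple)

## Point estimates and standard errors from the two methods
rbind(full   = c(mean = as.numeric(coef(m_full)),   se = as.numeric(SE(m_full))),
      simple = c(mean = as.numeric(coef(m_simple)), se = as.numeric(SE(m_simple))))
```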
For method="full", sampling probabilities must be available for
each stage of sampling, within each phase. For multistage sampling
this requires specifying either fpc or probs as a
formula with a term for each stage of sampling. If no fpc or
probs are specified at phase 1 it is treated as simple random
sampling from an infinite population, and population totals will not
be correctly estimated, but means, quantiles, and regression models
will be correct.
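The effect of a phase-1 fpc on totals can be seen in a small sketch. Everything here is hypothetical: the data frame `toy3`, its columns, and the assumed population size of 1000 stored in column `N`. Assuming phase 1 is a simple random sample of 200 from those 1000 units, the estimated total should equal 1000 times the stratified mean, while the estimated mean does not depend on the fpc.

```r
library(survey)

## Hypothetical toy data: 200 phase-1 units drawn from a population whose
## size (1000) is an assumption made purely for this illustration.
set.seed(3)
toy3 <- data.frame(id  = 1:200,
                   grp = factor(rep(c("a", "b"), c(120, 80))),
                   y   = rnorm(200, mean = 5),
                   N   = 1000)

## Phase 2: 25 units sampled at random within each stratum
toy3$in2 <- FALSE
toy3$in2[c(sample(which(toy3$grp == "a"), 25),
           sample(which(toy3$grp == "b"), 25))] <- TRUE

## Phase-1 fpc given as a population size; phase-2 fpc left NULL
d_fpc <- twophase(id = list(~id, ~id), strata = list(NULL, ~grp),
                  fpc = list(~N, NULL), subset = ~in2, data = toy3)

tot <- svytotal(~y, d_fpc)
avg <- svymean(~y, d_fpc)

## Manual stratified mean, scaled up to the assumed population size
sub3 <- toy3[toy3$in2, ]
manual_mean <- sum(tapply(sub3$y, sub3$grp, mean) * c(120, 80)) / 200
c(total = as.numeric(coef(tot)), manual_total = 1000 * manual_mean)
```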
twophase returns an object of class twophase2 (for method="full")
or twophase. The structure of twophase2 objects may change as
unnecessary components are removed.
twophase2var and twophasevar return a variance matrix with an attribute
containing the separate phase 1 and phase 2 contributions to the variance.
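One way to look at the returned variance is through `vcov()` on an estimate, which is a sketch of ordinary usage rather than a call to the variance functions directly. The data frame `toy4` is hypothetical. The name of the attribute holding the per-phase contributions is not stated here, so the code only lists the attribute names instead of assuming one.

```r
library(survey)

## Hypothetical toy data, as in a simple two-phase stratified design
set.seed(4)
toy4 <- data.frame(id  = 1:150,
                   grp = factor(rep(c("a", "b", "c"), each = 50)),
                   y   = rnorm(150))
toy4$in2 <- FALSE
for (g in levels(toy4$grp)) {
  ## 20 phase-2 units sampled at random within each stratum
  toy4$in2[sample(which(toy4$grp == g), 20)] <- TRUE
}

d4 <- twophase(id = list(~id, ~id), strata = list(NULL, ~grp),
               subset = ~in2, data = toy4)
est4 <- svymean(~y, d4)

## Variance matrix of the estimate and the class of the design object
v <- vcov(est4)
class(d4)

## Inspect which attributes are attached to the variance matrix
names(attributes(v))
```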
Sarndal CE, Swensson B, Wretman J (1992) "Model Assisted Survey Sampling" Springer.
Breslow NE and Chatterjee N, Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. "Applied Statistics" 48:457-68, 1999
Breslow N, Lumley T, Ballantyne CM, Chambless LE, Kulick M. (2009) Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Statistics in Biosciences. doi 10.1007/s12561-009-9001-6
Lin, DY and Ying, Z (1993). Cox regression with incomplete covariate measurements. "Journal of the American Statistical Association" 88: 1341-1349.
svydesign, svyrecvar for multi*stage* sampling
calibrate for calibration (GREG) estimators.
estWeights for two-phase designs for missing data.
The "epi" and "phase1" vignettes for examples and technical details.
library(survey)

## two-phase simple random sampling.
data(pbc, package="survival")
pbc$randomized <- with(pbc, !is.na(trt) & trt>0)
pbc$id <- 1:nrow(pbc)
d2pbc <- twophase(id=list(~id,~id), data=pbc, subset=~randomized)
svymean(~bili, d2pbc)

## two-stage sampling as two-phase
data(mu284)
ii <- with(mu284, c(1:15, rep(1:5,n2[1:5]-3)))
mu284.1 <- mu284[ii,]
mu284.1$id <- 1:nrow(mu284.1)
mu284.1$sub <- rep(c(TRUE,FALSE),c(15,34-15))
dmu284 <- svydesign(id=~id1+id2,fpc=~n1+n2, data=mu284)
## first phase cluster sample, second phase stratified within cluster
d2mu284 <- twophase(id=list(~id1,~id),strata=list(NULL,~id1),
                    fpc=list(~n1,NULL),data=mu284.1,subset=~sub)
svytotal(~y1, dmu284)
svytotal(~y1, d2mu284)
svymean(~y1, dmu284)
svymean(~y1, d2mu284)

## case-cohort design: this example requires R 2.2.0 or later
library("survival")
data(nwtco)

## stratified on case status
dcchs <- twophase(id=list(~seqno,~seqno), strata=list(NULL,~rel),
                  subset=~I(in.subcohort | rel), data=nwtco)
svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12),
         design=dcchs)

## Using survival::cch
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
cch(Surv(edrel, rel) ~ factor(stage) + factor(histol) + I(age/12),
    data=ccoh.data, subcoh=~subcohort, id=~seqno,
    cohort.size=4028, method="LinYing")

## two-phase case-control
## Similar to Breslow & Chatterjee, Applied Statistics (1999) but with
## a slightly different version of the data set
nwtco$incc2 <- as.logical(with(nwtco, ifelse(rel | instit==2,1,rbinom(nrow(nwtco),1,.1))))
dccs2 <- twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,instit)),
                  data=nwtco, subset=~incc2)
dccs8 <- twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,stage,instit)),
                  data=nwtco, subset=~incc2)
summary(glm(rel~factor(stage)*factor(histol),data=nwtco,family=binomial()))
summary(svyglm(rel~factor(stage)*factor(histol),design=dccs2,family=quasibinomial()))
summary(svyglm(rel~factor(stage)*factor(histol),design=dccs8,family=quasibinomial()))

## Stratification on stage is really post-stratification, so we should use calibrate()
gccs8 <- calibrate(dccs2, phase=2, formula=~interaction(rel,stage,instit))
summary(svyglm(rel~factor(stage)*factor(histol),design=gccs8,family=quasibinomial()))

## For this saturated model calibration is equivalent to estimating weights.
pccs8 <- calibrate(dccs2, phase=2,formula=~interaction(rel,stage,instit), calfun="rrz")
summary(svyglm(rel~factor(stage)*factor(histol),design=pccs8,family=quasibinomial()))

## Since sampling is SRS at phase 1 and stratified RS at phase 2, we
## can use method="simple" to save memory.
dccs8_simple <- twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,stage,instit)),
                         data=nwtco, subset=~incc2,method="simple")
summary(svyglm(rel~factor(stage)*factor(histol),design=dccs8_simple,family=quasibinomial()))