In a two-phase design a sample is taken from a population and a subsample taken from the sample, typically stratified by variables not known for the whole population. The second phase can use any design supported for single-phase sampling. The first phase must currently be one-stage element or cluster sampling

twophase(id, strata = NULL, probs = NULL, weights = NULL, fpc = NULL,
subset, data, method=c("full","approx","simple"))
twophasevar(x,design)
twophase2var(x,design)

Arguments

id

list of two formulas for sampling unit identifiers

strata

list of two formulas (or NULLs) for stratum identifies

probs

list of two formulas (or NULLs) for sampling probabilities

weights

Only for method="approx", list of two formulas (or NULLs) for sampling weights

fpc

list of two formulas (or NULLs) for finite population corrections

subset

formula specifying which observations are selected in phase 2

data

Data frame will all data for phase 1 and 2

method

"full" requires (much) more memory, but gives unbiased variance estimates for general multistage designs at both phases. "simple" or "approx" uses the standard error calculation from version 3.14 and earlier, which uses much less memory and is correct for designs with simple random sampling at phase one and stratified random sampling at phase two.

x

probability-weighted estimating functions

design

two-phase design

Details

The population for the second phase is the first-phase sample. If the second phase sample uses stratified (multistage cluster) sampling without replacement and all the stratum and sampling unit identifier variables are available for the whole first-phase sample it is possible to estimate the sampling probabilities/weights and the finite population correction. These would then be specified as NULL.

Two-phase case-control and case-cohort studies in biostatistics will typically have simple random sampling with replacement as the first stage. Variances given here may differ slightly from those in the biostatistics literature where a model-based estimator of the first-stage variance would typically be used.

Variance computations are based on the conditioning argument in Section 9.3 of Sarndal et al. Method "full" corresponds exactly to the formulas in that reference. Method "simple" or "approx" (the two are the same) uses less time and memory but is exact only for some special cases. The most important special case is the two-phase epidemiologic designs where phase 1 is simple random sampling from an infinite population and phase 2 is stratified random sampling. See the tests directory for a worked example. The only disadvantage of method="simple" in these cases is that standardization of margins (marginpred) is not available.

For method="full", sampling probabilities must be available for each stage of sampling, within each phase. For multistage sampling this requires specifying either fpc or probs as a formula with a term for each stage of sampling. If no fpc or probs are specified at phase 1 it is treated as simple random sampling from an infinite population, and population totals will not be correctly estimated, but means, quantiles, and regression models will be correct.

Value

twophase returns an object of class twophase2 (for

method="full") or twophase. The structure of

twophase2 objects may change as unnecessary components are removed.

twophase2var and twophasevar return a variance matrix with an attribute containing the separate phase 1 and phase 2 contributions to the variance.

References

Sarndal CE, Swensson B, Wretman J (1992) "Model Assisted Survey Sampling" Springer.

Breslow NE and Chatterjee N, Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. "Applied Statistics" 48:457-68, 1999

Breslow N, Lumley T, Ballantyne CM, Chambless LE, Kulick M. (2009) Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Statistics in Biosciences. doi 10.1007/s12561-009-9001-6

Lin, DY and Ying, Z (1993). Cox regression with incomplete covariate measurements. "Journal of the American Statistical Association" 88: 1341-1349.

See also

svydesign, svyrecvar for multi*stage* sampling

calibrate for calibration (GREG) estimators.

estWeights for two-phase designs for missing data.

The "epi" and "phase1" vignettes for examples and technical details.

Examples

 ## two-phase simple random sampling.
 data(pbc, package="survival")
 pbc$randomized<-with(pbc, !is.na(trt) & trt>0)
 pbc$id<-1:nrow(pbc)
 d2pbc<-twophase(id=list(~id,~id), data=pbc, subset=~randomized)
 svymean(~bili, d2pbc)
#>        mean     SE
#> bili 3.2561 0.2562

 ## two-stage sampling as two-phase
 data(mu284)
 ii<-with(mu284, c(1:15, rep(1:5,n2[1:5]-3)))
 mu284.1<-mu284[ii,]
 mu284.1$id<-1:nrow(mu284.1)
 mu284.1$sub<-rep(c(TRUE,FALSE),c(15,34-15))
 dmu284<-svydesign(id=~id1+id2,fpc=~n1+n2, data=mu284)
 ## first phase cluster sample, second phase stratified within cluster
 d2mu284<-twophase(id=list(~id1,~id),strata=list(NULL,~id1),
                     fpc=list(~n1,NULL),data=mu284.1,subset=~sub)
 svytotal(~y1, dmu284)
#>    total     SE
#> y1 15080 2274.3
 svytotal(~y1, d2mu284)
#>    total     SE
#> y1 15080 2274.3
 svymean(~y1, dmu284)
#>      mean     SE
#> y1 44.353 2.2737
 svymean(~y1, d2mu284)
#>      mean     SE
#> y1 44.353 2.2737

 ## case-cohort design: this example requires R 2.2.0 or later
 library("survival")
 data(nwtco)

 ## stratified on case status
 dcchs<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~rel),
         subset=~I(in.subcohort | rel), data=nwtco)
 svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12), design=dcchs)
#> Call:
#> svycoxph(formula = Surv(edrel, rel) ~ factor(stage) + factor(histol) + 
#>     I(age/12), design = dcchs)
#> 
#>                    coef exp(coef) se(coef) robust se      z        p
#> factor(stage)2  0.69266   1.99902  0.22688   0.16279  4.255 2.09e-05
#> factor(stage)3  0.62685   1.87171  0.22873   0.16823  3.726 0.000194
#> factor(stage)4  1.29951   3.66751  0.25017   0.18898  6.877 6.13e-12
#> factor(histol)2 1.45829   4.29861  0.16844   0.14548 10.024  < 2e-16
#> I(age/12)       0.04609   1.04717  0.02732   0.02302  2.002 0.045233
#> 
#> Likelihood ratio test=  on 5 df, p=
#> n= 1154, number of events= 571 

 ## Using survival::cch 
 subcoh <- nwtco$in.subcohort
 selccoh <- with(nwtco, rel==1|subcoh==1)
 ccoh.data <- nwtco[selccoh,]
 ccoh.data$subcohort <- subcoh[selccoh]
 cch(Surv(edrel, rel) ~ factor(stage) + factor(histol) + I(age/12), data =ccoh.data,
        subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
#> Case-cohort analysis,x$method, LinYing 
#>  with subcohort of 668 from cohort of 4028 
#> 
#> Call: cch(formula = Surv(edrel, rel) ~ factor(stage) + factor(histol) + 
#>     I(age/12), data = ccoh.data, subcoh = ~subcohort, id = ~seqno, 
#>     cohort.size = 4028, method = "LinYing")
#> 
#> Coefficients:
#>                      Value         SE         Z            p
#> factor(stage)2  0.69265646 0.16287906  4.252581 2.113204e-05
#> factor(stage)3  0.62685179 0.16746144  3.743260 1.816478e-04
#> factor(stage)4  1.29951229 0.18973707  6.849016 7.436052e-12
#> factor(histol)2 1.45829267 0.14429553 10.106291 0.000000e+00
#> I(age/12)       0.04608972 0.02230861  2.066006 3.882790e-02


 ## two-phase case-control
 ## Similar to Breslow & Chatterjee, Applied Statistics (1999) but with
 ## a slightly different version of the data set
 
 nwtco$incc2<-as.logical(with(nwtco, ifelse(rel | instit==2,1,rbinom(nrow(nwtco),1,.1))))
 dccs2<-twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,instit)),
    data=nwtco, subset=~incc2)
 dccs8<-twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,stage,instit)),
    data=nwtco, subset=~incc2)
 summary(glm(rel~factor(stage)*factor(histol),data=nwtco,family=binomial()))
#> 
#> Call:
#> glm(formula = rel ~ factor(stage) * factor(histol), family = binomial(), 
#>     data = nwtco)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.5829  -0.5243  -0.5230  -0.3626   2.3472  
#> 
#> Coefficients:
#>                                Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)                     -2.6890     0.1077 -24.958  < 2e-16 ***
#> factor(stage)2                   0.7687     0.1457   5.274 1.34e-07 ***
#> factor(stage)3                   0.7741     0.1508   5.133 2.86e-07 ***
#> factor(stage)4                   1.0422     0.1748   5.964 2.46e-09 ***
#> factor(histol)2                  1.2928     0.2480   5.213 1.86e-07 ***
#> factor(stage)2:factor(histol)2   0.1737     0.3255   0.534  0.59362    
#> factor(stage)3:factor(histol)2   0.6503     0.3175   2.048  0.04056 *  
#> factor(stage)4:factor(histol)2   1.2703     0.3879   3.275  0.00106 ** 
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 3288.0  on 4027  degrees of freedom
#> Residual deviance: 2924.3  on 4020  degrees of freedom
#> AIC: 2940.3
#> 
#> Number of Fisher Scoring iterations: 5
#> 
 summary(svyglm(rel~factor(stage)*factor(histol),design=dccs2,family=quasibinomial()))
#> 
#> Call:
#> svyglm(formula = rel ~ factor(stage) * factor(histol), design = dccs2, 
#>     family = quasibinomial())
#> 
#> Survey design:
#> twophase2(id = id, strata = strata, probs = probs, fpc = fpc, 
#>     subset = subset, data = data)
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     -2.6542     0.1251 -21.218  < 2e-16 ***
#> factor(stage)2                   0.7001     0.1960   3.572 0.000369 ***
#> factor(stage)3                   0.7116     0.2038   3.492 0.000498 ***
#> factor(stage)4                   1.0321     0.2515   4.104 4.34e-05 ***
#> factor(histol)2                  1.1427     0.3170   3.605 0.000326 ***
#> factor(stage)2:factor(histol)2   0.4313     0.4408   0.978 0.328054    
#> factor(stage)3:factor(histol)2   0.7362     0.4256   1.730 0.083981 .  
#> factor(stage)4:factor(histol)2   1.8265     0.4927   3.707 0.000220 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for quasibinomial family taken to be 1.000873)
#> 
#> Number of Fisher Scoring iterations: 4
#> 
 summary(svyglm(rel~factor(stage)*factor(histol),design=dccs8,family=quasibinomial()))
#> 
#> Call:
#> svyglm(formula = rel ~ factor(stage) * factor(histol), design = dccs8, 
#>     family = quasibinomial())
#> 
#> Survey design:
#> twophase2(id = id, strata = strata, probs = probs, fpc = fpc, 
#>     subset = subset, data = data)
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     -2.6789     0.1089 -24.594  < 2e-16 ***
#> factor(stage)2                   0.7515     0.1474   5.097 4.04e-07 ***
#> factor(stage)3                   0.7727     0.1528   5.057 4.98e-07 ***
#> factor(stage)4                   1.0109     0.1753   5.767 1.04e-08 ***
#> factor(histol)2                  1.1565     0.3170   3.648 0.000277 ***
#> factor(stage)2:factor(histol)2   0.3986     0.4328   0.921 0.357243    
#> factor(stage)3:factor(histol)2   0.6956     0.4156   1.674 0.094455 .  
#> factor(stage)4:factor(histol)2   1.8586     0.4651   3.996 6.85e-05 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for quasibinomial family taken to be 1.000873)
#> 
#> Number of Fisher Scoring iterations: 4
#> 

 ## Stratification on stage is really post-stratification, so we should use calibrate()
 gccs8<-calibrate(dccs2, phase=2, formula=~interaction(rel,stage,instit))
 summary(svyglm(rel~factor(stage)*factor(histol),design=gccs8,family=quasibinomial()))
#> 
#> Call:
#> svyglm(formula = rel ~ factor(stage) * factor(histol), design = gccs8, 
#>     family = quasibinomial())
#> 
#> Survey design:
#> calibrate(dccs2, phase = 2, formula = ~interaction(rel, stage, 
#>     instit))
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     -2.6789     0.1090 -24.577  < 2e-16 ***
#> factor(stage)2                   0.7515     0.1474   5.099 4.00e-07 ***
#> factor(stage)3                   0.7727     0.1527   5.061 4.87e-07 ***
#> factor(stage)4                   1.0109     0.1758   5.752 1.13e-08 ***
#> factor(histol)2                  1.1565     0.3168   3.651 0.000273 ***
#> factor(stage)2:factor(histol)2   0.3986     0.4321   0.922 0.356551    
#> factor(stage)3:factor(histol)2   0.6956     0.4149   1.676 0.093922 .  
#> factor(stage)4:factor(histol)2   1.8586     0.4650   3.997 6.84e-05 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for quasibinomial family taken to be 1.000873)
#> 
#> Number of Fisher Scoring iterations: 4
#> 

 ## For this saturated model calibration is equivalent to estimating weights.
 pccs8<-calibrate(dccs2, phase=2,formula=~interaction(rel,stage,instit), calfun="rrz")
 summary(svyglm(rel~factor(stage)*factor(histol),design=pccs8,family=quasibinomial()))
#> 
#> Call:
#> svyglm(formula = rel ~ factor(stage) * factor(histol), design = pccs8, 
#>     family = quasibinomial())
#> 
#> Survey design:
#> calibrate(dccs2, phase = 2, formula = ~interaction(rel, stage, 
#>     instit), calfun = "rrz")
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     -2.6789     0.1090 -24.577  < 2e-16 ***
#> factor(stage)2                   0.7515     0.1474   5.099 4.00e-07 ***
#> factor(stage)3                   0.7727     0.1527   5.061 4.87e-07 ***
#> factor(stage)4                   1.0109     0.1758   5.752 1.13e-08 ***
#> factor(histol)2                  1.1565     0.3168   3.651 0.000273 ***
#> factor(stage)2:factor(histol)2   0.3986     0.4321   0.922 0.356551    
#> factor(stage)3:factor(histol)2   0.6956     0.4149   1.676 0.093922 .  
#> factor(stage)4:factor(histol)2   1.8586     0.4650   3.997 6.84e-05 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for quasibinomial family taken to be 1.000873)
#> 
#> Number of Fisher Scoring iterations: 4
#> 

 ## Since sampling is SRS at phase 1 and stratified RS at phase 2, we
 ## can use method="simple" to save memory.
 dccs8_simple<-twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,stage,instit)),
    data=nwtco, subset=~incc2,method="simple")
 summary(svyglm(rel~factor(stage)*factor(histol),design=dccs8_simple,family=quasibinomial()))
#> 
#> Call:
#> svyglm(formula = rel ~ factor(stage) * factor(histol), design = dccs8_simple, 
#>     family = quasibinomial())
#> 
#> Survey design:
#> twophase(id = list(~seqno, ~seqno), strata = list(NULL, ~interaction(rel, 
#>     stage, instit)), data = nwtco, subset = ~incc2, method = "simple")
#> 
#> Coefficients:
#>                                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                     -2.6789     0.1089 -24.591  < 2e-16 ***
#> factor(stage)2                   0.7515     0.1474   5.097 4.06e-07 ***
#> factor(stage)3                   0.7727     0.1528   5.056 5.00e-07 ***
#> factor(stage)4                   1.0109     0.1753   5.766 1.05e-08 ***
#> factor(histol)2                  1.1565     0.3171   3.648 0.000277 ***
#> factor(stage)2:factor(histol)2   0.3986     0.4328   0.921 0.357276    
#> factor(stage)3:factor(histol)2   0.6956     0.4156   1.674 0.094479 .  
#> factor(stage)4:factor(histol)2   1.8586     0.4651   3.996 6.86e-05 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for quasibinomial family taken to be 1.000873)
#> 
#> Number of Fisher Scoring iterations: 4
#>