Estimated weights for missing data

Creates or adjusts a two-phase survey design object using a logistic regression model for the second-phase sampling probability. It is particularly useful for reweighting to account for missing data.

estWeights(data,formula,...)
# S3 method for twophase
estWeights(data,formula=NULL, working.model=NULL,...)
# S3 method for data.frame
estWeights(data,formula=NULL, working.model=NULL,
      subset=NULL, strata=NULL,...)

Arguments

data: twophase design object or data frame
formula: Predictors for estimating weights
working.model: Model fitted to the complete (i.e. phase 1) data
subset: Subset of the data frame with complete data (i.e. phase 2). If NULL, use all complete cases
strata: Stratification (if any) of phase 2 sampling
...: for future expansion

Details

If data is a data frame, estWeights first creates a two-phase design object. The strata argument is used only to compute finite population corrections; the same variables must also be included in formula to compute stratified sampling probabilities.

With a two-phase design object, estWeights estimates the sampling probabilities using logistic regression as described by Robins et al. (1994) and adds information to the object so that correct sandwich standard errors can be computed.
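
The weighting idea in the paragraph above can be illustrated with base R alone. This is only a conceptual sketch, not the internals of estWeights: it assumes the airquality data, treats rows with a non-missing Ozone as the phase-two sample, and uses illustrative names (aq, observed, pmodel, phat, w, fit_w) that are not part of the package. It reproduces the inverse-probability weights but not the sandwich variance adjustment, so the standard errors that lm would report for this fit should not be relied on.

```r
data(airquality)
aq <- airquality

# Indicator for rows that would form the phase-two (complete) sample
aq$observed <- !is.na(aq$Ozone)

# Logistic regression for the probability of being observed,
# using predictors that are available for every row
pmodel <- glm(observed ~ Temp + Wind + Month,
              family = binomial(), data = aq)

# Estimated sampling probabilities and inverse-probability weights
# for the observed rows only
phat <- fitted(pmodel)
w <- 1 / phat[aq$observed]

# Weighted fit of the model of interest on the observed rows;
# only the point estimates are meaningful here
fit_w <- lm(log(Ozone) ~ Temp + Wind,
            data = aq[aq$observed, ], weights = w)
coef(fit_w)
```

The point estimates from this sketch can be compared with those from svyglm on a design created by estWeights with the same formula; the standard errors need the extra information that estWeights stores in the design object.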

An alternative to specifying formula is to specify working.model. The estimating functions from this model will be used as predictors of the sampling probabilities, which will increase efficiency to the extent that the working model and the model of interest estimate the same parameters (Kulich & Lin 2004).

The effect on a two-phase design object is very similar to that of calibrate, and is identical when formula specifies a saturated model.

Value

A two-phase survey design object.

References

Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M (2009). Using the Whole Cohort in the Analysis of Case-Cohort Data. American Journal of Epidemiology, 169(11), 1398-1405.

Robins JM, Rotnitzky A, Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.

Kulich M, Lin DY (2004). Improving the Efficiency of Relative-Risk Estimation in Case-Cohort Studies. Journal of the American Statistical Association, 99, 832-844.

Lumley T, Shaw PA, Dai JY (2011). Connections between survey calibration estimators and semiparametric models for incomplete data. International Statistical Review, 79, 200-220 (with discussion 79, 221-232).

Examples

library(survey)
data(airquality)

## ignoring missingness, using model-based standard error
summary(lm(log(Ozone)~Temp+Wind, data=airquality))
#> 
#> Call:
#> lm(formula = log(Ozone) ~ Temp + Wind, data = airquality)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.34415 -0.25774  0.03003  0.35048  1.18640 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.531932   0.608901  -0.874  0.38419    
#> Temp         0.057384   0.006455   8.889 1.13e-14 ***
#> Wind        -0.052534   0.017128  -3.067  0.00271 ** 
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 0.5644 on 113 degrees of freedom
#>   (37 observations deleted due to missingness)
#> Multiple R-squared:  0.5821,	Adjusted R-squared:  0.5747 
#> F-statistic: 78.71 on 2 and 113 DF,  p-value: < 2.2e-16
#> 

## Without covariates to predict missingness we get
## same point estimates, but different (sandwich) standard errors
daq<-estWeights(airquality, formula=~1,subset=~I(!is.na(Ozone)))
summary(svyglm(log(Ozone)~Temp+Wind,design=daq))
#> 
#> Call:
#> svyglm(formula = log(Ozone) ~ Temp + Wind, design = daq)
#> 
#> Survey design:
#> estWeights(airquality, formula = ~1, subset = ~I(!is.na(Ozone)))
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.531932   0.831535  -0.640   0.5237    
#> Temp         0.057384   0.008432   6.806 5.07e-10 ***
#> Wind        -0.052534   0.020279  -2.591   0.0108 *  
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for gaussian family taken to be 0.3130098)
#> 
#> Number of Fisher Scoring iterations: 2
#> 

## Reweighting based on weather, month
d2aq<-estWeights(airquality, formula=~Temp+Wind+Month,
                 subset=~I(!is.na(Ozone)))
summary(svyglm(log(Ozone)~Temp+Wind,design=d2aq))
#> 
#> Call:
#> svyglm(formula = log(Ozone) ~ Temp + Wind, design = d2aq)
#> 
#> Survey design:
#> estWeights(airquality, formula = ~Temp + Wind + Month, subset = ~I(!is.na(Ozone)))
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.577759   0.832906  -0.694   0.4893    
#> Temp         0.057689   0.008394   6.872 3.65e-10 ***
#> Wind        -0.048750   0.020524  -2.375   0.0192 *  
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> (Dispersion parameter for gaussian family taken to be 0.323215)
#> 
#> Number of Fisher Scoring iterations: 2
#>
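
The claim that an intercept-only weight model leaves the point estimates unchanged, while the standard errors differ, can be checked programmatically rather than by reading the printed summaries. The following is a small sketch that assumes the survey package is installed; the object names fit_lm and fit_svy are illustrative only.

```r
library(survey)
data(airquality)

# Complete-case fit with model-based standard errors
fit_lm <- lm(log(Ozone) ~ Temp + Wind, data = airquality)

# Same model on a design whose weights come from an intercept-only model
daq <- estWeights(airquality, formula = ~1, subset = ~I(!is.na(Ozone)))
fit_svy <- svyglm(log(Ozone) ~ Temp + Wind, design = daq)

# Point estimates agree up to numerical tolerance
all.equal(unname(coef(fit_lm)), unname(coef(fit_svy)), tolerance = 1e-6)

# Model-based and sandwich standard errors side by side
cbind(model_based = sqrt(diag(vcov(fit_lm))),
      sandwich    = sqrt(diag(vcov(fit_svy))))
```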
