Programming with survey designs

Programming with survey objects

Extracting information model.frame() extracts variables from a survey design, weights() extracts the weights.

> data(fpc)
> withfpc <- svydesign(ids = ~psuid, strata = ~stratid, 
+     fpc = ~Nh, variables = ~x, data = fpc, nest = TRUE)
> model.frame(withfpc)
    x
1 2.8
2 4.1
3 6.8
4 6.8
5 9.2
6 3.7
7 6.6
8 4.2
> weights(withfpc)
1 2 3 4 5 6 7 8 
3 3 3 3 3 4 4 4 
>

Replicate weights for new estimators Replicate weight variances can be computed for arbitrary statistics using withReplicates. The statistic can be specified either as a function of two arguments, a data frame and a weights, or as an expression where the sampling weights are represented by .weights. As an example, using the dclus1 design we have used in several earlier examples, we have two ways to compute a ratio.

> withReplicates(rclus1, quote(sum(.weights*api.stu)/sum(.weights*enroll)))
       theta     SE
[1,] 0.84971 0.0095
>  withReplicates(rclus1, function(w,data) sum(w*data$api.stu)/sum(w*data$enroll))
       theta     SE
[1,] 0.84971 0.0095

Delta method: svycontrast will compute nonlinear functions of estimates, using symbolic differentiation to calculate the necessary derivatives for the delta method. This can be used to compute new summary statistics of categorical data. For example svykappa computes marginal and joint proportions

       mean     SE
a   0.27322 0.0303
b   0.72678 0.0303
A   0.12568 0.0204
B   0.87432 0.0204
a.A 0.12568 0.0204
b.A 0.00000 0.0000
a.B 0.14754 0.0171
b.B 0.72678 0.0303

and then constructs an expression (a.A + b.B - (a * A + b * B))/(1 - (a * A + b * B)) and evaluates it with svycontrast.

M-estimation The svymle function allows estimation by maximizing a user-specified additive objective function, with a regression model for any parameters of the objective function. For example, suppose we want a censored lognormal regression model. The objective function is a vectorized function to compute the loglikelihood of each observation. The input x is a two-column matrix with first column the observed value and second column the censoring estimator.

  lcens<-function(x,mean,sd){
      ifelse(x[,2]==1,
            dnorm(log(x[,1]),mean,sd,log=TRUE),
            pnorm(log(x[,1]),mean,sd,log=TRUE,lower.tail=FALSE)
            )
  }

Taylor linearization variances also require the derivative of the objective function

  gcens<- function(x,mean,sd){
        dz<- -dnorm(log(x[,1]),mean,sd)/pnorm(log(x[,1]),mean,sd,lower.tail=FALSE)
        dm<-ifelse(x[,2]==1,
                    2*(log(x[,1]) - mean)/(2*sd^2),
                    dz*-1/sd)
        ds<-ifelse(x[,2]==1,
                   (log(x[,1])-mean)^2*(2*(2 * sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi)),
                   ds<- dz*-(log(x[,1])-mean)/(sd*sd))
        cbind(dm,ds)      
  }

We use a data set from the "survival" package

data(pbc, package="survival")
biasmodel<-glm(I(trt>0)~age*edema,data=pbc)
pbc$randprob<-fitted(biasmodel)
dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema, data=subset(pbc,trt>0))

The loglikelihood function has two parameters: the mean and sd of log(x). We specify a regression model for each parameter. The mean depends on three variables

mean = I(cbind(time, status)) ~ bili + protime +  alb

and the standard deviation is just a constant

sd = ~1

It's necessary to specify starting values for the optimization. We start the mean model with an intercept of 10 and the standard deviation at 1.

start = list(c(10, 0, 0, 0), c(1))

Putting this all together

> summary(svymle(loglike = lcens, gradient = gcens, design = dpbc, 
+     formulas = list(mean = I(cbind(time, status)) ~ bili + protime + 
+         alb, sd = ~1), start = list(c(10, 0, 0, 0), c(1))))
Survey-sampled mle: 
svymle(loglike = lcens, gradient = gcens, design = dpbc, formulas = list(mean = I(cbind(time, 
    status)) ~ bili + protime + alb, sd = ~1), start = list(c(10, 
    0, 0, 0), c(1)))
                       Coef         SE p.value
mean.(Intercept)  8.9803185 1.27851656  <0.001
mean.bili        -0.0927236 0.01648777  <0.001
mean.protime     -0.3689851 0.09495636  <0.001
mean.alb          0.9535612 0.17204268  <0.001
sd.(Intercept)    0.9888172 0.07561830  <0.001
Stratified Independent Sampling design
svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(pbc, 
    trt > 0))