Simple summary statistics

To demonstrate the calculation of simple summary statistics I will use the dclus1 and rclus1 survey objects created in earlier examples.

Some information about the designs is provided by the print and summary methods:
 > dclus1
 1 - level Cluster Sampling design
 With (15) clusters.
 svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

 > summary(rclus1) 
 Survey with replicate weights:
 Call: as.svrepdesign(dclus1)
 Unstratified cluster jacknife (JK1) with 15 replicates.
 Variables:
  [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"
  [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"
 [13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"
 [19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"
 [25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
 [31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"
 [37] "api.stu"  "fpc"      "pw"

All the analysis functions take a survey design object as one of the arguments, and use a model formula to specify variables for analysis. First look at svymean.
 > svymean(~api00, dclus1)
         mean     SE
 api00 644.17 23.542
This asks for the mean (and standard error of the mean) for the variable api00, the year 2000 Academic Performance Index. We can ask for means of more than one variable at a time:
 > svymean(~api00+api99+stype, dclus1)
              mean      SE
 api00  644.169399 23.5422
 api99  606.978142 24.2250
 stypeE   0.786885  0.0463
 stypeH   0.076503  0.0268
 stypeM   0.136612  0.0296
Here we have the means of 1999 and 2000 API and school type (elementary, middle, high). Note that for the factor variable stype the proportion in each category is reported.

The same syntax is used to compute means from the survey object rclus1, but here the standard errors are computed from the jackknife replicate weights and are slightly different.
> svymean(~api00+api99+stype, rclus1)
             mean      SE
api00  644.169399 26.3294
api99  606.978142 26.9985
stypeE   0.786885  0.0514
stypeH   0.076503  0.0278
stypeM   0.136612  0.0332

Totals are estimated with svytotal. Here we estimate the total number of students enrolled, and the total number of schools by type, across the California population.
> svytotal(~enroll+stype, dclus1)
            total        SE
enroll 3404940.13 932235.03
stypeE    4873.97   1333.32
stypeH     473.86    158.70
stypeM     846.17    167.55
Note again that totals for factor variables are interpreted as total numbers in each category.

Again, the syntax is the same for the rclus1 object that incorporates replicate weights:

> svytotal(~enroll+stype, rclus1)
            total        SE
enroll 3404940.13 932235.03
stypeE    4873.97   1333.32
stypeH     473.86    158.70
stypeM     846.17    167.55

The functions for totals and means can also report the design effect, with the option deff=TRUE

> svytotal(~enroll+stype, dclus1, deff=TRUE)
            total         SE    DEff
enroll 3404940.13  932235.03 31.4827
stypeE    4873.97    1333.32 52.1047
stypeH     473.86     158.70  1.7521
stypeM     846.17     167.55  1.1698
though this is not currently available with replicate weights.

Ratio estimates are computed with svyratio. This has two formula arguments, specifying one or more numerator variables and one or more denominator variables. In this example we estimate the proportion of students who took the API test from the number who took the test and the number enrolled.
 > svyratio(~api.stu,~enroll, dclus1)
 Ratio estimator: svyratio.survey.design2(~api.stu, ~enroll, dclus1)
 Ratios=
            enroll
 api.stu 0.8497087
 SEs=
              enroll
 api.stu 0.008386297
Again, the syntax is the same for a version with replicate weights

The population variance is estimated with svyvar, with a similar syntax to means and totals
 > svyvar(~api00+api99, rclus1)
       variance     SE
 api00    11122 1729.2
 api99    12666 1655.3

Quantiles are a more difficult estimation problem. It is easy enough to find a point estimate, but many confidence interval methods fail. There are two confidence interval calculation methods for quantiles in objects created with svydesign. The default method is substantially faster but probably less accurate for tail quantiles or for small data sets.

In addition to specifying the variables and the design object, it is necessary to specify which quantiles to estimate. Here we estimate the median and quartiles.
 >  svyquantile(~api00, dclus1, c(.25,.5,.75), ci=TRUE)
 $quantiles
         0.25 0.5  0.75
 api00 551.75 652 717.5

 $CIs
 , , api00
 
            0.25      0.5     0.75
 (lower 493.2835 564.3250 696.0000
 upper) 622.6495 710.8375 761.1355

 > dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
 > (qapi<-svyquantile(~api00, dclus1, c(.25,.5,.75),ci=TRUE, interval.type="score")))
 $quantiles
         0.25 0.5  0.75
 api00 551.75 652 717.5
 
 $CIs
 , , api00
 
            0.25      0.5     0.75
 (lower 514.9998 581.9996 669.0003
 upper) 632.0005 699.9997 749.0001

 > SE(qapi)
     0.25      0.5     0.75
 29.84766 30.10264 20.40850
The confidence intervals are not symmetric and so cannot be generated by adding and subtracting 1.96 standard errors. Nonetheless, a reasonable estimate of the standard error is the length of the confidence interval divided by (2x1.96).

The syntax for replicate weights is similar. Again, there are two methods of variance estimation. The default is valid for all types of replicate weights and is based on computing a confidence interval for the probability and transforming it. The alternative, directly using the variance of replicates, is not valid for jackknife weights.


Thomas Lumley
Last modified: Thu May 19 09:05:05 PDT 2005