R: Survey sample analysis.

svydesign {survey}

R Documentation

Survey sample analysis.

Description

Specify a complex survey design.

Usage

svydesign(ids, probs = NULL, strata = NULL, variables = NULL, fpc = NULL,
    data = NULL, nest = FALSE, check.strata = !nest, weights = NULL,
    pps = FALSE, ...)
## Default S3 method:
svydesign(ids, probs = NULL, strata = NULL, variables = NULL, fpc = NULL,
    data = NULL, nest = FALSE, check.strata = !nest, weights = NULL,
    pps = FALSE, variance = c("HT", "YG"), ...)
## S3 method for class 'imputationList':
svydesign(ids, probs = NULL, strata = NULL, variables = NULL,
    fpc = NULL, data, nest = FALSE, check.strata = !nest, weights = NULL,
    pps = FALSE, ...)
## S3 method for class 'character':
svydesign(ids, probs = NULL, strata = NULL, variables = NULL,
    fpc = NULL, data, nest = FALSE, check.strata = !nest, weights = NULL,
    pps = FALSE, dbtype = "SQLite", dbname, ...)

Arguments

`ids`	Formula or data frame specifying cluster ids from largest level to smallest level. Use `~0` or `~1` as the formula when there are no clusters.
`probs`	Formula or data frame specifying cluster sampling probabilities
`strata`	Formula or vector specifying strata, use `NULL` for no strata
`variables`	Formula or data frame specifying the variables measured in the survey. If `NULL`, the `data` argument is used.
`fpc`	Finite population correction: see Details below
`weights`	Formula or vector specifying sampling weights as an alternative to `probs`
`data`	Data frame to look up variables in the formula arguments, or database table name, or `imputationList` object, see below
`nest`	If `TRUE`, relabel cluster ids to enforce nesting within strata
`check.strata`	If `TRUE`, check that clusters are nested in strata
`pps`	`"brewer"` to use Brewer's approximation for PPS sampling without replacement. `"overton"` to use Overton's approximation. An object of class `HR` to use the Hartley-Rao approximation. An object of class `ppsmat` to use the Horvitz-Thompson estimator.
`dbtype`	name of database driver to pass to `dbDriver`
`dbname`	name of database (e.g. file name for SQLite)
`variance`	For `pps` without replacement, use `variance="YG"` for the Yates-Grundy estimator instead of the Horvitz-Thompson estimator
`...`	for future expansion

Details

The svydesign object combines a data frame and all the survey design information needed to analyse it. These objects are used by the survey modelling and summary functions. The ids argument is always required; the strata, fpc, weights and probs arguments are optional. If any of these are specified, they must not contain missing values.

By default, svydesign assumes that all PSUs, even those in different strata, have a unique value of the id variable. This allows some data errors to be detected. If your PSUs reuse the same identifiers across strata then set nest=TRUE.
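The effect of nest=TRUE can be seen on a small made-up data set. This is only a sketch: the data frame `toy` and its columns are hypothetical and exist purely to show PSU labels that repeat across strata.

```r
library(survey)

# Hypothetical toy data: two strata, each reusing the PSU labels 1 and 2.
toy <- data.frame(
  stratum = rep(c("A", "B"), each = 4),
  psu     = rep(c(1, 2, 1, 2), each = 2),
  y       = c(3, 5, 4, 6, 8, 7, 9, 10),
  w       = rep(10, 8)
)

# Without nest = TRUE the nesting check is expected to stop with an error,
# because PSU 1 (and PSU 2) appears in both strata.
unnested <- try(
  svydesign(ids = ~psu, strata = ~stratum, weights = ~w, data = toy),
  silent = TRUE
)

# nest = TRUE relabels the PSUs so that they are treated as distinct
# clusters within each stratum.
nested <- svydesign(ids = ~psu, strata = ~stratum, weights = ~w,
                    data = toy, nest = TRUE)
```

If the identifiers really are unique across strata, leaving nest at its default keeps the check active, so that accidental reuse of an id is reported rather than silently accepted.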

The finite population correction (fpc) is used to reduce the variance when a substantial fraction of the total population of interest has been sampled. It may not be appropriate if the target of inference is the process generating the data rather than the statistics of a particular finite population.

The finite population correction can be specified either as the total population size in each stratum or as the fraction of the total population that has been sampled. In either case the relevant population size is counted in sampling units. For example, sampling 100 units from a population stratum of size 500 can be specified as 500 or as 100/500=0.2. The exception is PPS sampling without replacement, where the sampling probability (which will be different for each PSU) must be used.
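As a rough illustration of the two ways of giving the correction, the sketch below builds the stratified API design twice, once with population sizes and once with sampling fractions, and compares the standard errors. The column `frac` is a hypothetical helper added here; it is not part of the `apistrat` data.

```r
library(survey)
data(api)

# Per-stratum sample sizes, repeated for each row of the sample.
n_h <- ave(rep(1, nrow(apistrat)), apistrat$stype, FUN = sum)

# Hypothetical helper column: the sampled fraction n_h / N_h in each stratum.
apistrat$frac <- n_h / apistrat$fpc

# fpc given as stratum population sizes.
d_size <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# fpc given as sampling fractions.
d_frac <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~frac, data = apistrat)

# The two specifications should lead to the same standard error.
se_size <- SE(svymean(~api00, d_size))
se_frac <- SE(svymean(~api00, d_frac))
```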

If population sizes are specified but not sampling probabilities or weights, the sampling probabilities will be computed from the population sizes assuming simple random sampling within strata.
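A short sketch of this behaviour, using the stratified API sample: no weights or probabilities are supplied, so the weights are expected to equal the stratum population size divided by the stratum sample size. The names `n_h` and `expected` are introduced here for illustration only.

```r
library(survey)
data(api)

# Only population sizes are given: no weights, no probs.
d <- svydesign(ids = ~1, strata = ~stype, fpc = ~fpc, data = apistrat)

# Under simple random sampling within strata the weight is N_h / n_h.
n_h <- ave(rep(1, nrow(apistrat)), apistrat$stype, FUN = sum)
expected <- apistrat$fpc / n_h

# Weights derived by svydesign from the population sizes.
derived <- as.numeric(weights(d))
```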

For multistage sampling the ids argument should specify a formula with the cluster identifiers at each stage. If subsequent stages are stratified, strata should also be specified as a formula with stratum identifiers at each stage. The population size for each level of sampling should also be specified in fpc. If fpc is not specified then sampling is assumed to be with replacement at the top level and only the first stage of clustering is used in computing variances. If fpc is specified but for fewer stages than ids, sampling is assumed to be complete for subsequent stages. The variance calculations for multistage sampling assume simple or stratified random sampling within clusters at each stage except possibly the last.
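The statement that only the first stage matters when fpc is omitted can be checked on the two-stage API sample. The sketch below reuses the weights from the full two-stage design and compares a with-replacement design that names both stages against one that names only the districts; the object names are illustrative.

```r
library(survey)
data(api)

# Two-stage design with population sizes at both stages.
d_two <- svydesign(ids = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)
w <- weights(d_two)

# No fpc: sampling is treated as with replacement at the top level.
d_wr2 <- svydesign(ids = ~dnum + snum, weights = w, data = apiclus2)
d_wr1 <- svydesign(ids = ~dnum, weights = w, data = apiclus2)

# Both with-replacement designs should report the same standard error,
# because only the first stage of clustering enters the variance.
se_wr2 <- SE(svymean(~api00, d_wr2))
se_wr1 <- SE(svymean(~api00, d_wr1))
```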

For PPS sampling without replacement it is necessary to specify the probabilities for each stage of sampling using the fpc argument, and an overall weights argument should not be given. At the moment, multistage or stratified PPS sampling without replacement is supported only with pps="brewer", or by giving the full joint probability matrix using ppsmat. (Cluster sampling is supported by all methods, but subsampling within clusters is not.)
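The sketch below sets up the same PPS sample with Brewer's approximation and with the full joint probability matrix. It assumes the `election_jointprob` matrix that is loaded by `data(election)` alongside `election_pps`. The point estimates should agree across these designs, since they share the same sampling probabilities; only the variance estimators differ.

```r
library(survey)
data(election)

# Brewer's approximation: only the per-PSU probabilities are needed.
d_brewer <- svydesign(ids = ~1, fpc = ~p, data = election_pps,
                      pps = "brewer")

# Full joint probability matrix: Horvitz-Thompson variance estimator.
d_ht <- svydesign(ids = ~1, fpc = ~p, data = election_pps,
                  pps = ppsmat(election_jointprob))

# Same matrix, Yates-Grundy variance estimator.
d_yg <- svydesign(ids = ~1, fpc = ~p, data = election_pps,
                  pps = ppsmat(election_jointprob), variance = "YG")

t_brewer <- svytotal(~Bush, d_brewer)
t_ht     <- svytotal(~Bush, d_ht)
t_yg     <- svytotal(~Bush, d_yg)
```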

The dim, "[", "[<-" and na.action methods for survey.design objects operate on the data frame specified by variables and ensure that the design information is properly updated to correspond to the new data frame. With the "[<-" method the new value can be a survey.design object instead of a data frame, but only the data frame is used. See also subset.survey.design for a simple way to select subpopulations.
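A brief sketch of these methods on the stratified API design: row indexing with "[" and subset should select the same schools, and dim reports the size of the remaining data. The object names are illustrative.

```r
library(survey)
data(api)

dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# Select elementary schools by row index ...
is_elem <- apistrat$stype == "E"
d_idx <- dstrat[is_elem, ]

# ... or with subset(), which evaluates the condition inside the design.
d_sub <- subset(dstrat, stype == "E")

# dim() reports rows and columns of the data held by the design.
dim(d_idx)

m_idx <- svymean(~api00, d_idx)
m_sub <- svymean(~api00, d_sub)
```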

The model.frame method extracts the observed data.

If strata with only one PSU are not self-representing (or they are, but svydesign cannot tell based on fpc) then the handling of these strata for variance computation is determined by options("survey.lonely.psu"). See svyCprod for details.
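As a minimal sketch, the made-up data below have a stratum with a single PSU and no fpc, so the stratum cannot be recognised as self-representing. The data frame `toy` is hypothetical, and "adjust" is used as one of the settings described in svyCprod.

```r
library(survey)

# Hypothetical toy data: stratum "B" contains a single PSU.
toy <- data.frame(
  stratum = c("A", "A", "A", "B"),
  y       = c(1, 2, 4, 3),
  w       = c(5, 5, 5, 2)
)

d <- svydesign(ids = ~1, strata = ~stratum, weights = ~w, data = toy)

# With the default setting the variance computation is expected to fail
# for the single-PSU stratum, so the call is wrapped in try().
default_try <- try(svymean(~y, d), silent = TRUE)

# Switch the handling of lonely PSUs, compute, then restore the old option.
old <- options(survey.lonely.psu = "adjust")
m_adjust <- svymean(~y, d)
options(old)
```

Restoring the previous option value afterwards keeps the change from leaking into later analyses in the same session.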

data may be a character string giving the name of a table or view in a relational database that can be accessed through the DBI or ODBC interfaces. For DBI interfaces dbtype should be the name of the database driver and dbname should be the name by which the driver identifies the specific database (e.g. file name for SQLite). For ODBC databases dbtype should be "ODBC" and dbname should be the registered DSN for the database. On the Windows GUI, dbname="" will produce a dialog box for interactive selection.

The appropriate database interface package must already be loaded (e.g. RSQLite for SQLite, RODBC for ODBC). The survey design object will contain only the design meta-data, and actual variables will be loaded from the database as needed. Use close to close the database connection and open to reopen the connection, e.g. after loading a saved object.

The database interface does not attempt to modify the underlying database and so can be used with read-only permissions on the database.

If data is an imputationList object (from the "mitools" package), svydesign will return a svyimputationList object containing a set of designs. Use with.svyimputationList to do analyses on these designs and MIcombine to combine the results.
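The workflow can be sketched as follows. The two "imputed" data sets here are fabricated for illustration by adding random noise to `api00` in copies of `apistrat`; a real analysis would use data sets produced by an actual imputation procedure.

```r
library(survey)
library(mitools)
data(api)

# Two hypothetical "imputed" copies of the stratified sample.
# The noise stands in for imputation variability and is NOT real imputation.
set.seed(1)
imp1 <- apistrat
imp2 <- apistrat
imp1$api00 <- apistrat$api00 + rnorm(nrow(apistrat), sd = 5)
imp2$api00 <- apistrat$api00 + rnorm(nrow(apistrat), sd = 5)

imps <- imputationList(list(imp1, imp2))

# One design per imputed data set, bundled in a svyimputationList.
designs <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                     fpc = ~fpc, data = imps)

# Run the same analysis on every design, then combine the results.
results  <- with(designs, svymean(~api00))
combined <- MIcombine(results)
summary(combined)
```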

Value

An object of class survey.design.

Author(s)

Thomas Lumley

Examples

data(api)
# stratified sample
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
# one-stage cluster sample
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
# two-stage cluster sample: weights computed from population sizes.
dclus2<-svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)

## multistage sampling has no effect when fpc is not given, so
## these are equivalent.
dclus2wr<-svydesign(id=~dnum+snum, weights=weights(dclus2), data=apiclus2)
dclus2wr2<-svydesign(id=~dnum, weights=weights(dclus2), data=apiclus2)

## syntax for stratified cluster sample
## (though the data weren't really sampled this way)
svydesign(id=~dnum, strata=~stype, weights=~pw, data=apistrat,
    nest=TRUE)

## PPS sampling without replacement
data(election)
dpps<- svydesign(id=~1, fpc=~p, data=election_pps, pps="brewer")

## database example: requires RSQLite
## Not run: 
library(RSQLite)
dbclus1<-svydesign(id=~dnum, weights=~pw, fpc=~fpc,
    data="apiclus1", dbtype="SQLite", dbname=system.file("api.db",package="survey"))

## End(Not run)

[Package survey version 3.19 Index]