Database-backed survey objects

The data= argument to svydesign can specify a table or view in a relational database. At the moment (version 3.9-1) the database must have an R interface based on DBI or ODBC. Only read access to the database is needed; no attempts are made to modify the underlying data

The DBI interface has only been tested using SQLite, but should work with the ROracle, RMySQL, and RJDBC packages, and probably with RdbiPgSQL from the Bioconductor project. The ODBC interface has been tested using SQLite under OS X and using MS Access and SQLite under Windows.

As an example, we analyse some data from NHANES III. The multiply-imputed data sets have been loaded into a SQLite database as core, imp1, ..., imp5, and merged to create database views for analysis, with

create view set1 as select * from core inner join imp1 using(SEQN)

We will use the first imputed data set, set1. The call to syvdesign specifies data="set1" as the data table and gives dbtype="SQLite", to use the SQLite driver, and dbname="~/nhanes/imp.db" as the database name. SQLite does not need authentication, but any additional arguments that dbConnect will need to specify network address, authentication,etc, can be passed to svydesign.

> library(survey)
> library(RSQLite)
Loading required package: DBI
> 
> dhanes<-svydesign(id=~SDPPSU6, strat=~SDPSTRA6, weight=~WTPFQX6,  nest=TRUE, data="set1", dbname='~/nhanes/imp.db', dbtype="SQLite")
>
> dhanes
DB-backed Stratified 1 - level Cluster Sampling design (with replacement)
With (98) clusters.
svydesign(ids = ids, probs = probs, strata = strata, data = design.data, 
    fpc = fpc, variables = variables, nest = nest, check.strata = check.strata, 
    weights = weights)

Despite containing 33994 records on 268 variables, the object is relatively small,about 4Mb

> dim(dhanes)
[1] 33994   268
> object.size(dhanes)
[1] 4363168

Most of the 4Mb is the design meta-data: weights, and PSU and strata identifiers. The entire data set would be 75Mb and reading it in would temporarily require more than 200Mb of memory. The savings are largest when the number of variables is large: the size of the object is proportional to the number of records but independent of the number of variables.

If we had the data in a database with an ODBC interface, as would usually be the case on Windows, the svydesign call would be very similar. We use dbtype="ODBC" to specify the ODBC interface and for the dbname argument we use the declared DSN (data source name) for the database (declared in the ODBC applet in the Control Panel).

odhanes<-svydesign(id=~SDPPSU6, strat=~SDPSTRA6, weight=~WTPFQX6,  nest=TRUE, data="set1", dbname='nhanes3', dbtype="ODBC")

We can calculate some simple means and proportions for height, weight, and self-rated health


> svymean(~factor(HAB1MI), dhanes)
                     mean     SE
factor(HAB1MI)-9 0.252690 0.0045
factor(HAB1MI)1  0.155718 0.0053
factor(HAB1MI)2  0.229727 0.0055
factor(HAB1MI)3  0.244180 0.0053
factor(HAB1MI)4  0.094067 0.0042
factor(HAB1MI)5  0.023618 0.0012
> svyby(~BMPWTMI,~factor(HAB1MI), design=dhanes, svymean)
   factor(HAB1MI) statistics.BMPWTMI se.BMPWTMI
-9             -9           33.66272  0.4445430
1               1           72.12343  0.4095566
2               2           74.28381  0.3574284
3               3           76.24362  0.5225713
4               4           76.67077  0.6193902
5               5           74.04166  0.9048933
> svycoplot(BMPWTMI~HSAGEIR|HAB1MI, dhanes)

The people in poor health tend to be older, but not to have much difference in weight. The unavailable self-rated health is for children.

Hematocrit (red blood cell concentration) is slightly lower for people poor health (adjusted for age), and the difference between levels 1 and 5 is statistically significant.

> m<-svyglm(HTPMI~HSAGEIR+factor(HAB1MI),design=dhanes)
> summary(m)

Call:
svyglm(HTPMI ~ HSAGEIR + factor(HAB1MI), design = dhanes)

Survey design:
svydesign(ids = ids, probs = probs, strata = strata, data = design.data, 
    fpc = fpc, variables = variables, nest = nest, check.strata = check.strata, 
    weights = weights)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     36.1120776  0.1441689 250.484   <2e-16 ***
HSAGEIR         -0.0001076  0.0020744  -0.052    0.959    
factor(HAB1MI)1  5.8721946  0.1654730  35.487   <2e-16 ***
factor(HAB1MI)2  5.8431314  0.1417736  41.215   <2e-16 ***
factor(HAB1MI)3  5.8879374  0.1760945  33.436   <2e-16 ***
factor(HAB1MI)4  5.4271810  0.1805892  30.053   <2e-16 ***
factor(HAB1MI)5  5.0784499  0.2903547  17.491   <2e-16 ***
---

> svycontrast(m, c(`factor(HAB1MI)1`=1, `factor(HAB1MI)5`=-1))
         contrast     SE
contrast  0.79374 0.2416