Zelig is a framework that brings together an abundance of common statistical models found across packages into a unified interface. It provides a common architecture for estimation and interpretation, as well as bridging functions to absorb more models into the package over time. Zelig allows each individual package, for each statistical model, to be accessed through a common, uniformly structured call and set of arguments. Moreover, Zelig automates the surrounding building blocks of a statistical workflow: procedures and algorithms that may be essential to one user's application but which the original package developer did not use in their own research and might not themselves support. These include bootstrapping, jackknifing, and re-weighting of data. In particular, Zelig automatically generates predicted and simulated quantities of interest (such as relative risk ratios, average treatment effects, first differences, and predicted and expected values) to interpret and visualize complex models.
Development: Dev-Blog
All models in Zelig can be estimated, and their results explored and presented, using four simple functions:
zelig to estimate the parameters,
setx to set fitted values for which we want to find quantities of interest,
sim to simulate the quantities of interest,
plot to plot the simulation results.
Zelig 5 introduced reference classes. These enable a different way of working with Zelig that is detailed in a separate vignette. Directly using the reference class architecture is optional. They are not used in the examples below.
Let’s walk through an example. This example uses the swiss dataset. It contains data on fertility and socioeconomic factors in Switzerland’s 47 French-speaking provinces in 1888 (Mosteller and Tukey, 1977, 549-551). We will model the effect of education on fertility, where education is measured as the percent of draftees with education beyond primary school and fertility is measured using the common standardized fertility measure (see Muehlenbein (2010, 80-81) for details).
If you haven't already done so, open your R console and install Zelig. We recommend installing Zelig with the zeligverse package. This installs core Zelig and ancillary packages at once.
install.packages('zeligverse')
Alternatively you can install the development version of Zelig with:
devtools::install_github('IQSS/Zelig')
Once Zelig is installed, load it:
library(zeligverse)
Let’s assume we want to estimate the effect of education on fertility.
Since fertility is a continuous variable, least squares (ls) is an
appropriate model choice. To estimate our model, we call the zelig()
function with three arguments: equation, model type, and data:
# load data
data(swiss)
# estimate ls model
z5_1 <- zelig(Fertility ~ Education, model = "ls", data = swiss, cite = FALSE)
# model summary
summary(z5_1)
## Model:
##
## Call:
## z5$zelig(formula = Fertility ~ Education, data = swiss)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.036  -6.711  -1.011   9.526  19.689
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  79.6101     2.1041  37.836  < 2e-16
## Education    -0.8624     0.1448  -5.954 3.66e-07
##
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared: 0.4406, Adjusted R-squared: 0.4282
## F-statistic: 35.45 on 1 and 45 DF, p-value: 3.659e-07
##
## Next step: Use 'setx' method
The -0.86 coefficient on education suggests a negative relationship
between the education of a province and its fertility rate. More
precisely, for every one percentage point increase in draftees educated
beyond primary school, the predicted fertility measure of the province
decreases by 0.86 units.
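As a cross-check, the point estimates above can be reproduced outside of Zelig. The sketch below assumes that the "ls" model corresponds to ordinary least squares as fitted by base R's lm(); it is an illustration only and not part of the Zelig workflow.

```r
# Cross-check with base R (not a Zelig call): fit the same
# least squares model directly with lm().
data(swiss)
fit <- lm(Fertility ~ Education, data = swiss)

# Intercept and slope, rounded as in the summary table
round(coef(fit), 4)

# Implied change in predicted Fertility for a 10 percentage point
# increase in Education (10 times the slope)
10 * coef(fit)[["Education"]]
```

The last line gives the point estimate that simulated first differences between Education = 5 and Education = 15 should centre on.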
To help us better interpret this finding, we may want other quantities
of interest, such as expected values or first differences. Zelig makes
this simple by automating the translation of model estimates into
interpretable quantities of interest using Monte Carlo simulation
methods (see King, Tomz, and Wittenberg (2000) for more information).
For example, let’s say we want to examine the effect of increasing the
percent of draftees educated from 5 to 15. To do so, we set our
predictor values using the setx() and setx1() functions:
# set education to 5 and 15
z5_1 <- setx(z5_1, Education = 5)
z5_1 <- setx1(z5_1, Education = 15)
# model summary
summary(z5_1)
## setx:
##   (Intercept) Education
## 1           1         5
## setx1:
##   (Intercept) Education
## 1           1        15
##
## Next step: Use 'sim' method
After setting our predictor values, we simulate using the sim() method:
# run simulations and estimate quantities of interest
z5_1 <- sim(z5_1)
# model summary
summary(z5_1)
##
## sim x :
## -----
## ev
##       mean       sd      50%     2.5%    97.5%
## 1 75.30616 1.658283 75.28057 72.12486 78.48007
## pv
##          mean       sd      50%     2.5%   97.5%
## [1,] 75.28028 9.707597 75.60282 57.11199 94.3199
##
## sim x1 :
## -----
## ev
##       mean       sd      50%     2.5%    97.5%
## 1 66.66467 1.515977 66.63699 63.66668 69.64761
## pv
##          mean       sd      50%     2.5%    97.5%
## [1,] 66.02916 9.441273 66.32583 47.19223 82.98039
## fd
##        mean       sd       50%      2.5%     97.5%
## 1 -8.641488 1.442774 -8.656953 -11.43863 -5.898305
At this point, we’ve estimated a model, set the predictor values, and
simulated easily interpretable quantities of interest. The summary()
method shows us these quantities, namely the expected (ev) and
predicted (pv) values at each level of education, as well as the first
difference (fd): the difference in expected values between the two set
levels of education.
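To make the expected values and first differences less of a black box, the sketch below approximates them by hand with base R and MASS. It assumes coefficient draws from a multivariate normal distribution centred on the lm() estimates, in the spirit of King, Tomz, and Wittenberg (2000). The seed, the number of draws, and the helper summarise_sims are illustrative choices, and this is not Zelig's internal code.

```r
# A minimal sketch of simulation-based quantities of interest.
# Illustration only; not Zelig's implementation.
library(MASS)   # provides mvrnorm()

data(swiss)
fit <- lm(Fertility ~ Education, data = swiss)

set.seed(1234)  # arbitrary seed, used only for reproducibility
n_sims <- 1000  # number of simulation draws (illustrative choice)

# Draw coefficient vectors from their estimated sampling distribution
beta_sims <- mvrnorm(n_sims, mu = coef(fit), Sigma = vcov(fit))

# Scenarios: intercept term plus Education fixed at 5 and at 15
x  <- c(1, 5)
x1 <- c(1, 15)

# Expected value under each scenario, one per draw
ev_x  <- as.vector(beta_sims %*% x)
ev_x1 <- as.vector(beta_sims %*% x1)

# First difference: change in expected value between the scenarios
fd <- ev_x1 - ev_x

# Hypothetical helper: mean, sd, median, and 95% interval of the draws
summarise_sims <- function(s) {
  c(mean = mean(s), sd = sd(s), quantile(s, c(0.5, 0.025, 0.975)))
}

rbind(ev_x  = summarise_sims(ev_x),
      ev_x1 = summarise_sims(ev_x1),
      fd    = summarise_sims(fd))
```

Because the draws are random, the numbers will differ slightly from the Zelig output, but the means should sit close to the ev and fd rows shown above. Predicted values (pv) would additionally require adding residual noise to each draw, which this sketch leaves out.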
Zelig’s plot() function plots the estimated quantities of interest:
plot(z5_1)
We can also simulate and plot quantities of interest across a range of fitted values:
z5_2 <- zelig(Fertility ~ Education, model = "ls", data = swiss, cite = FALSE)
# set Education to range from 5 to 15 at single integer increments
z5_2 <- setx(z5_2, Education = 5:15)
# run simulations and estimate quantities of interest
z5_2 <- sim(z5_2)
Then use the plot() function as before:
z5_2 <- plot(z5_2)
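For comparison, an analytic counterpart to the simulated range can be sketched with base R alone. The example below assumes an ordinary lm() fit and uses predict() with confidence intervals over the same Education values. It is not a Zelig call, and its intervals are computed analytically rather than by simulation.

```r
# Analytic counterpart to the simulated range (base R, not Zelig)
data(swiss)
fit <- lm(Fertility ~ Education, data = swiss)

# Same scenarios as Education = 5:15
scenarios <- data.frame(Education = 5:15)

# Expected values with 95% confidence intervals at each scenario
ev <- predict(fit, newdata = scenarios,
              interval = "confidence", level = 0.95)
cbind(scenarios, ev)

# Simple base graphics version of a range plot:
# solid line for the expected value, dashed lines for the interval
plot(scenarios$Education, ev[, "fit"], type = "l",
     ylim = range(ev), xlab = "Education", ylab = "Expected Fertility")
lines(scenarios$Education, ev[, "lwr"], lty = 2)
lines(scenarios$Education, ev[, "upr"], lty = 2)
```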
The primary documentation for Zelig is available at: http://docs.zeligproject.org/articles/.
Within R, you can access function help using the normal ? function, e.g.:
?setx
If you are looking for details on particular estimation model methods,
you can also use the ? function. Simply place a z before the model
name. For example, to access details about the logit model use:
?zlogit
Zelig can be fully checked and built using the code in check_build_zelig.R. Note that this can be time consuming due to the extensive test coverage.
All changes to Zelig are documented here. GitHub issue numbers are given after each change note when relevant. See https://github.com/IQSS/Zelig/issues. External contributors are referenced with their GitHub usernames when applicable.
++++ All Zelig time series models will be deprecated on 1 February 2018 ++++
Resolved an issue where odds_ratios standard errors were not correctly returned for logit and relogit models. Thanks to @retrography. #302
Zelig 4 compatibility wrappers now work for arima models. Thanks to @mbsabath. #280
Resolved an error when only setx was called with arima models. Thanks to @mbsabath. #299
Resolved an error when summary was called after sim for arima models.
Resolved an error when sim is used with differenced first-order autoregressive models. #307
arima models return an informative error when data is not found. #308
Compatibility with testthat 2.0.0
Documentation updated to correctly reflect that tobit wraps AER::tobit.
Speed improvements made to relogit. Thanks to @retrography. #88
Returns relogit weighted case control method to that described in King and Zeng (2001, eq. 11) and used in the Stata relogit implementation.
summary now returns odds ratios for relogit models via the odds_ratios = TRUE argument. #302
zquantile with Amelia imputed data now working. #277
vcov now works with rq quantile regression models.
More informative error handling for conflicting timeseries model arguments. #283
Resolved an issue with relogit that produced a warning when the fitted model object was passed to predict. #291
!EXPERIMENTAL! interface function to_zelig allows users to convert model objects fitted outside of Zelig to a Zelig object. The function is called within the setx wrapper if a non-Zelig object is supplied. Currently only works for models fitted with lm and many estimated with glm and svyglm. #189
get_se and get_pvalue function wrappers created for get_se and get_pvalue methods, respectively. #269
If combine_coef_se is given a model estimated without multiply imputed data or bootstraps, an error is no longer returned. Instead a list of the models' untransformed coefficients, standard errors, and p-values is returned. #268
summary for logit models now accepts the argument odds_ratios. When TRUE, odds ratio estimates are returned rather than coefficient estimates. Thanks to Adam Obeng. PR/#270.
setx and sim fail informatively when passed ZeligEI objects. #271
Resolved a bug where weights were not being passed to svydesign in survey models. #258
Due to limited functionality and instability, zelig survey estimations now return a warning and a link to documentation on how to use to_survey via setx to bypass zelig. #273
Resolved a bug where from_zelig_model would not extract fitted model objects for models estimated using vglm. #265
get_pvalue and get_se now work for models estimated using vglm. #267
Improved ivreg, mlogit, and getter (#266) documentation.
Average Treatment Effect on the Treated (ATT) vignette added to the online documentation http://docs.zeligproject.org/articles/att.html
Corrected vignette URLs.
Introduce a new model type for instrumental-variable regression: ivreg, based on ivreg from the AER package. #223
Use the Formula package for formulas. This will enable a common syntax for multiple equations, though currently in Core Zelig it only enhances ivreg. #241
zelig calls now support formulas built with update (#244) and the . syntax for inserting all variables from data on the right-hand side of the formula.
Arbitrary log transformations are now supported in zelig calls (except for ivreg regressors). #225
Arbitrary as.factor and factor transformations are now supported in zelig calls.
Restored quantile regression (model = "rq"). Currently only supports one tau at a time. #255
Added get_qi wrapper for get_qi method.
Added ATT wrapper for ATT method.
gee models can now be estimated with multiply imputed data. #263
zelig returns an error if weights are specified in a model estimated with multiply imputed data. (This was not possible before either, but the error returned was uninformative.)
Code improvement to factor_coef_combine so it does not return a warning for model types with more than 1 declared class.
Reorganize README files to meet new CRAN requirements.
Switch bind_rows for rbind_all in zquantile as the latter is deprecated.
Depends on the survival package in order to enable setx for exponential models without explicitly loading survival. #254
relogit now only accepts one tau per call (similar to quantile). Fixed to address #257.
Additional unit tests.
New function combine_coef_se takes as input a zelig model estimated using multiply imputed data or bootstrapping and returns a list of coefficients, standard errors, z-values, and p-values combined across the estimations. Thanks to @vincentarelbundock for prompting. #229
The following changes were primarily made to re-establish Zelig integration with WhatIf. #236
Added zelig_setx_to_df for extracting fitted values created by setx.
Fitted factor level variable values are returned in a single column (not by parameter level) by zelig_qi_to_df.
setrange (including setx used with a range of fitted values) now creates scenarios based on matches of equal length set ranges. This enables setx to work with polynomials, splines, etc. (currently only when these are created outside of the zelig call). #238
Resolve a bug where appropriate plots were not created for mlogitbayes. #206
Arguments (such as xlab) can now be passed to plot. #237
zelig_qi_to_df and qi_slimmer bug with multinomial response models resolved. #235
Resolved a bug where coef, coefficients, vcov, fitted, and predict returned errors. Thanks to @vincentarelbundock for initially reporting. #231
Reduced number of digits shown from summary for fitted model objects.
!! Breaking change !! the get* functions (e.g. getcoef) now use underscores _ to delimit words in the function names (e.g. get_coef). #214
Added a number of new "getter" methods for extracting estimation elements:
  get_names method to return Zelig object field names. Also available via names. #216
  get_residuals to extract fitted model residuals. Also available via residuals.
  get_df_residuals method to return residual degrees-of-freedom. Also accessible via df.residuals.
  get_model_data method to return the data frame used to estimate the original model.
  get_pvalue and get_se methods to return estimated model p-values and standard errors. Thank you to @vincentarelbundock for contributions. #147
zelig_qi_to_df function for extracting simulated quantities of interest from a Zelig object and returning them as a tidy-formatted data frame. #189
setx returns an error if it is unable to find a supplied variable name.
setx1 wrapper added to facilitate piped workflows for first differences.
zelig can handle independent variables that are transformed using the natural logarithm inside of the call. #225
Corrected an issue where plot would tend to choose a factor level as the x-axis variable when plotting a range of simulations. #226
If a factor level variable's fitted value is not specified in setx and it is multi-modal, the last factor in the factor list is arbitrarily chosen. This replaces previous behavior where the level was randomly chosen, causing unhelpful quantity of interest range plots. #226
Corrected a bug where summary for ranges of setx would only show the first scenario. Now all scenarios are shown. #226
Corrected a bug where the README.md was not included in the CRAN build.
to_zelig_mi now can accept a list of data frames. Thanks to @vincentarelbundock.
Internal code improvements.
Allows users to convert an independent variable to a factor within a zelig call using as.factor. #213
from_zelig_model function to extract original fitted model objects from zelig estimation calls. This is useful for conducting non-Zelig supported post-estimation and easy integration with the texreg and stargazer packages for formatted parameter estimate tables. #189
Additional MC tests for a wide range of models. #160
Resolves a bug from set where sim would fail for models that included factor level independent variables. #156
Fixed an issue with model-survey where ids was hard coded as ~1. #144
Fixed ATT bug introduced in 5.0-14. #194
Fixed ci.plot bug with timeseries models introduced in 5.0-15. #204
mode has been deprecated. Please use Mode. #152
The Zelig 4 sim wrapper now intelligently looks for fitted values from the reference class object if not supplied via the x argument.
New to_zelig_mi utility function for combining multiply imputed data sets for passing to zelig. mi will also work to enable backwards compatibility. #178
Initial development on a new testing architecture and more tests for model-*, Zelig 4 wrappers, ci.plot, and the Zelig workflow.
graph method now accepts simulations from setx and setrange. For the former it uses qi.plot and ci.plot for the latter.
Improved error messages for Zelig 4 wrappers.
Improved error messages if Zelig methods are supplied with too little information.
model-arima now fails if the dependent variable does not vary for one of the cases.
Minor documentation improvements for Zelig 4 wrappers.
Dynamically generated README.md.
Removed plyr package dependency.
rbind_all replaced by bind_rows as the former is deprecated by dplyr.
Other internal code improvements