Estimation and diagnosis of the convergence of Generalized Exponential Random Graph Models via Gibbs sampling or Metropolis Hastings with exponential down weighting.
An R package to estimate Generalized Exponential Random Graph Models
PLEASE REPORT ANY BUGS OR ERRORS TO firstname.lastname@example.org.
[08/06/16] A major update to all aspects of the package has been pushed to the public GERGM repo and will shortly be up on CRAN. Here are some highlights:
gergm()function send you updates about estimation to a designated slack channel. This is as close to having the package text message users as we could get. Useful for keeping up to date on long running models.
An R package which implements the Generalized Exponential Random Graph Model (GERGM) with an extension to estimation via Metropolis Hastings. The relevant papers detailing the model can be found at the links below:
See the Requirements for using C++ code with R section in the following tutorial: Using C++ and R code Together with Rcpp. You will likely need to install either
Rtools depending on whether you are using a Mac or Windows machine before you can use the package.
The easiest way to do this is to install the package from CRAN via the standard
This will take care of some weird compilation issues that can arise, and is the best option for most people. If you want the most current development version of the package (available here), you will need to start by making sure you have Hadley Wickham's devtools package installed.
Now we can install from Github using the following line:
I have had success installing this way on most major operating systems with R 3.2.0+ installed, but if you do not have the latest version of R installed, or run into some install errors (please email if you do), it should work as long as you install the dependencies first with the following block of code:
install.packages( pkgs = c("BH","RcppArmadillo","ggplot2","methods", "stringr","igraph", "plyr", "parallel", "coda", "vegan", "scales", "RcppParallel","slackr"), dependencies = TRUE)
GERGM package is installed, you may access its functionality as you would any other package by calling:
If all went well, check out the
?GERGM help file to see a full working example with info on how the data should look.
To use this package, first load in the network you wish to use as a (square) matrix, following the example provided below. You may then use the
gergm() function to estimate a model using any combination of the following statistics:
out2stars(alpha = 1),
in2stars(alpha = 1),
ctriads(alpha = 1),
mutual(alpha = 1),
ttriads(alpha = 1),
absdiff(covariate = "MyCov"),
sender(covariate = "MyCov"),
reciever(covariate = "MyCov"),
nodemix(covariate, base = "MyBase"),
edges(alpha = 1, method = c("regression","endogenous")). Note that the
edges term must be specified if the user wishes to include an intercept (strongly recommended). The user may select the "regression" method (default) to include an intercept in the lambda transformation of the network, or "endogenous" to include the intercept as in a traditional ERGM model. To use exponential down-weighting for any of the network level terms, simply specify a value for alpha less than 1. The
(alpha = 1) term may be omitted from the structural terms if no exponential down-weighting is required. In this case, the terms may be provided as:
If the network is undirected the user may only specify the following terms:
twostars(alpha = 1),
ttriads(alpha = 1),
absdiff(covariate = "MyCov"),
sender(covariate = "MyCov"),
nodemix(covariate, base = "MyBase"),
edges(alpha = 1, method = c("regression","endogenous")). The
gergm() function will provide all of the estimation and diagnostic functionality and the parameters of this function can be queried by typing
?gergm into the R console. You may also generate diagnostic plots using a GERGM Object returned by the
gergm() function by using any of the following functions:
Trace_Plot(). The user may also investigate the sensitivity of parameter estimates using the
hysteresis() function. This function simulates large numbers of networks at parameter values around the estimated parameter values an plots the mean network density at each of these values to examine whether the model becomes degenerate due to small deviations in the parameter estimates. See the following reference for details:
The user may also predict edge values (one at a time, conditioning on the rest of the edge values, and estimated parameters), using the
Here are two simple working examples using the
library(GERGM) ########################### 1. No Covariates ############################# # Preparing an unbounded network without covariates for gergm estimation # set.seed(12345) net <- matrix(rnorm(100,0,20),10,10) colnames(net) <- rownames(net) <- letters[1:10] formula <- net ~ edges(method = "endogenous") + mutual + ttriads test <- gergm(formula, normalization_type = "division", network_is_directed = TRUE, number_of_networks_to_simulate = 40000, thin = 1/10, proposal_variance = 0.2, MCMC_burnin = 10000, seed = 456, convergence_tolerance = 0.01, force_x_theta_update = 4) ########################### 2. Covariates ############################# # Preparing an unbounded network with covariates for gergm estimation # set.seed(12345) net <- matrix(runif(100,0,1),10,10) colnames(net) <- rownames(net) <- letters[1:10] node_level_covariates <- data.frame(Age = c(25,30,34,27,36,39,27,28,35,40), Height = c(70,70,67,58,65,67,64,74,76,80), Type = c("A","B","B","A","A","A","B","B","C","C")) rownames(node_level_covariates) <- letters[1:10] network_covariate <- net + matrix(rnorm(100,0,.5),10,10) formula <- net ~ edges(method = "regression") + mutual + ttriads + sender("Age") + netcov("network_covariate") + nodemix("Type",base = "A") test <- gergm(formula, covariate_data = node_level_covariates, number_of_networks_to_simulate = 100000, thin = 1/10, proposal_variance = 0.2, MCMC_burnin = 50000, seed = 456, convergence_tolerance = 0.01, force_x_theta_update = 2) # Generate Estimate Plot Estimate_Plot(test) # Generate GOF Plot GOF(test) # Generate Trace Plot Trace_Plot(test) # Generate Hysteresis plots for all structural parameter estimates hysteresis_results <- hysteresis(test, networks_to_simulate = 1000, burnin = 500, range = 2, steps = 20, simulation_method = "Metropolis", proposal_variance = 0.2)
Following on from the example above, we can also predict individual edge values, conditioning on the rest of the observed edges and estimated parameters. These predictions can then be plotted as a grid of box plots for each individual edge, as in the following example code:
test2 <- conditional_edge_prediction( GERGM_Object = test, number_of_networks_to_simulate = 1000, thin = 1, proposal_variance = 0.5, MCMC_burnin = 1000, seed = 123) # set working directory where we want to save the file. Only outputs a PDF # because plotting is too slow in the graphics device. setwd("~/Desktop") conditional_edge_prediction_plot(edge_prediction_results = test2, filename = "test.pdf", plot_size = 20, node_name_cex = 6) # make a scatter plot of predicted vs observed edge values conditional_edge_prediction_correlation_plot(edge_prediction_results = test2) # set x and y limits manually conditional_edge_prediction_correlation_plot(edge_prediction_results = test2, xlim = c(-10,10), ylim = c(-10,10))
There is also now functionality to run multiple
gergm() model specifications in parallel using the
parallel_gergm() function. This can come in very handy if the user wishes to specify the same model but for a large number of networks, or multiple models for the same network. Another useful (experimental) feature that can now be turned out is
hyperparameter_optimization = TRUE, which will seek to automatically optimize the number of networks simulated during MCMC, the burnin, the Metropolis Hastings proposal variance and will seek to address any issues with model degeneracy that arise during estimation by reducing exponential weights if using Metropolis Hastings. This feature is generally meant to make it easier and less time intensive to find a model that fits the data well.
set.seed(12345) net <- matrix(runif(100,0,1),10,10) colnames(net) <- rownames(net) <- letters[1:10] node_level_covariates <- data.frame(Age = c(25,30,34,27,36,39,27,28,35,40), Height = c(70,70,67,58,65,67,64,74,76,80), Type = c("A","B","B","A","A","A","B","B","C","C")) rownames(node_level_covariates) <- letters[1:10] network_covariate <- net + matrix(rnorm(100,0,.5),10,10) network_data_list <- list(network_covariate = network_covariate) formula <- net ~ edges +netcov("network_covariate") + nodematch("Type",base = "A") formula2 <- net ~ edges + sender("Age") + netcov("network_covariate") + nodemix("Type",base = "A") form_list <- list(f1 = formula, f2 = formula2) testp <- parallel_gergm(formula_list = form_list, observed_network_list = net, covariate_data_list = node_level_covariates, network_data_list = network_data_list, cores = 2, number_of_networks_to_simulate = 10000, thin = 1/10, proposal_variance = 0.1, MCMC_burnin = 5000)
Finally, if you specified an
output_name, you will want to check the
output_directory which will contain a number of .pdf's which can aide in assessing model fit and in determining the statistical significance of theta parameter estimates.
output_name is specified in the
gergm() function, then five files will be automatically generated and saved to the
output_directory. The example file names provided below are for
output_name = "Test":
The GERGM package incorporates a number of advanced features that are designed to aid in model fitting and evaluation, and expand the applicability of the model to a wider range of networks and network statistics.
force_x_theta_updates -- This can be a useful way to improve model fit, or to assess the robustness of theta estimates. The standard way to make sure that theta estimates have converged is to set a large value for the
convergence_tolerance parameter ( which defaults to 0.5). However, if the user suspects that the theta parameter estimates have not actually converged, despite the t-test p-values, then they can for the model to perform a certain number of theta updates. This is a particularly useful feature in helping to determine whether a specification suffers from likelihood degeneracy. The user can set the number of theta updates to be higher than the
convergence_tolerance parameter threshold would suggest, and see if theta parameter estimates diverge. If they do not, this is some indication that the estimation procedure is stable. However, in general, it is advisable to leave this parameter set to its default value of 1.
force_x_lambda_updates -- This parameter functions very similarly to the
force_x_theta_updates parameter, but is primarily useful for improving model fit. It may be the case that the model detects convergence, but due to slight changes in the lambda parameters at their last update, model fit may be impacted. Setting a higer value for this parameter can force the model to really "settle down" before completing estimation. In general, it is advisable to leave this parameter set to its default value of 1.
hyperparameter_optimization -- Turning on this option does a number of things which are meant to automate some of the process of model fitting. The main function of this option is to turn out Metropolis Hasting proposal variance optimization. If selected, the program will automatically seek to find a proposal variance that leads to an acceptance rate which is within five percent of
target_accept_rate. This adaptive metropolis step is done before the main estimation step, so it will not interfere with the usability of the MH samples we gather. The second function of this option is to attempt to automatically deal with likelihood degeneracy through the use of exponential down-weights (when using Metropolis-Hastings for estimation). If the estimation procedure detects problems with estimation (theta estimates zooming off to positive or negative infinity), the alpha weights for each of the endogenous statistics will be set to 90 percent of its current value. Decreasing these weights has been found to make it computationally easier to fit a model, however, the user is strongly encouraged to set alpha weights themselves or use the
theta_grid_optimization_list option to improve initialization.
stop_for_degeneracy -- If the user specifies this option, then estimation will automatically hard-stop when degeneracy is detected.
theta_grid_optimization_list -- This highly experimental feature may allow the user to address model degeneracy arising from a suboptimal theta initialization. It performs a grid search around the theta values calculated via MPLE to select a potentially improved initialization. The runtime complexity of this feature grows exponentially in the size of the grid and number of parameters -- use with great care. This feature may only be used if hyperparameter_optimization = TRUE, and if a list object of the following form is provided:
theta_grid_optimization_list = list( grid_steps = 2, step_size = 0.5, cores = 2, iteration_fraction = 0.5).
grid_steps indicates the number of steps out the grid search will perform, step_size indicates the fraction of the MPLE theta estimate that each grid search step will change by, cores indicates the number of cores to be used for parallel optimization, and iteration_fraction indicates the fraction of the number of MCMC iterations that will be used for each grid point (should be set less than 1 to speed up optimization). In general grid_steps should be smaller the more structural parameters the user wishes to specify. For example, with 5 structural parameters (mutual, ttriads, etc.), grid_steps = 3 will result in a (2*3+1)^5 = 16807 parameter grid search. Again this feature is highly experimental and should only be used as a last resort (after playing with exponential downweighting and the MPLE_gain_factor).
weighted_MPLE -- Logical indicating whether the estimation routine should use MPLE with weights on statistic values (via numerical approximation). This is recommended if the user is specifying
alpha weights on any endogenous statistics that are less than 1. It will tend to slow down MPLE, but is optimized for parallelization, so it can take advantage of the
parallel -- Logical indicating whether any operations that are "embarrassingly parallel" should be calculated in parallel acros a number of cores specified by the
cores parameter. Since the operations governed by this parameter are easy to parallelize, it is possible to realize a significant speedup in their calculation, particularly when using weighted MPLE for large networks.
parallel_statistic_calculation -- This option turns on or off parallelization via threading in the endogenous statistic calculation for the Metropolis Hastings procedure. This will only really show a significant speedup in large graphs, when the user has specified several endogenous terms to be calculated, as each will be calculated in parallel. This option will actually slow down estimation when used for networks smaller than about 40-50 nodes, as threading requires some computational overhead.
cores -- If
parallel = TRUE and/or
parallel_statistic_calculation = TRUE, this is the number of cores that will be allocated for parallelization. Don't set this to a number greater than the number of threads your computer can support.
use_stochastic_MH -- This is logical indicating whether a stochastic approximation to the h statistics should be used under Metropolis Hastings to determine whether proposals are accepted or rejected (in-between the samples we save to use for MCMCMLE). We currently do dyad/triad sampling with biased sampling towards dyads/triads whose value is close to the mean. This is EXTREMELY EXPERIMENTAL.
stochastic_MH_proportion -- Percentage of dyads/triads to use for the stochastic approximation above, defaults to 0.25. This is basically untested, but probably needs to be higher to get a good enough approximation for smaller networks, although the hope is that it could be a smaller proportion as graph size grows.
estimate_model -- Logical indicating whether estimation should be performed or whether a GERGM object should be returned before estimation. Defaults to TRUE. Setting this to FALSE can be useful for diagnosing bugs.
slackr_integration_list -- An optional list object that contains information necessary to provide updates about model fitting progress to a Slack channel (https://slack.com/) of your choosing. This can be useful if models take a long time to run, and you wish to receive updates on their progress (or if they become degenerate). Before you can get this option to work, you will need to turn on "web hook integration" for your slack channel. If you go to the
Chanel Settings ->
Add an app or integration pane and search for "web hook integration", you should be able to turn this option on. From there you will get an incoming webhook url that you will need to copy and save for use with the
gergm() function. It is also probably advisable to create a separate channel for updates, as these may end up swamping
#general. Once you have a channel and incoming webhook url, you will need to enter your information as a list argument in the
gergm(). The list object must be of the following form:
slackr_integration_list = list( model_name = "descriptive model name", channel = "#yourchannelname", incoming_webhook_url = "https://hooks.slack.com/services/XX/YY/ZZ")
Note that the
model_name is a way to distinguish multiple models that can simultaneously post information to the same slack channel if you so desire, because
model_name will appear as the name of the poster on the slack channel. If all goes well, and the computer you are running the GERGM estimation on has internet access, your slack channel will receive updates when you start estimation, after each lambda/theta parameter update, if the model becomes degenerate, and when it completes running. This feature is still in development.
So far, this package has been tested successfully on OSX, CentOS 7, Ubuntu and Windows 7. Please email me at email@example.com if you have success on another OS or run into any problems.