Data sets and utilities from Project MOSAIC (<http://mosaic-web.org>) used to teach mathematics, statistics, computation and modeling. Funded by the NSF, Project MOSAIC is a community of educators working to tie together aspects of quantitative work that students in science, technology, engineering and mathematics will need in their professional lives, but which are usually taught in isolation, if at all.
The mosaic package is designed to facilitate the use of R in statistics and calculus instruction by providing a number of functions that (a) make many common tasks fit into a common template, and (b) simplify some tasks that would otherwise be too complicated for beginners.
Install from CRAN using
install.packages("ggplot2") # Get the newest version of ggplot2 FIRST
install.packages("mosaic")
or from github with
If you want to try out our developmental code (the beta branch), use
Updates to the master GitHub repository are more frequent than CRAN updates. Our beta branch is where we implement bug fixes most quickly and develop new features. We try to keep it fairly stable, but there may be a few rough edges, missing documentation, etc. while things are in progress.
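A minimal sketch of the two GitHub installs, assuming the repository is ProjectMOSAIC/mosaic and that the remotes package is used as the installer (neither detail is stated in this text, so adjust as needed):

```r
# Assumption: the package source lives at github.com/ProjectMOSAIC/mosaic.
# Assumption: the 'remotes' package is used to install from GitHub.
install.packages("remotes")

# Default branch (updated more often than CRAN)
remotes::install_github("ProjectMOSAIC/mosaic")

# Development code from the beta branch
remotes::install_github("ProjectMOSAIC/mosaic", ref = "beta")
```

Only one of the two `install_github()` calls is needed; the second replaces whatever the first installed.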
If you discover a problem with any version of the package, be sure to let us know so that we can address it. Post an issue on github or send email to
The package includes several vignettes to help you get started. One of these vignettes (Resources Related to the mosaic package) includes a list of many resources, both within the package and external to it. That's a good place to start.
Need help? Try posting a question on Stack Overflow using the tag r-mosaic.
Project MOSAIC is a community of educators working to develop a new way to introduce mathematics, statistics, computation and modeling to students in colleges and universities.
Our goal: Provide a broader approach to quantitative studies that better supports work in science and technology. The focus of the project is to better tie together the diverse aspects of quantitative work that students in science, technology, and engineering will need in their professional lives, but which are today usually taught in isolation, if at all.
The name MOSAIC reflects the first letters --- M, S, C, C --- of these important components of a quantitative education. Project MOSAIC is motivated by a vision of quantitative education as a mosaic where the basic materials come together to form a complete and compelling picture.
Find out more about Project MOSAIC at [http://mosaic-web.org].
mplot.lm() now removes points with leverage 1 to avoid errors and warnings; a warning message reports which points have been removed.
TukeyHSD() now correctly follows
system = "gg"
ggrepel to place labels and offers additional controls for the smooth curve that is overlaid. [gg version of plots only]
relrisk() now accept a 2x2 data frame to match claims in documentation.
prop.test() so it handles
success argument properly for 2-way tables.
which argument added to
xct() added to find central portions of distributions.
mplot() on linear models when system =
xpnorm() and friends now use
ggplot2 and can return the plot object, if requested.
t.test() has been completely reimplemented. It no longer supports "bare variable mode", but it is more similar to
stats::t.test() in some cases.
gwm() has been removed since it no longer works with the current version of
counts() have been added. They are a bit like
tally() but designed to play well with
df_stats(). Currently the formula versions drop missing data, but that will likely be determined by a user-supplied option in the future.
ggformula, so users will have
ggformula available after loading
mplot() on a data frame supports
ggformula has been added.
ggformula has been added.
mosaicCore. This should not affect users of
tally() now provide names to dimnames in cases where they were previously missing. This was needed for the refactoring of
tally() for tabulation. This means the behavior of
bargraph() should match expectations of users of
tally() better than it did before. In particular, proportions now sum to 1 in each panel of a multi-panel plot.
tally() so the proportions computed when
format = "proportion" are easier to predict.
prop(x ~ y) was reporting overall proportions rather than marginal proportions.
value(), a generic with several methods for extracting a "value" from a more complicated object. Useful for extracting values from output of
cubature::adaptIntegrate() without needing to know just how those values are stored in the object.
prop(a ~ b) to compute joint rather than conditional proportions.
sd(), etc.) now require that the first argument be a formula. This was always the preferred method, but some functions allowed bare variable names to be used instead. As a specific example, the following code now generates an error (unless there is another object named
age in your environment).
favstats(age, data = HELPrct)
## Error in typeof(x) : object 'age' not found
Replace this with
favstats( ~ age, data = HELPrct)
##  min Q1 median Q3 max     mean       sd   n missing
##   19 30     35 40  60 35.65342 7.710266 453       0
mplot.data.frame() allow it to work with an expression that evaluates to a data frame. ASH plots are now a choice for 1-variable plots.
deltaMethod() has been moved to a separate package (called
deltaMethod) to reduce package dependencies.
cull_for_do.lm() now returns a data frame instead of a vector. This makes it easier for
do() to bind things together by column name.
makeMap() updated to work with new version of
rdata() have been reordered so that the formula comes first.
rflip() has been improved.
dfapply(), also default value for
inspect(), which is primarily intended to give an overview of the variables in a data frame, but handles some additional objects as well.
data argument is not an environment or data frame.
mm() has been deprecated and replaced with
gwm() which does groupwise models where the response may be either categorical or quantitative.
plotModel(). This is likely still not the final version, but we are getting closer.
dotPlot() are now the same size in all panels of multi-panel plots.
cdist() has been rewritten.
mplot() on a data frame now (a) prompts the user for the type of plot to create and (b) has an added option to make line plots for time series and the like.
resample() can now do residual resampling from a linear model.
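As an illustration of the idea behind residual resampling, here is a base R sketch that does not use mosaic's own `resample()` interface. The helper `resample_residuals()` is hypothetical, and it assumes a model whose formula uses untransformed variables so that the model frame can be reused as data.

```r
set.seed(123)
fit <- lm(mpg ~ wt, data = mtcars)

# One residual-resampling replicate: keep the fitted values, draw the
# residuals with replacement, and refit the model to the new response.
# Hypothetical helper, not part of mosaic.
resample_residuals <- function(model) {
  d <- model.frame(model)
  response <- names(d)[1]
  d[[response]] <- fitted(model) + sample(residuals(model), replace = TRUE)
  coef(lm(formula(model), data = d))
}

# 500 replicates, one row per replicate, one column per coefficient
boot_coefs <- t(replicate(500, resample_residuals(fit)))

# Percentile interval for the slope
slope_ci <- quantile(boot_coefs[, "wt"], c(0.025, 0.975))
slope_ci
```

Because the predictor values stay fixed and only the residuals are reshuffled, this scheme suits designed experiments better than resampling whole rows does.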
do() to create common bootstrap confidence intervals. In particular,
confint() can now calculate three kinds of intervals in many common situations.
fetchGapminder() have been moved to a separate package, called
plotModel() can be used to show data and model fits for a variety of models created with
mosaicData a dependency of
mosaic. This avoids the problem of users forgetting to separately load the
read.file()) from future versions of the package. More and more packages are providing utilities for bringing data into R and it doesn't make sense for us to duplicate those efforts in this package. For Google Sheets, you might take a look at the
googlesheets package which is available via GitHub now and will be on CRAN soon.
t.test(), which have also undergone some internal restructuring. The objects returned now do a better job of reporting about the test conducted. In particular,
prop.test() will report the value of
binom.test() can now compute several different kinds of confidence intervals including the Wald, Plus-4 and Agresti-Coull intervals. (#449)
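For reference, the three interval types named above can be written out in a few lines of base R. This is a sketch using the textbook formulas, not mosaic's implementation; `prop_ci()` and its arguments are hypothetical names.

```r
# Hypothetical helper: confidence interval for a proportion, x successes in n trials.
prop_ci <- function(x, n, method = c("wald", "plus4", "agresti-coull"),
                    level = 0.95) {
  method <- match.arg(method)
  z <- qnorm(1 - (1 - level) / 2)
  if (method == "wald") {
    # Plain sample proportion
    n_adj <- n
    p <- x / n
  } else if (method == "plus4") {
    # Add two successes and two failures (intended for 95% intervals)
    n_adj <- n + 4
    p <- (x + 2) / n_adj
  } else {
    # Agresti-Coull: add z^2 / 2 successes and z^2 / 2 failures
    n_adj <- n + z^2
    p <- (x + z^2 / 2) / n_adj
  }
  p + c(lower = -1, upper = 1) * z * sqrt(p * (1 - p) / n_adj)
}

prop_ci(7, 20, "wald")
prop_ci(7, 20, "plus4")
prop_ci(7, 20, "agresti-coull")
```

The Plus-4 and Agresti-Coull intervals shift the center toward 1/2, which is why they behave better than the Wald interval for small samples or proportions near 0 or 1.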
derivedFactor() now handles NAs without throwing a warning. (#451)
pdist() and related functions now do a better (i.e., useful) job with discrete distributions (#417)
t.test() and all the "aggregating" functions like
favstats(). In particular, it is now possible to reference variables both in the
data argument and in the calling environment. (#435)
CIAdata() now provides a message indicating the source URL for the data retrieved (#444)
CIAdata() that seem to be related to a change in file format at the CIA World Factbook website. The "inflation" data set is still broken (on the CIA website). (#441)
read.file() now uses functions from
readr in some cases. A message is produced indicating which reader is being used. There are also some API changes. In particular, character data will be returned as character rather than factor. See
factorize() for an easy way to convert things with few unique values into factors. (#442)
mutate() is used in place of
transform() in the examples. (#452)
tally() now produces counts by default for all formula shapes. Proportions or percentages must be requested explicitly. This is to avoid common errors, especially when feeding the results into
msummary(). Usually this is identical to
summary(), but for a few kinds of objects it provides modified output that is less verbose.
do * lm( ) will now keep track of the F statistic, too.
confint() applied to an object produced using
do() now does more appropriate things.
success = 1 by default on 0-1 data to treat 0 like failure and 1 like success. Similarly,
level = 1 by default.
CIsim() can now produce plots and does so by default when
samples <= 200.
swap() which is useful for creating randomization distributions for paired designs. The current implementation is a bit slow.
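The idea of swapping within pairs can be sketched in base R as follows. This is not mosaic's `swap()`; `swap_pairs()` is a hypothetical helper and the data are made up for illustration.

```r
set.seed(42)

# Made-up paired measurements, for illustration only
pairs <- data.frame(before = c(12, 15, 11, 18, 14, 16),
                    after  = c(14, 18, 10, 21, 17, 19))

# Hypothetical helper: swap the two values within each row with probability 1/2.
swap_pairs <- function(d) {
  flip <- runif(nrow(d)) < 0.5
  out <- d
  out[flip, 1] <- d[flip, 2]
  out[flip, 2] <- d[flip, 1]
  out
}

observed <- mean(pairs$after - pairs$before)

# Randomization distribution of the mean difference under "no effect"
null_diffs <- replicate(2000, {
  s <- swap_pairs(pairs)
  mean(s$after - s$before)
})

# Two-sided randomization p-value
p_value <- mean(abs(null_diffs) >= abs(observed))
```

Swapping within a row keeps each pair intact, so the randomization respects the paired design instead of shuffling values across subjects.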
docFile() introduced to simplify accessing files included with package documentation.
read.file() enhanced to take a package as an argument and look among package documentation files.
factorize() introduced as a way to convert vectors with few unique values into factors. Can be applied to an entire data frame.
NHANES data set and
mosaicData contains the other data sets.
SAD() were added to compute mean and sum of all pairs of absolute differences.
rspin() has been added to simulate spinning a spinner.
mosaic package to simplify R for beginners.
plotFun() has been improved so that it does a better job of selecting points where the function is evaluated and no longer warns about
NaNs encountered while exploring the domain of the function.
oddsRatio() has been redesigned and
relrisk() has been added. Use their
verbose=TRUE to see more information (including confidence intervals).
mplot() and several instances have been added to make a number of plots easy to generate. There are methods for objects of classes
"hclust". For several of these there are also
fortify methods that return the data frame created to facilitate plotting.
read.file() now handles (some?) https URLs and accepts an optional argument
filetype that can be used to declare the type of data file when it is not identified by extension.
tally() function has changed to
mosaic now depends on
dplyr both to use some of its functionality and to avoid naming collisions with functions like
dplyr to coexist more happily.
dotPlot(). In particular, the size of the dots is determined differently and works better more of the time. Dots were also shifted down by .5 units so that they
do() that caused it to scope incorrectly in some edge cases when a variable had the same name as a function.
ntiles() has been reimplemented and now has more formatting options.
derivedFactor() for creating factors from logical "cases".
HELP data set has been removed from the package.
under=TRUE, making it easy to add plots of distributions over (or under) plots of data (e.g., histograms, densityplots, etc.) or other distributions.
add=TRUE have been reimplemented using
latticeExtra. See documentation of these functions for details.
ladd() has been completely reimplemented using
latticeExtra. See documentation of
ladd() for details, including some behavior changes.
var(), et al) now use
getOption("na.rm") to determine the default value of
options(na.rm=TRUE) to change the default behavior to remove
NAs and options(na.rm=NULL) to restore defaults.
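The pattern of reading a default from a global option can be sketched in base R. The helper `mean_opt()` below is hypothetical and only mimics the behavior described; it is not mosaic's implementation.

```r
# Hypothetical helper (not part of mosaic): the default for na.rm is read
# from the global option, falling back to FALSE when the option is unset.
mean_opt <- function(x, na.rm = getOption("na.rm", FALSE)) {
  base::mean(x, na.rm = na.rm)
}

x <- c(1, 2, NA, 4)

old <- options(na.rm = TRUE)   # remove NAs by default
with_option <- mean_opt(x)     # mean of 1, 2 and 4

options(na.rm = NULL)          # unset the option again
without_option <- mean_opt(x)  # NA, because the missing value is kept

options(old)                   # put back whatever was set before
```

An explicit `na.rm =` argument in a call still overrides the option, so the option only changes what happens when the argument is omitted.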
do() has been largely rewritten with an eye toward improved efficiency. In particular,
do() will take advantage of multiple cores if the
parallel package is available. At this point, sluggishness in applications of
do() is most likely due to the sluggishness of what is being done, not to
car package to make it easier to propagate uncertainty in some situations that commonly arise in the physical sciences and engineering.
cdist() to compute critical values for the central portion of a distribution.
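What "critical values for the central portion" means can be shown with a base R sketch for the normal distribution. The helper `central_norm()` is hypothetical and is not how `cdist()` is called or implemented.

```r
# Hypothetical helper: endpoints of the central proportion p of a normal
# distribution, leaving (1 - p) / 2 in each tail.
central_norm <- function(p, mean = 0, sd = 1) {
  tail <- (1 - p) / 2
  ends <- qnorm(c(tail, 1 - tail), mean = mean, sd = sd)
  setNames(ends, c("lower", "upper"))
}

central_norm(0.95)                       # the familiar +/- 1.96 cutoffs
central_norm(0.90, mean = 100, sd = 15)  # central 90% on another scale
```

The same recipe works for any distribution with a quantile function: evaluate it at the two tail probabilities.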
qdata(). For interactive use, this should not cause any problems, but old programmatic uses of
qdata() should be checked as the object returned is now different.
sd(), etc.) to produce counter-intuitive results (but with a warning). The results are now what one would expect (and the warning is removed).
rsquared() for extracting r-squared from models and model-like objects (
r.squared() has been deprecated).
do() now handles ANOVA-like objects better.
maggregate() is now built on some improved behind-the-scenes functions. Among other features, the
groups argument is now incorporated as an alternative method of specifying the groups to aggregate over and the
method argument can be set to
plyr package for aggregation. This results in a different output format that may be desired in some applications. The
qdata() functions have been largely rewritten. In addition,
qdata_f() are provided which produce similar results but have a formula in the first argument slot.
doc/ and so are available from within the package as well as via links to external files.
fetchGapminder() for fetching data sets originally from Gapminder.
cdata() for finding end points of a central portion of a variable.
prop() to avoid internal
: which makes downstream processing messier.
plotFun() can be used without
manipulate(). This makes it possible to put surface plots into RMarkdown or Rnw files or to generate them outside of RStudio.
do() * rflip() now records proportion heads as well as counts of heads and tails.
restoreLatticeOptions() to switch back and forth between
dotPlot() uses a different algorithm to determine dot sizes. (Still not perfect, but
cex can be used to further scale the dots.)
nint matches the number of bins used more accurately.
i2: max number of drinks is at least as large as
i1: the average number of drinks.
mPlot() provides an interactive environment for creating
sp2df() for converting SpatialPolygonDataFrames to regular data frames (which is useful for plotting with
ggplot2, for example). Also the
Countries data frame facilitates mapping country names among different sources of map data.
do() are now marked as such so that
confint() can behave differently for such data frames and for "regular" data frames.
t.test() can now do 1-sample t-test described using a formula.
var(), etc. using a formula interface) have been completely reimplemented and additional aggregating functions are provided.
ntiles() function has been added to facilitate creating factors based on quantile ranges.
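Creating a factor from quantile ranges can be sketched in base R with `quantile()` and `cut()`. The helper `quantile_groups()` is hypothetical and does not reproduce the arguments or output format of `ntiles()`.

```r
# Hypothetical helper: split x into n groups of roughly equal size using
# sample quantiles as break points. Assumes the quantiles are distinct;
# heavily tied data can produce duplicate breaks, which cut() rejects.
quantile_groups <- function(x, n = 4) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  cut(x, breaks = breaks, include.lowest = TRUE,
      labels = paste0("Q", seq_len(n)))
}

groups <- quantile_groups(mtcars$mpg, n = 4)
table(groups)
```

Setting `include.lowest = TRUE` keeps the minimum value in the first group instead of turning it into `NA`.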
xhistogram() is now deprecated. Use
var(), etc.) now use
getOption('na.rm') to determine default behavior.
var() allow it to work in a wider range of situations.
TukeyHSD() so that explicit use of
aov() is no longer required
panel.lmbands() for plotting confidence and prediction bands in linear regression
MASS has been removed by renaming the data set
freqpolygon() for making frequency polygons.
r.squared() for extracting r-squared from models and model-like objects.
do() so that hyphens ('-') are turned into dots ('.')
We are still in beta, but we hope things are beginning to stabilize as we settle on syntax and coding idioms for the package. Here are some of the key updates since 0.4:
lm() and its cousins.
makeFun() now has methods for glm and nls objects
D() improved to use symbolic differentiation in more cases and allow pass through to
stats::D() when that makes sense. This allows functions like deltaMethod() from the car package to work properly even when the mosaic package is loaded.
antiD() has been modified somewhat. This may go through another revision if/when we add in symbolic differentiation, but we think we are now close to the end state.
fitModel() have been added as wrappers around linear models using ns(), bs(), and nls(). Each of these returns the model fit as a function.