Import foreign statistical formats into R via the embedded 'ReadStat' C library, <https://github.com/WizardMac/ReadStat>.
Haven enables R to read and write various data formats used by other statistical packages by wrapping the fantastic ReadStat C library written by Evan Miller. Haven is part of the tidyverse. Currently it supports:
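SAS: read_sas() reads .sas7bdat and .sas7bcat files, and read_xpt() reads SAS transport files (version 5 and version 8). write_sas() writes .sas7bdat files.
SPSS: read_sav() reads .sav files and read_por() reads the older .por files. write_sav() writes .sav files.
Stata: read_dta() reads .dta files (up to version 14). write_dta() writes .dta files (versions 8-14).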
The output objects:
Are tibbles, which have a better print method for very long and very wide files.
Translate value labels into a new labelled() class, which preserves the original semantics and can easily be coerced to factors with as_factor(). Special missing values are preserved. See vignette("semantics") for more details.
Dates and times are converted to R date/time classes. Character vectors are not converted to factors.
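As a minimal sketch of how labelled values behave, the example below builds a labelled vector by hand and coerces it to a factor. The codes and label names are made up for illustration; in practice such vectors usually come from one of the read functions.

```r
library(haven)

# Hypothetical survey codes with value labels attached.
x <- labelled(c(1, 2, 1, 3), labels = c(Low = 1, Medium = 2, High = 3))

is.labelled(x)   # TRUE: the underlying numeric codes are kept

# Coerce to a factor; levels come from the labels, ordered by value.
f <- as_factor(x)
levels(f)
```

The numeric codes stay available until you explicitly convert, which is why the labelled class is kept separate from factors.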
# The easiest way to get haven is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just haven:
install.packages("haven")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/haven")
library(haven)

# SAS
read_sas("mtcars.sas7bdat")
write_sas(mtcars, "mtcars.sas7bdat")

# SPSS
read_sav("mtcars.sav")
write_sav(mtcars, "mtcars.sav")

# Stata
read_dta("mtcars.dta")
write_dta(mtcars, "mtcars.dta")
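The calls above assume the files already exist on disk. A self-contained round trip might look like the following sketch, which writes to a temporary file (the path is an assumption for illustration, not part of the original examples) and reads it back.

```r
library(haven)

# Write the built-in mtcars data to a temporary Stata file.
path <- tempfile(fileext = ".dta")
write_dta(mtcars, path)

# Read it back; the result is a tibble with the same columns.
back <- read_dta(path)
dim(back)
names(back)
```

Row names are not part of the file formats, so the car names in mtcars do not survive the round trip.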
haven can read and write non-ASCII paths in R 3.5 (#371).
labelled_spss objects preserve their attributes when subsetted
read_sav() gains an encoding argument to override the encoding stored in the file (#305).
read_sav() can now read .zsav files (#338).
write_*() functions now invisibly return the input data frame
(as documented) (#349, @austensen).
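A brief sketch of what the invisible return value allows: the result of a write call can be captured or passed on to another step. The data frame and temporary path here are illustrative assumptions.

```r
library(haven)

df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
path <- tempfile(fileext = ".sav")

# Nothing is printed, but the input data frame is returned.
returned <- write_sav(df, path)

# The returned value can be used in further steps.
nrow(returned)
```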
write_dta() allows non-ASCII variable labels for version 14 and above (#383). It also uses a less strict check for integers so that a labelled double containing only integer values can be written (#343).
write_sav() writes .zsav files when compress = TRUE (#338).
write_xpt() can now set the "member" name, which defaults to the file name sans extension (#328).
Update to latest readstat.
Fix for when as_factor() with option levels = "labels" is used on tagged NAs
Update to latest readstat. Includes:
encoding now affects value labels (#325)
read_xpt() now correctly preserves attributes if output needs to be reallocated (which is typical behaviour) (#313)
read_sas() recognises date/times format with trailing separator and width
read_sas() gains a catalog_encoding argument so you can independently specify encoding of data and catalog (#312)
write_*() correctly measures lengths of non-ASCII labels (#258): this
fixes the cryptic error "A provided string value was longer than the
available storage size of the specified column."
write_dta() now checks for bad labels in all columns, not just the first
write_sav() no longer fails on empty factors or factors with an
level (#301) and writes out more metadata for
Update to latest readstat. Includes:
as_factor() with forcats package (#256)
read_sav() once again correctly returns system defined missings
NA (rather than
SPSS's display widths (@ecortens).
read_sas() gains an experimental cols_only argument to only read in specified columns (#248).
tibbles are created with tibble::as_tibble(), rather than by "hand" (#229).
write_sav() checks that factors don't have levels with >120
write_dta() no longer checks that all value labels are at most 32
characters (since this is not a restriction of dta files) (#239).
All write methods now check that you're trying to write a data frame (#287).
Add support for reading (read_xpt()) and writing (write_xpt()) SAS transport files.
write_* functions turn ordered factors into labelled vectors (#285)
The ReadStat library is stored in a subdirectory of src (#209, @krlmlr).
Import tibble so that tibbles are printed consistently (#154, @krlmlr).
Update to latest ReadStat (#65). Includes:
Added support for reading and writing variable formats. Similarly to variable labels, formats are stored as an attribute on the vector. Use zap_formats() if you want to remove these attributes. (@gorcha, #119, #123).
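A small sketch of format attributes, under the assumption that SPSS formats are stored in an attribute named format.spss (set by hand here rather than read from a file):

```r
library(haven)

# Attach an SPSS-style display format by hand (assumed attribute name).
x <- c(1.5, 2.25, 10)
attr(x, "format.spss") <- "F8.2"
attributes(x)

# zap_formats() removes format attributes and leaves the values alone.
y <- zap_formats(x)
attributes(y)
```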
Added support for reading file "label" and "notes". These are not currently printed, but are stored in the attributes if you need to access them (#186).
Added support for "tagged" missing values (in Stata these are called "extended" and in SAS these are called "special") which carry an extra byte of information: a character label from "a" to "z". The downside of this change is that all integer columns are now converted to doubles, to support the encoding of the tag in the payload of a NaN.
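Tagged missing values can also be created and inspected directly. The following sketch uses tagged_na(), is_tagged_na() and na_tag() on a hand-built vector; the values are illustrative.

```r
library(haven)

# Two tagged missings ("a" and "z") alongside a regular NA.
x <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

is.double(x)       # TRUE: tags live in the NaN payload, so the vector is double
is.na(x)           # base R sees all three as missing
is_tagged_na(x)    # only the tagged ones
na_tag(x)          # NA NA "a" "z" NA
```

Because base R treats every tagged value as an ordinary NA, existing code that checks is.na() keeps working.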
labelled_spss() is a subclass of labelled() that can model user missing values from SPSS. These can either be a set of distinct values, or for numeric vectors, a range.
zap_labels() strips labels, and replaces user-defined missing values with NA. zap_missing() just replaces user-defined missing values with NA.
labelled_spss() is potentially dangerous to work with in R because base functions don't know about labelled_spss() objects, so they will return the wrong result in the presence of user-defined missing values. For this reason, they will only be created when user_na = TRUE (normally user-defined missings are converted to NA).
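To illustrate the risk, here is a hedged sketch with a hand-built labelled_spss vector in which 99 is declared a user-defined missing value (the code 99 and its label are assumptions for illustration). Arithmetic on the raw codes silently includes the 99, while zap_missing() converts it to NA first.

```r
library(haven)

# 99 is a user-defined missing value ("Refused").
x <- labelled_spss(c(1, 2, 99, 3), labels = c(Refused = 99), na_values = 99)

# haven's is.na() method knows about user-defined missings.
is.na(x)

# Working on the raw codes ignores that information.
raw_mean <- mean(as.numeric(unclass(x)))

# Convert user-defined missings to NA before computing.
clean <- zap_missing(x)
clean_mean <- mean(as.numeric(unclass(clean)), na.rm = TRUE)

c(raw = raw_mean, clean = clean_mean)
```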
as_factor() no longer drops the label attribute (variable label) when used (#177, @itsdalmo). Using levels = "default" or levels = "both" preserves unused labels (implicit missing) when converting (#172, @itsdalmo). Labels (and the resulting factor levels) are always sorted by values.
as_factor() gains a new levels = "default" mechanism. This uses the labels where present, and otherwise uses the values. This is now the default, as it seems to map better to the semantics of labelled values in other statistical packages (#81). You can also use levels = "both" to combine the value and the label into a single string (#82). It also gains a method for data frames, so you can easily convert every labelled column to a factor in one function call.
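The different levels options can be compared on a partially labelled vector. This is a sketch with made-up codes and labels; the value 2 deliberately has no label.

```r
library(haven)

x <- labelled(c(1, 2, 3, 2), labels = c(Agree = 1, Disagree = 3))

# Default: labels where present, otherwise the values, sorted by value.
default_levels <- levels(as_factor(x))
default_levels

# Combine value and label into a single string.
levels(as_factor(x, levels = "both"))

# Data frame method: converts labelled columns, leaves others alone.
df <- data.frame(id = 1:4)
df$answer <- x
out <- as_factor(df)
str(out)
```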
vignette("semantics", package = "haven") discusses the semantics
of missing values and labelling in SAS, SPSS, and Stata, and how they
are translated into R.
hms() has been moved into the hms package (#162).
Time variables now have class c("hms", "difftime") and a units attribute with value "secs" (#162).
labelled() is less strict with its checks: you can mix double and integer values and labels (#86, #110, @lionel-), and is.labelled() is now exported (#124). Putting a labelled vector in a data frame now generates the correct column name (#193).
read_dta() now recognises "%d" and custom date types (#80, #130).
It also gains an encoding parameter which you can use to override
the default encoding. This is particularly useful for Stata 13 and below
which did not store the encoding used in the file (#163).
read_por() now actually works (#35).
read_sav() now correctly recognises EDATE and JDATE formats as dates (#72).
Variables with format DATE, ADATE, EDATE, JDATE or SDATE are imported as Date variables instead of POSIXct. You can now set user_na = TRUE to preserve user defined missing values: they will be given class labelled_spss.
read_sav() have a better test for missing
string values (#79). They can all read from connections and compressed files
read_sas() gains an encoding parameter to override the encoding stored in the file if it is incorrect (#176). It gets better argument names (#214).
type_sum() method for labelled objects so they print nicely in
write_dta() now verifies that variable names are valid Stata variables (#132), and throws an error if you attempt to save a labelled vector that is not an integer (#144). You can choose which version of Stata's file format to output (#217).
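A short sketch of both behaviours, using an illustrative data frame and temporary paths: the version argument selects the output format, and an invalid Stata variable name (here a name containing a space) is rejected with an error.

```r
library(haven)

df <- data.frame(id = 1:3, weight = c(60.5, 72.1, 81.0))
path <- tempfile(fileext = ".dta")

# Write in the Stata 13 file format instead of the default.
write_dta(df, path, version = 13)
back <- read_dta(path)

# A variable name with a space is not valid in Stata.
bad <- data.frame(`bad name` = 1:3, check.names = FALSE)
result <- tryCatch(
  write_dta(bad, tempfile(fileext = ".dta")),
  error = function(e) e
)
inherits(result, "error")
```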
write_sas() allows you to write data frames out to .sas7bdat files. This is still somewhat experimental.
write_sav() writes hms variables to SPSS time variables, and the
"measure" type is set for each variable (#133).
write_dta() and write_sav() support writing date and date/times (#25, #139, #145). Labelled values are always converted to UTF-8 before being written out (#87). Infinite values are now converted to missing values since SPSS and Stata don't support them (#149). Both use a better test for missing values (#70).
zap_labels() has been completely overhauled. It now works (@markriseley, #69), and only drops label attributes; it no longer replaces labelled values with NAs. It also gains a data frame method that zaps the labels from every column.
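The data frame method can be sketched as follows, with a small made-up data frame containing one labelled column:

```r
library(haven)

df <- data.frame(id = 1:3)
df$sex <- labelled(c(1, 2, 1), labels = c(Male = 1, Female = 2))

# Drop the labels from every column in one call.
plain <- zap_labels(df)

is.labelled(df$sex)      # TRUE before zapping
is.labelled(plain$sex)   # FALSE afterwards; the numeric codes remain
plain$sex
```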
print.labelled_spss() now display the type.
Fixed a bug in as_factor.labelled, which generated <NA>s and wrong labels for integer labels.
zap_labels() now leaves unlabelled vectors unchanged, making it easier
to apply to all columns.
write_sav() take more care to always write output as
write_sav() won't crash if you give them invalid paths, and you can now use ~ to refer to your home directory (#37).
Byte variables are now correctly read into integers (not strings, #45), and missing values are captured correctly (#43).
read_stata() as alias to
read_spss() uses extension to automatically choose between
Updates from ReadStat, including fixes for various parsing bugs, more encodings, and better support for large files.
hms objects deal better with missings when printing.
Fixed bug causing labels for numeric variables to be read in as
integers and associated error:
Error: `x` and `labels` must be same type