Functions and helpers to import metadata, ngrams and full-texts delivered by Data for Research by JSTOR.
This is another small release to fix compatibility with
readr v1.3.0 and
tibble v2.0.0. There are no other changes.
This is a small release, mainly to fix compatibility with version
readr. There is one breaking change however:
jst_get_references have been renamed to avoid ambiguity when matching with output from
jst_get_article. All columns now have a
sample_ has been replaced with
article_ or removed altogether.
jst_define_import due to upcoming release of
jst_get_journal_overview(most_recent = T)) had to be removed due to changes on their server. I will try to find a solution with JSTOR support so we can add the functionality again.
jst_define_import now prints the specification in a pretty and informative way.
jst_define_import now checks the definition more extensively:
jst_define_import(article = jst_get_book) or similar mis-specifications will raise an error.
find_* functions is now defunct (they raise an error).
This is a hotfix to resolve an issue with writing to directories other than temporary folders during tests, which should not have happened in the first place.
jstor is now part of rOpenSci.
jst_add_total_pages, since both built on the dev version of rlang. Once this version is on CRAN, they will be re-introduced.
jst_import_zip now use futures as a backend for parallel
processing. This makes internals more compact and reduces dependencies.
Furthermore this reduces the number of arguments, since the argument
has been removed. By default, the functions run sequentially. If you want them
to execute in parallel, use futures:
library(future)
plan(multiprocess)
jst_import_zip("zip-archive.zip",
               import_spec = jst_define_import(article = jst_get_article),
               out_file = "outfile")
If you want to terminate the processes, at least on *nix-systems you need to kill them manually (once again).
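As a hedged sketch of switching between parallel and sequential execution: the snippet below assumes the future package is installed and uses multisession rather than multiprocess, because newer releases of future deprecate the latter. The worker count of 2 is an arbitrary choice for illustration, and the import call itself is left as a comment.

```r
# Assumption: the future package is installed.
# multisession is swapped in for multiprocess, which newer future
# releases deprecate; workers = 2 is an arbitrary illustrative value.
library(future)

plan(multisession, workers = 2)  # run subsequent futures in background R sessions

# ... call jst_import_zip() here, as in the example above ...

plan(sequential)                 # switch back to sequential processing
```

Setting the plan back to sequential affects subsequent calls only; it is not a substitute for killing processes that are already running.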
jst_*. The former group of
find_* functions is now called
jst_get_*, as in
jst_get_article(). The previous functions have been deprecated and will be removed before submission to CRAN.
file_name, and the corresponding helper to get this file name from
There is a new set of functions which lets you directly import files from
In the following example, we have a zip-archive from DfR and want to import
metadata on books and articles. For all articles we want to apply
jst_get_article() and jst_get_authors(), for books only
jst_get_book(), and we want to read unigrams (ngram1).
First we specify what we want, and then we apply it to our zip-archive:
# specify definition
import_spec <- jst_define_import(
  article = c(jst_get_article, jst_get_authors),
  book = jst_get_book,
  ngram1 = jst_get_ngram
)

# apply definition to archive
jst_import_zip("zip_archive.zip",
               import_spec = import_spec,
               out_file = "out_path")
If the archive also contains research reports, pamphlets or other ngrams, they will not be imported. We could however change our specification if we wanted to import all kinds of ngrams (given that we originally requested them from DfR):
# import multiple forms of ngrams
import_spec <- jst_define_import(
  article = c(jst_get_article, jst_get_authors),
  book = jst_get_book,
  ngram1 = jst_get_ngram,
  ngram2 = jst_get_ngram,
  ngram3 = jst_get_ngram
)
Note however that for larger archives, importing all ngrams takes a very long
time. It is thus advisable to only import ngrams for articles which you
want to analyse, i.e. most likely a subset of the initial request. The new
jst_subset_ngrams() helps you with this (see also the section on
importing bigrams in the
Before importing all files from a zip-archive, you can get a quick overview with
vignette("known-quirks") lists common problems with data from
JSTOR/DfR. Contributions with further cases are welcome!
jst_get_journal_overview() supplies a tibble with contextual information about the journals in JSTOR.
jst_re_import() to a whole directory and lets you combine all related files in one go. It uses the file structure that
jst_import_zip() provide as a heuristic: a filename with a dash and one or multiple digits at its end (
filename-1.csv). All files with identical names (disregarding dash and digits) are combined into one file.
jst_re_import() lets you re-import a
jst_import_zip() had exported. It tries to guess the type of content based on the column names or, if column names are not available, from the number of columns, raising a warning if guessing fails and reverting to a generic import.
jst_subset_ngrams() lets you create a subset of ngram files within a zip-file which you can import with
jst_clean_page() tries to turn a character vector with pages into a numeric one,
jst_unify_journal_id() merges different specifications of journals into one,
jst_add_total_pages() adds a total count of pages per article, and
jst_augment() calls all three functions to clean the data set in one go.
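The file-name heuristic described above for combining outputs (a dash followed by one or more digits at the end of the name) can be illustrated with a few lines of base R. This is only a sketch of the grouping idea, not the package's implementation; the file names and the helper group_name() are hypothetical.

```r
# Hypothetical file names following the pattern described above
files <- c("articles-1.csv", "articles-2.csv", "books-1.csv", "ngram1-10.csv")

# Hypothetical helper: drop the extension, then a trailing dash plus digits,
# so that files belonging together end up with the same group name
group_name <- function(path) {
  sub("-[0-9]+$", "", tools::file_path_sans_ext(basename(path)))
}

# One list element per group; each element holds the files to be combined
groups <- split(files, group_name(files))
groups
```

Files that share a group name are the ones that would be combined into a single file.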
n_batches which lets you specify the number of batches directly
snow as a backend for
jstor_import now writes column names by default #29
get_basename helps to get the basename of a file without its extension
find_article does not coerce days and months to integer any more, since there might be information stored as text.
NEWS.md file to track changes to the package.