Infers state-recorded gender categories from first names and dates of birth using historical datasets. By using these datasets instead of lists of male and female names, this package is able to more accurately infer the gender of a name, and it is able to report the probability that a name was male or female. GUIDELINES: This method must be used cautiously and responsibly. Please be sure to see the guidelines and warnings about usage in the 'README' or the package documentation. See Blevins and Mullen (2015) <http://www.digitalhumanities.org/dhq/vol/9/3/000223/000223.html>.
An R package for predicting gender from first names using historical data.
Data sets, historical or otherwise, often contain a list of first names but seldom identify those names by gender. Most techniques for finding gender programmatically rely on lists of male and female names. However, the gender associated with names can vary over time. Any data set that covers the normal span of a human life will require a historical method to find gender from names. This R package uses historical datasets from the U.S. Social Security Administration, the U.S. Census Bureau (via IPUMS USA), and the North Atlantic Population Project to provide predictions of gender for first names for particular countries and time periods.
You can install this package from CRAN:
install.packages("genderdata", type = "source", repos = "")
If you prefer, you can install the development versions of both packages from the rOpenSci package repository:
install.packages(c("gender", "genderdata"), repos = "", type = "source")
The gender() function takes a character vector of names and a year or range of years and uses various datasets to predict the gender of names. Here we predict the gender of the names Madison and Hillary in 1930 and again in the 2000s using Social Security data.
library(gender)
gender(c("Madison", "Hillary"), years = 1930, method = "ssa")
#> # A tibble: 2 x 6
#>   name    proportion_male proportion_female gender year_min year_max
#>   <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
#> 1 Hillary              1.                0. male      1930.    1930.
#> 2 Madison              1.                0. male      1930.    1930.

gender(c("Madison", "Hillary"), years = c(2000, 2010), method = "ssa")
#> # A tibble: 2 x 6
#>   name    proportion_male proportion_female gender year_min year_max
#>   <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
#> 1 Hillary         0.00550             0.994 female    2000.    2010.
#> 2 Madison         0.00460             0.995 female    2000.    2010.
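The shift between 1930 and the 2000s reflects how often each name was recorded for each sex in the chosen years. As a rough illustration of that idea, and not the package's actual implementation, the sketch below pools counts over a range of years and turns them into proportions. The `toy_counts` data frame and its numbers are invented for illustration, and `predict_from_counts()` is a hypothetical helper.

```r
# Made-up counts for illustration only; these are NOT Social Security figures.
toy_counts <- data.frame(
  name   = c("leslie", "leslie", "leslie"),
  year   = c(1900, 1950, 2000),
  female = c(10, 500, 900),
  male   = c(90, 500, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pool the counts for one name over a range of years
# and report the share recorded as male and as female.
predict_from_counts <- function(counts, name, year_min, year_max) {
  rows <- counts[counts$name == tolower(name) &
                   counts$year >= year_min &
                   counts$year <= year_max, ]
  if (nrow(rows) == 0) {
    return(NULL)  # no data for this name in this period
  }
  female <- sum(rows$female)
  male   <- sum(rows$male)
  prop_female <- female / (female + male)
  # An exact 50/50 split is left as NA in this sketch.
  label <- if (prop_female > 0.5) {
    "female"
  } else if (prop_female < 0.5) {
    "male"
  } else {
    NA_character_
  }
  data.frame(
    name              = name,
    proportion_male   = 1 - prop_female,
    proportion_female = prop_female,
    gender            = label,
    year_min          = year_min,
    year_max          = year_max,
    stringsAsFactors  = FALSE
  )
}

predict_from_counts(toy_counts, "Leslie", 1900, 1900)  # mostly male in the toy data
predict_from_counts(toy_counts, "Leslie", 2000, 2000)  # mostly female in the toy data
```

With the toy numbers, the same name gets a different prediction depending on the years supplied, which is the behavior the Madison and Hillary example shows with real data.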
See the package vignette for a fuller introduction and suggestions on how to use the gender() function efficiently with large datasets.
vignette(topic = "predicting-gender", package = "gender")
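One general way to keep lookups cheap on a large dataset is to predict each distinct name only once and then join the results back to every row. The sketch below shows that pattern in base R. It is an assumption about a reasonable approach and may differ from what the vignette recommends. The `people` data frame and `toy_lookup()` are hypothetical; in real use the lookup would be a call to gender() for a fixed range of years.

```r
# Hypothetical input: many rows, few distinct names.
people <- data.frame(
  id   = 1:6,
  name = c("Madison", "Hillary", "Madison", "Madison", "Hillary", "Madison"),
  stringsAsFactors = FALSE
)

# Stand-in for a real lookup such as a wrapper around gender(). It counts
# how many names it is asked about so the saving is visible.
calls <- 0
toy_lookup <- function(names) {
  calls <<- calls + length(names)
  data.frame(
    name   = names,
    gender = rep("unknown", length(names)),  # placeholder, not a real prediction
    stringsAsFactors = FALSE
  )
}

# Look up each distinct name once, then join the results back to every row.
distinct_names <- unique(people$name)
results <- toy_lookup(distinct_names)
people_with_gender <- merge(people, results, by = "name", all.x = TRUE)
people_with_gender <- people_with_gender[order(people_with_gender$id), ]

calls                     # 2 lookups instead of 6
nrow(people_with_gender)  # still 6 rows
```

If the years differ from row to row, the same idea applies to distinct combinations of name and year rather than to names alone.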
To read the documentation for the datasets, install the genderdata package then examine the included datasets.
library(genderdata)
data(package = "genderdata")
If you use this package, I would appreciate a citation. You can see up-to-date citation information with citation("gender"). You can cite either the package or the accompanying journal article.
Historical Data. R package version 0.5.2. https://github.com/ropensci/gender
Cameron Blevins and Lincoln Mullen, “Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction,” Digital Humanities Quarterly 9, no. 3 (2015): http://www.digitalhumanities.org/dhq/vol/9/3/000223/000223.html
genderdata package as binary
install.packages() from the rOpenSci package repository instead of using
gender() to data frames
ipums method that predicts gender before 1930 using U.S. Census data from IPUMS (contributed by Benjamin Schmidt).
gender implements gender lookup for names and data frames