Predict Gender from Names Using Historical Data

Encodes gender based on names and dates of birth using historical datasets. Because it uses these datasets rather than fixed lists of male and female names, this package can guess the gender of a name more accurately and can report the probability that a name was male or female.


Datasets, historical or otherwise, often contain a list of first names but seldom identify those names by gender. Most techniques for finding gender programmatically rely on lists of male and female names. However, the gender associated with a name can vary over time, so any dataset that covers the normal span of a human life requires a historical method to find gender from names. This R package uses historical datasets from the U.S. Social Security Administration, the U.S. Census Bureau (via IPUMS USA), and the North Atlantic Population Project to predict the gender of first names for particular countries and time periods.
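
The idea behind a historical method can be sketched in a few lines of base R. This is only an illustration, not the package's implementation: the counts are invented toy numbers (not real Social Security figures), and the helper predict_gender_toy() is a hypothetical name introduced here. It sums the births recorded for a name within a range of years and reports the share that were male and female.

```r
# Toy counts, invented for illustration only (NOT real SSA figures).
births <- data.frame(
  name = c("Leslie", "Leslie", "Leslie", "Leslie"),
  year = c(1900, 1900, 2000, 2000),
  sex  = c("F", "M", "F", "M"),
  n    = c(10, 90, 80, 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: proportion of births recorded as male and female
# for one name within an inclusive range of years.
predict_gender_toy <- function(counts, name, year_min, year_max) {
  rows <- counts[counts$name == name &
                   counts$year >= year_min &
                   counts$year <= year_max, ]
  female <- sum(rows$n[rows$sex == "F"])
  male   <- sum(rows$n[rows$sex == "M"])
  total  <- female + male
  if (total == 0) return(NULL)  # name not observed in this period
  data.frame(
    name = name,
    proportion_male = male / total,
    proportion_female = female / total,
    # Assumption of this sketch: a tie is labelled "female".
    gender = if (male > female) "male" else "female",
    year_min = year_min,
    year_max = year_max,
    stringsAsFactors = FALSE
  )
}

early <- predict_gender_toy(births, "Leslie", 1900, 1900)
late  <- predict_gender_toy(births, "Leslie", 2000, 2000)
```

With these toy counts the same name is labelled male for 1900 and female for 2000, which is the kind of shift a fixed list of names cannot capture.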

You can install this package from CRAN:

install.packages("gender")

The first time you use the package, you will be prompted to install the accompanying genderdata package. Alternatively, you can install genderdata yourself from the rOpenSci package repository:

install.packages("genderdata", type = "source",
                 repos = "http://packages.ropensci.org")

If you prefer, you can install the development versions of both packages from the rOpenSci package repository:

install.packages(c("gender", "genderdata"),
                 repos = "http://packages.ropensci.org",
                 type = "source")

The gender() function takes a character vector of names and a year or range of years and uses various datasets to predict the gender of names. Here we predict the gender of the names Madison and Hillary in 1930 and again in the 2000s using Social Security data.

library(gender)
gender(c("Madison", "Hillary"), years = 1930, method = "ssa")
#> Source: local data frame [2 x 6]
#> 
#>      name proportion_male proportion_female gender year_min year_max
#>     (chr)           (dbl)             (dbl)  (chr)    (dbl)    (dbl)
#> 1 Hillary               1                 0   male     1930     1930
#> 2 Madison               1                 0   male     1930     1930
gender(c("Madison", "Hillary"), years = c(2000, 2010), method = "ssa")
#> Source: local data frame [2 x 6]
#> 
#>      name proportion_male proportion_female gender year_min year_max
#>     (chr)           (dbl)             (dbl)  (chr)    (dbl)    (dbl)
#> 1 Hillary          0.0055            0.9945 female     2000     2010
#> 2 Madison          0.0046            0.9954 female     2000     2010
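
Because the result includes proportions as well as a label, you can apply your own confidence threshold before trusting a prediction. The following is a minimal sketch in base R. It rebuilds the 2000 to 2010 rows by hand from the output above so that it runs without the package installed; the threshold value and the gender_strict column are assumptions made for this example, not part of the package.

```r
# Values copied by hand from the 2000-2010 output shown above.
result <- data.frame(
  name = c("Hillary", "Madison"),
  proportion_male = c(0.0055, 0.0046),
  proportion_female = c(0.9945, 0.9954),
  gender = c("female", "female"),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off, chosen here only so that both outcomes appear.
min_confidence <- 0.995

# Confidence is the larger of the two proportions for each name.
confidence <- pmax(result$proportion_male, result$proportion_female)

# Keep the predicted label only when it clears the threshold; otherwise NA.
result$gender_strict <- ifelse(confidence >= min_confidence,
                               result$gender, NA_character_)
```

In practice you would pick a threshold that suits your analysis and decide how to handle the names that fall below it.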

See the package vignette or read it online at CRAN for a fuller introduction and suggestions on how to use the gender() function efficiently with large datasets.

vignette(topic = "predicting-gender", package = "gender")
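
One general way to keep lookups cheap on a large dataset is to predict each distinct name once and then map the results back onto the full table. The sketch below shows that pattern in base R. It is not taken from the vignette, which may recommend a different approach, and lookup_stub() is a hypothetical stand-in for a real lookup such as gender() so that the example runs on its own.

```r
# Hypothetical stand-in for a real lookup; returns one row per name.
# The name-to-gender rule here is invented purely for the example.
lookup_stub <- function(names) {
  data.frame(
    name = names,
    gender = ifelse(names %in% c("Mary", "Ann"), "female", "male"),
    stringsAsFactors = FALSE
  )
}

# A small table with repeated names, standing in for a large dataset.
people <- data.frame(
  id = 1:5,
  name = c("Mary", "John", "Mary", "Ann", "John"),
  stringsAsFactors = FALSE
)

# Look up each distinct name only once.
distinct_names <- unique(people$name)
lookup <- lookup_stub(distinct_names)

# match() maps every row to its lookup result and preserves row order.
people$gender <- lookup$gender[match(people$name, lookup$name)]
```

Here five rows need only three lookups. With real data that has a year column, you would deduplicate on the combination of name and year range instead of on the name alone.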

To read the documentation for the datasets, install the genderdata package, then examine the included datasets.

library(genderdata)
data(package = "genderdata")

If you use this package, I would appreciate a citation. You can see up-to-date citation information with citation("gender"). You can cite either the package or the accompanying journal article.

Cameron Blevins and Lincoln Mullen, "Jane, John ... Leslie? A Historical Method for Algorithmic Gender Prediction," Digital Humanities Quarterly (forthcoming 2015).


News

gender 0.5.1

  • bugfix for some users who cannot install the genderdata package as binary

gender 0.5.0

  • genderdata package is installed using install.packages() from the rOpenSci package repository instead of using install_github().
  • all functions always return data frames
  • general performance improvements
  • calls to Genderize.io API no longer fail if the name does not exist
  • new function gender_df() efficiently applies gender() to data frames
  • add North Atlantic Population Project dataset for six European countries

gender 0.4.3

  • updates to README.md as requested by CRAN

gender 0.4.2

  • bugfix: Kantrowitz method is now case-insensitive
  • updates to title and descriptions according to CRAN policy

gender 0.4.1

  • tests and vignettes run without depending on the genderdata package
  • users will be prompted to install the genderdata package from GitHub the first time that it is necessary
  • added a demo mode with a minimal dataset

gender 0.4

  • data is now external to the gender package and is available in the genderdata package.
  • genderdata package can be installed with a new function

gender 0.3

  • rewrote all functions to take only character vectors, not data frames, but provided instructions on how to use with data frames
  • wrote a vignette describing the data sources and explaining the historical methodology behind this package

gender 0.2

  • implemented an ipums method that predicts gender before 1930 using U.S. Census data from IPUMS (contributed by Benjamin Schmidt).
  • upgraded dependency on dplyr to 0.2.

gender 0.1

  • function gender implements gender lookup for names and data frames
  • implemented finding gender by using the Kantrowitz names corpus
  • implemented finding gender by using the national Social Security Administration data for names and dates of birth

0.5.1 by Lincoln Mullen, 2 years ago


https://github.com/ropensci/gender


Report a bug at https://github.com/ropensci/gender/issues


Browse source code at https://github.com/cran/gender


Authors: Lincoln Mullen [aut, cre], Cameron Blevins [ctb], Ben Schmidt [ctb]


MIT + file LICENSE license


Imports dplyr, httr, jsonlite

Depends on utils, stats

Suggests genderdata, ggplot2, knitr, testthat


Imported by qdap.