Predict Gender from Names Using Historical Data

Infers state-recorded gender categories from first names and dates of birth using historical datasets. By using these datasets instead of lists of male and female names, this package can infer the gender of a name more accurately, and it can report the probability that a name was male or female. GUIDELINES: This method must be used cautiously and responsibly. Please be sure to see the guidelines and warnings about usage in the 'README' or the package documentation. See Blevins and Mullen (2015).


An R package for predicting gender from first names using historical data.

Author: Lincoln Mullen
License: MIT


Data sets, historical or otherwise, often contain a list of first names but seldom identify those names by gender. Most techniques for finding gender programmatically rely on lists of male and female names. However, the gender associated with names can vary over time. Any data set that covers the normal span of a human life will require a historical method to find gender from names. This R package uses historical datasets from the U.S. Social Security Administration, the U.S. Census Bureau (via IPUMS USA), and the North Atlantic Population Project to provide predictions of gender for first names for particular countries and time periods.


You can install this package from CRAN with install.packages("gender").


The first time you use the package, you will be prompted to install the accompanying genderdata package. Alternatively, you can install genderdata yourself from the rOpenSci package repository:

install.packages("genderdata", type = "source",
                 repos = "")

If you prefer, you can install the development versions of both packages from the rOpenSci package repository:

install.packages(c("gender", "genderdata"),
                 repos = "",
                 type = "source")

Using the package

The gender() function takes a character vector of names and a year or range of years and uses various datasets to predict the gender of names. Here we predict the gender of the names Madison and Hillary in 1930 and again in the 2000s using Social Security data.

gender(c("Madison", "Hillary"), years = 1930, method = "ssa")
#> # A tibble: 2 x 6
#>   name    proportion_male proportion_female gender year_min year_max
#>   <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
#> 1 Hillary              1.                0. male      1930.    1930.
#> 2 Madison              1.                0. male      1930.    1930.
gender(c("Madison", "Hillary"), years = c(2000, 2010), method = "ssa")
#> # A tibble: 2 x 6
#>   name    proportion_male proportion_female gender year_min year_max
#>   <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
#> 1 Hillary         0.00550             0.994 female    2000.    2010.
#> 2 Madison         0.00460             0.995 female    2000.    2010.
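
The proportions in the output above can be read as shares of recorded births for a name within the requested period. The following base R sketch illustrates that idea on invented counts. The data frame toy_counts and the helper toy_proportions() are hypothetical names introduced here; this is an assumption about the general approach, not the package's actual implementation or data.

```r
# Toy counts of births by name, year, and recorded sex.
# These numbers are invented for illustration; they are NOT Social Security data.
toy_counts <- data.frame(
  name = c("leslie", "leslie", "leslie", "leslie"),
  year = c(1930, 1930, 2000, 2000),
  sex  = c("F", "M", "F", "M"),
  n    = c(200, 800, 900, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pool counts over a year range and return proportions.
toy_proportions <- function(counts, name, year_min, year_max = year_min) {
  rows <- counts[counts$name == tolower(name) &
                   counts$year >= year_min &
                   counts$year <= year_max, ]
  if (nrow(rows) == 0) return(NULL)  # name not recorded in this period

  female <- sum(rows$n[rows$sex == "F"])
  male   <- sum(rows$n[rows$sex == "M"])
  total  <- female + male

  data.frame(
    name              = name,
    proportion_male   = male / total,
    proportion_female = female / total,
    # Simple majority rule; exact ties fall to "female" in this toy version.
    gender            = if (male > female) "male" else "female",
    year_min          = year_min,
    year_max          = year_max,
    stringsAsFactors  = FALSE
  )
}

early  <- toy_proportions(toy_counts, "Leslie", 1930)
late   <- toy_proportions(toy_counts, "Leslie", 2000)
pooled <- toy_proportions(toy_counts, "Leslie", 1930, 2000)
```

With these invented counts, the same name comes out differently in 1930 and in 2000, which is why the year or range of years matters as much as the name.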

See the package vignette for a fuller introduction and suggestions on how to use the gender() function efficiently with large datasets.

vignette(topic = "predicting-gender", package = "gender")
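
One common way to keep lookups cheap on large datasets is to predict once per distinct combination of name and year and then join the result back to the full table. The base R sketch below shows that pattern. It is an assumption about what efficient use might look like, not a summary of the vignette. The people data frame and predict_pair() are hypothetical; predict_pair() is a placeholder that returns a label, not a real prediction.

```r
# Hypothetical input: several rows but only a few distinct name/year pairs.
people <- data.frame(
  first_name = c("Madison", "Hillary", "Madison", "Madison", "Hillary"),
  birth_year = c(1930, 1930, 2005, 1930, 1930),
  stringsAsFactors = FALSE
)

# Placeholder for an expensive lookup. It only counts how often it is called
# and returns a label; a real workflow would call a prediction function here.
calls <- 0
predict_pair <- function(name, year) {
  calls <<- calls + 1
  paste0("lookup for ", name, " in ", year)
}

# Step 1: reduce the data to distinct name/year combinations.
distinct_pairs <- unique(people[, c("first_name", "birth_year")])

# Step 2: run the lookup once per distinct combination.
distinct_pairs$prediction <- mapply(
  predict_pair,
  distinct_pairs$first_name,
  distinct_pairs$birth_year,
  USE.NAMES = FALSE
)

# Step 3: join the predictions back onto every original row.
result <- merge(people, distinct_pairs,
                by = c("first_name", "birth_year"),
                all.x = TRUE)
```

Here the placeholder runs three times for five rows. The saving grows with the amount of repetition in the data.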

To read the documentation for the datasets, install the genderdata package and then examine the included datasets.

data(package = "genderdata")


If you use this package, I would appreciate a citation. You can see up-to-date citation information with citation("gender"). You can cite either the package or the accompanying journal article.

Lincoln Mullen, gender: Predict Gender from Names Using Historical Data. R package version 0.5.2.

Cameron Blevins and Lincoln Mullen, “Jane, John … Leslie? A Historical Method for Algorithmic Gender Prediction,” Digital Humanities Quarterly 9, no. 3 (2015).



gender 0.5.2

  • bugfix for change in the API (#50)

gender 0.5.1

  • bugfix for some users who cannot install the genderdata package as binary

gender 0.5.0

  • genderdata package is installed using install.packages() from the rOpenSci package repository instead of using install_github().
  • all functions always return data frames
  • general performance improvements
  • calls to API no longer fail if the name does not exist
  • new function gender_df() efficiently applies gender() to data frames
  • add North Atlantic Population Project dataset for six European countries

gender 0.4.3

  • updates as requested by CRAN

gender 0.4.2

  • bugfix: Kantrowitz method is now case-insensitive
  • updates to title and descriptions according to CRAN policy

gender 0.4.1

  • tests and vignettes run without depending on the genderdata package
  • users will be prompted to install the genderdata package from GitHub the first time that it is necessary
  • added a demo mode with a minimal dataset

gender 0.4

  • data is now external to the gender package and is available in the genderdata package.
  • genderdata package can be installed with a new function

gender 0.3

  • rewrote all functions to take only character vectors, not data frames, but provided instructions on how to use with data frames
  • wrote a vignette describing the data sources and explaining the historical methodology behind this package

gender 0.2

  • implemented an ipums method that predicts gender before 1930 using U.S. Census data from IPUMS (contributed by Benjamin Schmidt).
  • upgraded dependency on dplyr to 0.2.

gender 0.1

  • function gender implements gender lookup for names and data frames
  • implemented finding gender by using the Kantrowitz names corpus
  • implemented finding gender by using the national Social Security Administration data for names and dates of birth



0.6.0 by Lincoln Mullen, 12 days ago


Authors: Lincoln Mullen [aut, cre], Cameron Blevins [ctb], Ben Schmidt [ctb]


MIT + file LICENSE license

Imports dplyr, httr, jsonlite, remotes

Depends on utils, stats

Suggests genderdata, ggplot2, knitr, testthat, rmarkdown, covr

Imported by qdap.

Suggested by corpustools.
