Predict Gender from Brazilian First Names

A method to predict and report gender from Brazilian first names using the Brazilian Institute of Geography and Statistics' Census data.


genderBR


genderBR predicts gender from Brazilian first names using data from the Instituto Brasileiro de Geografia e Estatistica's 2010 Census API.

How does it work?

genderBR's main function is get_gender, which takes a string with a Brazilian first name and predicts its gender using data from the IBGE's 2010 Census.

More specifically, it retrieves data on the number of females and males with the same name in Brazil, or in a given Brazilian state, and calculates the proportion of female uses of that name. The function then classifies a name as male or female only when that proportion crosses a given threshold (e.g., female if proportion > 0.9, or male if proportion <= 0.1); proportions between those thresholds are classified as missing (NA). An example:

library(genderBR)

get_gender("joão")
#> [1] "Male"
get_gender("ana")
#> [1] "Female"
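
The cutoff rule described above can be sketched in plain R without calling the IBGE API. The helper classify_prop below is hypothetical (it is not a genderBR function), and its threshold argument is an assumption made for illustration; it only mirrors the rule as stated in the text: female above the threshold, male at or below one minus the threshold, and NA otherwise.

```r
# Hypothetical helper, not part of genderBR: applies the cutoff rule described
# in the text to a vector of female-use proportions.
classify_prop <- function(prop_female, threshold = 0.9) {
  out <- rep(NA_character_, length(prop_female))
  known <- !is.na(prop_female)
  # Female only when the proportion of female uses exceeds the threshold
  out[known & prop_female > threshold] <- "Female"
  # Male only when the proportion falls at or below 1 - threshold
  out[known & prop_female <= 1 - threshold] <- "Male"
  # Everything in between (and missing input) stays NA
  out
}

# 0.95 is an illustrative value; the other two are the proportions reported
# in this README for "Ariel" in Brazil and in Rio de Janeiro
classify_prop(c(0.95, 0.09219289, 0.2627399, NA))
```

With the default threshold, the four inputs are classified as Female, Male, NA, and NA, respectively. A stricter or looser cutoff changes how many ambiguous names end up as NA.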

Multiple names can be passed in the same function call:

get_gender(c("pedro", "maria"))
#> [1] "Male"   "Female"

Full names, and names written in lower or upper case, are also accepted as inputs:

get_gender("Mario da Silva")
#> [1] "Male"
get_gender("ANA MARIA")
#> [1] "Female"
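
One plausible way to reduce inputs like these to a comparable first name is sketched below. This is an assumption about the kind of cleaning involved, not genderBR's actual code: the helper first_token is hypothetical, and it deliberately leaves accented characters untouched.

```r
# Hypothetical helper, not part of genderBR: keeps only the first word of each
# name and upper-cases it so that "ana", "Ana" and "ANA MARIA" all compare equal.
first_token <- function(full_name) {
  cleaned <- trimws(full_name)                   # drop leading/trailing blanks
  first <- sub("[[:space:]].*$", "", cleaned)    # cut at the first whitespace
  toupper(first)
}

first_token(c("Mario da Silva", "ANA MARIA", "  pedro "))
```

Under these assumptions, the three inputs reduce to MARIO, ANA, and PEDRO.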

Additionally, one can filter results by state with the state argument, or get the probability that a given first name belongs to a female person by setting the prob argument to TRUE (it defaults to FALSE).

# What is the probability that the name Ariel belongs to a female person in Brazil?
get_gender("Ariel", prob = TRUE)
#> [1] 0.09219289
 
# What about differences between Brazilian states?
get_gender("Ariel", prob = TRUE, state = "RJ") # RJ, Rio de Janeiro
#> [1] 0.2627399
get_gender("Ariel", prob = TRUE, state = "RS") # RS, Rio Grande do Sul
#> [1] 0.05144695
get_gender("Ariel", prob = TRUE, state = "SP") # SP, Sao Paulo
#> [1] 0.1294782

Note that a vector of state abbreviations is a valid input for the get_gender function, so this also works:

name <- rep("Ariel", 3)
states <- c("rj", "rs", "sp")
get_gender(name, prob = TRUE, state = states)
#> [1] 0.26273991 0.05144695 0.12947819

This is also useful for predicting the gender of different individuals living in different states:

df <- data.frame(name = c("Alberto da Silva", "Maria dos Santos", "Thiago Rocha", "Paula Camargo"),
                 uf = c("AC", "SP", "PE", "RS"),
                 stringsAsFactors = FALSE
                 )
 
df$gender <- get_gender(df$name, state = df$uf)
 
df
#>               name uf gender
#> 1 Alberto da Silva AC   Male
#> 2 Maria dos Santos SP Female
#> 3     Thiago Rocha PE   Male
#> 4    Paula Camargo RS Female

Brazilian state abbreviations

The genderBR package relies on Brazilian state abbreviations (acronyms) to filter results. To get a complete dataset with the full name, IBGE code, and abbreviation of all 27 Brazilian states, use the get_states function:

get_states()
#> # A tibble: 27 x 3
#>               state   abb  code
#>               <chr> <chr> <int>
#>  1             ACRE    AC    12
#>  2          ALAGOAS    AL    27
#>  3            AMAPA    AP    16
#>  4         AMAZONAS    AM    13
#>  5            BAHIA    BA    29
#>  6            CEARA    CE    23
#>  7 DISTRITO FEDERAL    DF    53
#>  8   ESPIRITO SANTO    ES    32
#>  9            GOIAS    GO    52
#> 10         MARANHAO    MA    21
#> # ... with 17 more rows
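
Since results are filtered by abbreviation, it can help to normalize and check a vector of states before querying. The sketch below does this offline with a hard-coded vector of the 27 abbreviations; normalize_states and br_states are hypothetical names introduced here, not genderBR objects, and the abb column returned by get_states() could be used in place of the hard-coded vector.

```r
# The 27 Brazilian state abbreviations (26 states plus the Distrito Federal)
br_states <- c("AC", "AL", "AP", "AM", "BA", "CE", "DF", "ES", "GO",
               "MA", "MT", "MS", "MG", "PA", "PB", "PR", "PE", "PI",
               "RJ", "RN", "RS", "RO", "RR", "SC", "SP", "SE", "TO")

# Hypothetical helper, not part of genderBR: upper-cases abbreviations and
# stops with an informative error if any of them is not a known state.
normalize_states <- function(state) {
  abb <- toupper(trimws(state))
  unknown <- setdiff(abb, br_states)
  if (length(unknown) > 0) {
    stop("Unknown state abbreviation(s): ", paste(unknown, collapse = ", "))
  }
  abb
}

normalize_states(c("rj", "rs", "sp"))
```

Failing early on a mistyped abbreviation is a design choice of this sketch; it avoids sending a request that cannot match any state.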

Geographic distribution of Brazilian first names

The genderBR package can also be used to get information on the relative and total number of persons with a given name by gender and by state in Brazil. To that end, use the map_gender function:

map_gender("maria")
#> # A tibble: 27 x 6
#>                   nome    uf    freq populacao  sexo     prop
#>  *               <chr> <int>   <int>     <int> <chr>    <dbl>
#>  1               Piauí    22  363139   3118360       11645.19
#>  2               Ceará    23  967042   8452381       11441.06
#>  3             Paraíba    25  423026   3766528       11231.19
#>  4 Rio Grande do Norte    24  341940   3168027       10793.47
#>  5             Alagoas    27  321330   3120494       10297.41
#>  6          Pernambuco    26  838534   8796448        9532.64
#>  7             Sergipe    28  188619   2068017        9120.77
#>  8            Maranhão    21  574689   6574789        8740.80
#>  9                Acre    12   63172    733559        8611.71
#> 10        Minas Gerais    31 1307650  19597330        6672.59
#> # ... with 17 more rows

To restrict the query to one gender, use the optional argument gender (valid inputs are f, for female; m, for male; or NULL, the default option).

map_gender("iris", gender = "m")
#> # A tibble: 23 x 6
#>                nome    uf  freq populacao  sexo  prop
#>  *            <chr> <int> <int>     <int> <chr> <dbl>
#>  1            Goiás    52   840   6003788     m 13.99
#>  2        Tocantins    17   156   1383445     m 11.28
#>  3            Bahia    29   422  14016906     m  3.01
#>  4      Mato Grosso    51    91   3035122     m  3.00
#>  5     Minas Gerais    31   512  19597330     m  2.61
#>  6 Distrito Federal    53    65   2570160     m  2.53
#>  7   Espírito Santo    32    69   3514952     m  1.96
#>  8         Rondônia    11    28   1562409     m  1.79
#>  9             Pará    15   129   7581051     m  1.70
#> 10   Rio de Janeiro    33   225  15989929     m  1.41
#> # ... with 13 more rows
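
The prop column in the outputs above appears to be the number of name holders per 100,000 inhabitants, that is, freq / populacao * 100000. This reading is inferred from the printed values rather than taken from the package documentation, so treat it as an assumption. The sketch below recomputes it for four rows copied from the two tables (identified by their uf codes).

```r
# Rows copied from the printed outputs above: uf codes 22 and 23 come from
# map_gender("maria"); 52 and 17 come from map_gender("iris", gender = "m")
rows <- data.frame(
  uf        = c(22L, 23L, 52L, 17L),
  freq      = c(363139, 967042, 840, 156),
  populacao = c(3118360, 8452381, 6003788, 1383445)
)

# Assumed definition of prop: name holders per 100,000 inhabitants
rows$per_100k <- rows$freq / rows$populacao * 1e5

round(rows$per_100k, 2)
```

Rounded to two decimals, the recomputed values agree with the printed prop column for these rows (11645.19, 11441.06, 13.99, and 11.28).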

Installing

To install genderBR's latest stable version from CRAN, use:

install.packages("genderBR")

To install a development version, use:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("meirelesff/genderBR")

Data

The data used in this package come from the Instituto Brasileiro de Geografia e Estatistica's (IBGE) 2010 Census. The surveyed population includes 190.8 million Brazilians, with more than 130,000 unique first names.

To extract the number of male and female uses of a given first name in Brazil, the package uses the IBGE's API. In this service, names with different spellings (e.g., Ana and Anna, or Marcos and Markos) are considered different occurrences, and only names with more than 20 occurrences, or more than 15 occurrences in a given state, are included in the database.

For more information on the IBGE's data, please check (in Portuguese): http://censo2010.ibge.gov.br/nomes/

Author

Fernando Meireles


1.1.0 by Fernando Meireles, 2 months ago


https://github.com/meirelesff/genderBR


Report a bug at https://github.com/meirelesff/genderBR/issues


Browse source code at https://github.com/cran/genderBR


Authors: Fernando Meireles [aut, cre]




GPL (>= 2) license


Imports dplyr, jsonlite, httr

