Lexicons for Text Analysis

A collection of lexical hash tables, dictionaries, and word lists.


lexicon

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

Description

lexicon is a collection of lexical hash tables, dictionaries, and word lists. The data prefixes help to categorize the data types:

Prefix      Meaning
key_        A data.frame with a lookup and return value
hash_       A keyed data.table hash table
freq_       A data.table of terms with frequencies
profanity_  A profane words vector
pos_        A part of speech vector
pos_df_     A part of speech data.frame
sw_         A stopword vector
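
As a rough illustration of how a key_ style data.frame (a lookup column plus a return value column) might be used, the sketch below builds a small made-up table in base R and maps terms to values with match(). The toy_key object, its x and y column names, and the lookup helper are assumptions for illustration only; they are not data or functions shipped with lexicon.

```r
# Hypothetical toy key: a lookup column (x) and a return value column (y).
# These rows are invented for illustration and are not taken from lexicon.
toy_key <- data.frame(
    x = c("good", "bad", "okay"),
    y = c(1, -1, 0.25),
    stringsAsFactors = FALSE
)

# Hypothetical helper: return the value for each term; unmatched terms give NA.
lookup <- function(terms, key) {
    key$y[match(tolower(terms), key$x)]
}

scores <- lookup(c("Good", "bad", "unknown"), toy_key)
```

The hash_ data sets serve the same purpose but are keyed data.table objects, which is generally faster for large lookups than a plain match() on a data.frame.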

Data

Data                              Description
common_names                      First Names (U.S.)
constraining_loughran_mcdonald    Loughran-McDonald Constraining Words
discourse_markers_alemany         Alemany's Discourse Markers
dodds_sentiment                   Language Assessment by Mechanical Turk Sentiment Words
emojis_sentiment                  Emoji Sentiment Data
enable_word_list                  ENABLE Word List
freq_first_names                  Frequent U.S. First Names
freq_last_names                   Frequent U.S. Last Names
function_words                    Function Words
grady_augmented                   Augmented List of Grady Ward's English Words and Mark Kantrowitz's Names List
hash_emojis                       Emoji Description Lookup Table
hash_emojis_identifier            Emoji Identifier Lookup Table
hash_emoticons                    Emoticons
hash_grady_pos                    Grady Ward's Moby Parts of Speech
hash_internet_slang               List of Internet Slang and Corresponding Meanings
hash_lemmas                       Lemmatization List
hash_power                        Power Lookup Key
hash_sentiment_emojis             Emoji Sentiment Polarity Lookup Table
hash_sentiment_huliu              Hu Liu Polarity Lookup Table
hash_sentiment_inquirer           Inquirer Polarity Lookup Table
hash_sentiment_jockers            Jockers Sentiment Polarity Table
hash_sentiment_jockers_rinker     Combined Jockers & Rinker Polarity Lookup Table
hash_sentiment_loughran_mcdonald  Loughran-McDonald Polarity Table
hash_sentiment_nrc                NRC Sentiment Polarity Table
hash_sentiment_senticnet          Augmented SenticNet Polarity Table
hash_sentiment_sentiword          Augmented Sentiword Polarity Table
hash_sentiment_slangsd            SlangSD Sentiment Polarity Table
hash_sentiment_socal_google       SO-CAL Google Polarity Table
hash_sentiment_vadar              Filtered VADAR Polarity Table
hash_strength                     Strength Lookup Key
hash_syllable                     Syllable Counts
hash_valence_shifters             Valence Shifters
key_abbreviation                  Common Abbreviations
key_contractions                  Contraction Conversions
key_grade                         Grades Hash
key_rating                        Ratings Data Set
key_sentiment_jockers             Jockers Sentiment Data Set
modal_loughran_mcdonald           Loughran-McDonald Modal List
nrc_emotions                      NRC Emotions
pos_action_verb                   Action Word List
pos_adverb                        Adverb Word List
pos_df_irregular_nouns            Irregular Nouns Word Dataframe
pos_df_pronouns                   Pronouns
pos_interjections                 Interjections
pos_preposition                   Preposition Words
pos_unchanging_nouns              Nouns that are the Same Plural/Singular
profanity_alvarez                 Alejandro U. Alvarez's List of Profane Words
profanity_arr_bad                 Stackoverflow user2592414's List of Profane Words
profanity_banned                  bannedwordlist.com's List of Profane Words
profanity_google                  Google's List of Profane Words
profanity_von_ahn                 Luis von Ahn's List of Profane Words
sw_buckley_salton                 Buckley & Salton Stopword List
sw_dolch                          Leveled Dolch List of 220 Common Words
sw_fry_100                        Fry's 100 Most Commonly Used English Words
sw_fry_1000                       Fry's 1000 Most Commonly Used English Words
sw_fry_200                        Fry's 200 Most Commonly Used English Words
sw_fry_25                         Fry's 25 Most Commonly Used English Words
sw_jockers                        Matthew Jockers's Expanded Topic Modeling Stopword List
sw_loughran_mcdonald_long         Loughran-McDonald Long Stopword List
sw_loughran_mcdonald_short        Loughran-McDonald Short Stopword List
sw_lucene                         Lucene Stopword List
sw_mallet                         MALLET Stopword List
sw_onix                           Onix Text Retrieval Toolkit Stopword List 1
sw_python                         Python Stopword List

Installation

To download the development version of lexicon:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/lexicon")

Contact

You are welcome to:

News

Versioning

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions without breaking backward compatibility bump the minor (and reset the patch)
  • Bug fixes and misc changes bump the patch
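
The bumping rules above can be sketched in a few lines of base R. The bump_version function is a hypothetical helper written for this illustration; it is not part of lexicon and only handles plain three-part version strings.

```r
# Hypothetical helper illustrating the major/minor/patch bumping rules.
bump_version <- function(version, part = c("major", "minor", "patch")) {
    part <- match.arg(part)

    # Split "0.7.4" into its three integer components.
    v <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
    stopifnot(length(v) == 3L, !anyNA(v))
    names(v) <- c("major", "minor", "patch")

    if (part == "major") {
        # Breaking change: bump major, reset minor and patch.
        v <- c(v[["major"]] + 1L, 0L, 0L)
    } else if (part == "minor") {
        # Backward compatible addition: bump minor, reset patch.
        v <- c(v[["major"]], v[["minor"]] + 1L, 0L)
    } else {
        # Bug fix or misc change: bump patch only.
        v <- c(v[["major"]], v[["minor"]], v[["patch"]] + 1L)
    }
    paste(v, collapse = ".")
}

bump_version("0.7.4", "patch")  # "0.7.5"
bump_version("0.7.4", "minor")  # "0.8.0"
bump_version("0.7.4", "major")  # "1.0.0"
```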

lexicon 0.7.0 - 0.7.4

BUG FIXES

  • emojis_sentiment & hash_emojis contained non-ASCII characters. These have been removed.

NEW FEATURES

  • hash_sentiment_socal_google and hash_sentiment_slangsd sentiment hash tables added for use in the sentimentr package.

  • hash_internet_slang added as a lexicon to map slang to understood meaning.

  • enable_word_list added. This is the Enhanced North American Benchmark Lexicon (ENABLE) which is used in the game Words With Friends (https://en.wikipedia.org/wiki/Words_with_Friends).

CHANGES

  • The columns n_pos, space, & primary have been removed from the hash_grady_pos data set to save space. The grady_pos_feature function can be used to re-add these columns.

lexicon 0.5.0 - 0.6.3

NEW FEATURES

  • sw_mallet, sw_jockers, sw_python, sw_lucene, sw_loughran_mcdonald_short, & sw_loughran_mcdonald_long stopword lists added.

  • hash_sentiment_senticnet, hash_sentiment_vadar, hash_sentiment_inquirer, hash_sentiment_loughran_mcdonald, hash_sentiment_emojis & hash_sentiment_jockers_rinker sentiment hash tables added for use in the sentimentr package.

  • modal_loughran_mcdonald added; a data.table of weak, moderate, and strong modal verbs.

  • constraining_loughran_mcdonald added, a vector of words that are associated with constraining.

  • hash_emojis and emojis_sentiment data sets added for text analysis with emojis.

IMPROVEMENTS

  • hash_valence_shifters added the following negators: "daren't", "hadn't", "needn't", "oughtn't"; the following amplifiers: "absolutely", "considerably", "decidedly", "especially", "majorly", "most", "uber"; the following de-amplifiers: "almost", "kind of", "kinda", "partly", "somewhat", "sort of", "sorta". In addition, all contraction negators were re-added to hash_valence_shifters sans apostrophe, as cleaning or less formal writing may result in contractions without apostrophes.
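
The apostrophe-free variants described in this entry could be derived along the following lines. This is a base R sketch and not the code used to build hash_valence_shifters; the negators vector simply reuses the four contractions listed in the entry, and the other object names are made up for the example.

```r
# The four contraction negators named in the entry above.
negators <- c("daren't", "hadn't", "needn't", "oughtn't")

# Drop the apostrophe to get the less formal spelling of each contraction.
no_apostrophe <- gsub("'", "", negators, fixed = TRUE)

# Keep both spellings so text with or without apostrophes is matched.
negators_all <- unique(c(negators, no_apostrophe))
```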

lexicon 0.4.0 - 0.4.1

BUG FIXES

  • function_words contained duplicates that have been removed.

  • hash_lemmas contained an erroneous token-lemma pair (also-conjurer). This was spotted by Mitchell Linegar (see https://github.com/trinker/textstem/issues/5). The token "also" has been removed from the dictionary.

NEW FEATURES

  • pos_df_irregular_nouns and pos_unchanging_nouns added. The former is a data.frame of singular and plural forms of irregular nouns. The latter is a simple list of irregular nouns that have the same singular and plural forms.

  • profanity_alvarez, profanity_arr_bad, profanity_banned, profanity_google, & profanity_von_ahn added to give access to profanity word lists.

lexicon 0.3.0 - 0.3.1

BUG FIXES

  • freq_first_names and freq_last_names were just a string of the data set name. This has been updated with the actual data set.

NEW FEATURES

  • available_data added to see what data sets are available in lexicon.

lexicon 0.2.0

NEW FEATURES

  • hash_sentiment_jockers and key_sentiment_jockers added as objects; although they are not data objects, for all practical purposes they behave the same. These data sets come from syuzhet's custom dictionary built by Jockers.

CHANGES

  • hash_sentiment and hash_sentiword renamed to hash_sentiment_huliu and hash_sentiment_sentiword for consistency.

lexicon 0.1.1

NEW FEATURES

  • hash_grady_pos added to provide a lookup of Grady's parts of speech for words.

  • hash_lemmas added to provide a lookup of Mechura's lemmatization list.

  • hash_sentiment_jockers and key_sentiment_jockers added as objects; although they are not data objects, for all practical purposes they behave the same. These data sets come from syuzhet's custom dictionary built by Jockers.

lexicon 0.1.0

NEW FEATURES

  • The ratings and grades keys from sentimentr have been moved to the lexicon package and renamed to key_rating and key_grade.

IMPROVEMENTS

  • Added the positive terms 'spot on', 'on time', & 'on point' to hash_sentiment.

lexicon 0.0.1

This package is a collection of lexical hash tables, dictionaries, and word lists.

install.packages("lexicon")

1.0.0 by Tyler Rinker, 2 months ago


https://github.com/trinker/lexicon


Report a bug at https://github.com/trinker/lexicon/issues?state=open


Browse source code at https://github.com/cran/lexicon


Authors: Tyler Rinker [aut, cre, cph], University of Notre Dame [dtc, cph], Department of Knowledge Technologies [dtc, cph], Unicode, Inc. [dtc, cph], John Higgins [dtc, cph], Grady Ward [dtc], Heiko Possel [dtc], Michal Boleslav Mechura [dtc, cph], Bing Liu [dtc], Minqing Hu [dtc], Saif M. Mohammad [dtc], Peter Turney [dtc], Erik Cambria [dtc], Soujanya Poria [dtc], Rajiv Bajpai [dtc], Bjoern Schuller [dtc], SentiWordNet [dtc, cph], Liang Wu [dtc, cph], Fred Morstatter [dtc, cph], Huan Liu [dtc, cph], Grammar Revolution [dtc, cph], Vidar Holen [dtc, cph], Alejandro U. Alvarez [dtc, cph], Stackoverflow User user2592414 [dtc, cph], BannedWordList.com [dtc, cph], Apache Software Foundation [dtc, cph], Andrew Kachites McCallum [dtc, cph], Alireza Savand [dtc, cph]



GPL-3 license


Imports data.table, syuzhet


Imported by sentimentr, textclean, textstem.

