Lexicons for Text Analysis

A collection of lexical hash tables, dictionaries, and word lists.


lexicon

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

Description

lexicon is a collection of lexical hash tables, dictionaries, and word lists. The data prefixes help to categorize the data types:

Prefix      Meaning
key_        A data.frame with a lookup and return value
hash_       A keyed data.table hash table
freq_       A data.table of terms with frequencies
profanity_  A profane words vector
pos_        A part of speech vector
pos_df_     A part of speech data.frame
sw_         A stopword vector
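
As a rough illustration of how these prefixes can be used, the sketch below maps a data set name to its prefix category using base R only. The prefix_key table and the describe_prefix() helper are hypothetical names introduced for this example; they are not part of lexicon.

```r
# Hypothetical lookup table built from the prefix list above.
# "pos_df_" is listed before "pos_" so the longer prefix is matched first.
prefix_key <- data.frame(
    prefix = c("key_", "hash_", "freq_", "profanity_", "pos_df_", "pos_", "sw_"),
    meaning = c(
        "A data.frame with a lookup and return value",
        "A keyed data.table hash table",
        "A data.table of terms with frequencies",
        "A profane words vector",
        "A part of speech data.frame",
        "A part of speech vector",
        "A stopword vector"
    ),
    stringsAsFactors = FALSE
)

# Hypothetical helper: return the meaning of the first matching prefix,
# or NA when the name carries none of the listed prefixes.
describe_prefix <- function(name) {
    hit <- which(startsWith(name, prefix_key$prefix))
    if (length(hit) == 0) return(NA_character_)
    prefix_key$meaning[hit[1]]
}

describe_prefix("hash_lemmas")
describe_prefix("pos_df_pronouns")
describe_prefix("common_names")  # no listed prefix, so NA
```

Names without a listed prefix (for example common_names or grady_augmented) fall outside this scheme, so the helper returns NA for them.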

Data

Data                                 Description
common_names                         First Names (U.S.)
constraining_loughran_mcdonald       Loughran-McDonald Constraining Words
emojis_sentiment                     Emoji Sentiment Data
freq_first_names                     Frequent U.S. First Names
freq_last_names                      Frequent U.S. Last Names
function_words                       Function Words
grady_augmented                      Augmented List of Grady Ward's English Words and Mark Kantrowitz's Names List
hash_emojis                          Emoji Description Lookup Table
hash_emojis_identifier               Emoji Identifier Lookup Table
hash_emoticons                       Emoticons
hash_grady_pos                       Grady Ward's Moby Parts of Speech
hash_internet_slang                  List of Internet Slang and Corresponding Meanings
hash_lemmas                          Lemmatization List
hash_sentiment_emojis                Emoji Sentiment Polarity Lookup Table
hash_sentiment_huliu                 Hu Liu Polarity Lookup Table
hash_sentiment_jockers               Jockers Sentiment Polarity Table
hash_sentiment_jockers_rinker        Combined Jockers & Rinker Polarity Lookup Table
hash_sentiment_loughran_mcdonald     Loughran-McDonald Polarity Table
hash_sentiment_nrc                   NRC Sentiment Polarity Table
hash_sentiment_senticnet             Augmented SenticNet Polarity Table
hash_sentiment_sentiword             Augmented Sentiword Polarity Table
hash_sentiment_slangsd               SlangSD Sentiment Polarity Table
hash_sentiment_socal_google          SO-CAL Google Polarity Table
hash_valence_shifters                Valence Shifters
key_contractions                     Contraction Conversions
key_corporate_social_responsibility  Nadra Pencle and Irina Malaescu's Corporate Social Responsibility Dictionary
key_grade                            Grades Data Set
key_rating                           Ratings Data Set
key_regressive_imagery               Colin Martindale's English Regressive Imagery Dictionary
key_sentiment_jockers                Jockers Sentiment Data Set
modal_loughran_mcdonald              Loughran-McDonald Modal List
nrc_emotions                         NRC Emotions
pos_action_verb                      Action Word List
pos_df_irregular_nouns               Irregular Nouns Word Dataframe
pos_df_pronouns                      Pronouns
pos_interjections                    Interjections
pos_preposition                      Preposition Words
profanity_alvarez                    Alejandro U. Alvarez's List of Profane Words
profanity_arr_bad                    Stackoverflow user2592414's List of Profane Words
profanity_banned                     bannedwordlist.com's List of Profane Words
profanity_racist                     Titus Wormer's List of Racist Words
profanity_zac_anger                  Zac Anger's List of Profane Words
sw_dolch                             Leveled Dolch List of 220 Common Words
sw_fry_100                           Fry's 100 Most Commonly Used English Words
sw_fry_1000                          Fry's 1000 Most Commonly Used English Words
sw_fry_200                           Fry's 200 Most Commonly Used English Words
sw_fry_25                            Fry's 25 Most Commonly Used English Words
sw_jockers                           Matthew Jockers's Expanded Topic Modeling Stopword List
sw_loughran_mcdonald_long            Loughran-McDonald Long Stopword List
sw_loughran_mcdonald_short           Loughran-McDonald Short Stopword List
sw_lucene                            Lucene Stopword List
sw_mallet                            MALLET Stopword List
sw_python                            Python Stopword List

Installation

To download the development version of lexicon:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/lexicon")

Contact

You are welcome to:

News

Versioning

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions that do not break backward compatibility bump the minor (and reset the patch)
  • Bug fixes and miscellaneous changes bump the patch
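
The guidelines above can be sketched as a small base R function. This is only an illustration of the numbering rules; bump_version() is a hypothetical helper and is not part of lexicon or of its release tooling.

```r
# Hypothetical helper: bump one component of a "major.minor.patch" string
# following the three guidelines above.
bump_version <- function(version, part = c("major", "minor", "patch")) {
    part <- match.arg(part)
    v <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
    stopifnot(length(v) == 3, !anyNA(v))
    bumped <- switch(part,
        major = c(v[1] + 1L, 0L, 0L),      # breaking change: reset minor and patch
        minor = c(v[1], v[2] + 1L, 0L),    # compatible addition: reset patch
        patch = c(v[1], v[2], v[3] + 1L)   # bug fix or miscellaneous change
    )
    paste(bumped, collapse = ".")
}

bump_version("1.1.3", "patch")  # "1.1.4"
bump_version("1.1.3", "minor")  # "1.2.0"
bump_version("1.1.3", "major")  # "2.0.0"
```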

lexicon 1.0.1 - 1.1.3

BUG FIXES

  • hash_lemmas gave the lemma of "as" as "a". This was incorrect (spotted by Jonathan Bratt).

  • hash_lemmas had leading spaces before 2 tokens (" furtherst", " skilled").
    This extra white space has been stripped.

  • The hash_sentiment_senticnet dictionary contained "sparsely", which is also contained in hash_valence_shifters. This term has been dropped from the hash_sentiment_senticnet dictionary. See #12 for more info.

NEW FEATURES

  • profanity_zac_anger added to provide a longer list of profane words.

  • profanity_racist added to provide a profanity list aimed specifically at detecting racist terms.

  • key_regressive_imagery added to provide R users with access to Colin Martindale's (1975, 1990) English Regressive Imagery Dictionary (RID). The RID is a text analysis coding taxonomy that can be used to measure the degree to which a text is primordial vs. conceptual.

  • key_corporate_social_responsibility added to provide R users with access to Pencle & Mălăescu's Corporate Social Responsibility (CSR) Dictionary.

MINOR FEATURES

  • available_data picks up a regex argument to search for specific substrings and return matching rows.
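
The idea of filtering a data listing by a regular expression can be illustrated without the package installed. The sketch below is not the implementation of available_data; the listing data frame (a few rows copied from the Data section above) and the search_listing() helper are hypothetical, and matching against the data set name rather than the description is an assumption made for this example.

```r
# Hypothetical miniature listing; rows copied from the Data section above
listing <- data.frame(
    Data = c("hash_lemmas", "hash_sentiment_nrc", "sw_fry_100", "profanity_racist"),
    Description = c(
        "Lemmatization List",
        "NRC Sentiment Polarity Table",
        "Fry's 100 Most Commonly Used English Words",
        "Titus Wormer's List of Racist Words"
    ),
    stringsAsFactors = FALSE
)

# Hypothetical helper: return all rows when regex is NULL, otherwise only
# the rows whose name matches the regular expression.
search_listing <- function(regex = NULL) {
    if (is.null(regex)) return(listing)
    listing[grepl(regex, listing$Data, perl = TRUE), , drop = FALSE]
}

search_listing("^hash_")               # the two hash_ rows
search_listing("sentiment|profanity")  # rows matching either substring
```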

IMPROVEMENTS

  • hash_sentiment_jockers_rinker now contains the word 'fuckin'. Additionally, the word 'fucking' has a milder negative value because this word, though often used as a negator, is also used as an amplifier. Reducing its weight allows more positive words to have more pull, but if no polarized words exist 'fucking' will still keep the typical negative direction of the clause.

install.packages("lexicon")

1.1.3 by Tyler Rinker, 2 months ago


https://github.com/trinker/lexicon


Report a bug at https://github.com/trinker/lexicon/issues?state=open


Browse source code at https://github.com/cran/lexicon


Authors: Tyler Rinker [aut, cre, cph], University of Notre Dame [dtc, cph], Department of Knowledge Technologies [dtc, cph], Unicode, Inc. [dtc, cph], John Higgins [dtc, cph], Grady Ward [dtc], Heiko Possel [dtc], Michal Boleslav Mechura [dtc, cph], Bing Liu [dtc], Minqing Hu [dtc], Saif M. Mohammad [dtc], Peter Turney [dtc], Erik Cambria [dtc], Soujanya Poria [dtc], Rajiv Bajpai [dtc], Bjoern Schuller [dtc], SentiWordNet [dtc, cph], Liang Wu [dtc, cph], Fred Morstatter [dtc, cph], Huan Liu [dtc, cph], Grammar Revolution [dtc, cph], Vidar Holen [dtc, cph], Alejandro U. Alvarez [dtc, cph], Stackoverflow User user2592414 [dtc, cph], BannedWordList.com [dtc, cph], Apache Software Foundation [dtc, cph], Andrew Kachites McCallum [dtc, cph], Alireza Savand [dtc, cph], Zac Anger [dtc, cph], Titus Wormer [dtc, cph], Colin Martindale [dtc, cph], John Wiseman [dtc, cph], Nadra Pencle [dtc, cph], Irina Mălăescu [dtc, cph]



GPL-3 license


Imports data.table, syuzhet


Imported by sentimentr, textclean, textstem.

