# Extractive Summarization of Text with the LexRank Algorithm

An R implementation of the LexRank algorithm described by G. Erkan and D. R. Radev (2004).


## Installation

```r
# install from CRAN
install.packages("lexRankr")

# install from this github repo
devtools::install_github("AdamSpannbauer/lexRankr")
```

## Overview

lexRankr is an R implementation of the LexRank algorithm discussed by Güneş Erkan & Dragomir R. Radev in *LexRank: Graph-based Lexical Centrality as Salience in Text Summarization*. LexRank is designed to summarize a cluster of documents by identifying the sentences that subsume the most information in that particular set of documents; it may not perform well on a set of unclustered/unrelated documents. As the paper's title suggests, sentences are ranked based on their centrality in a graph. The graph is built from the pairwise similarities of the sentences (where similarity is measured with a modified idf cosine similarity function). The paper describes multiple ways to calculate centrality, and these options are available in the R package. Sentences can be ranked according to their degree centrality or by using the PageRank algorithm (both of these methods require setting a minimum similarity threshold for a sentence pair to be included in the graph). A third variation is Continuous LexRank, which does not require a minimum similarity threshold but instead uses the weighted graph of sentences as the input to PageRank.
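For reference, the idf-modified cosine similarity that the paper defines for two sentences x and y (the measure the package's similarity computation is based on) is:

```latex
\mathrm{idf\mbox{-}modified\mbox{-}cosine}(x, y) =
  \frac{\sum_{w \in x,\, y} \mathrm{tf}_{w,x}\, \mathrm{tf}_{w,y}\, (\mathrm{idf}_{w})^{2}}
       {\sqrt{\sum_{x_i \in x} \left(\mathrm{tf}_{x_i,x}\, \mathrm{idf}_{x_i}\right)^{2}}
        \times
        \sqrt{\sum_{y_i \in y} \left(\mathrm{tf}_{y_i,y}\, \mathrm{idf}_{y_i}\right)^{2}}}
```

where tf(w, s) is the number of occurrences of word w in sentence s, and idf(w) is the inverse document frequency of w.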

There are currently 7 functions in the package. The main function is lexRank(); the remaining 6 serve as helpers to it (or as a means for the user to build up, step by step, to the output of lexRank()).

## Basic Usage

#### lexRank

```r
library(lexRankr)

# note 1: the LexRank algorithm is designed to work on a cluster of documents.
#         LexRank is built on the idea that a cluster of docs will focus on similar topics.

# note 2: pairwise sentence similarity is calculated for the entire set of documents
#         passed to the function. This can be a computationally intensive process
#         (especially with a large set of documents).

# docs should be a character vector with one element per document
(topSentsDf <- lexRank(docs))
```
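The centrality variants described in the Overview are selected through arguments to lexRank(). Below is a minimal sketch assuming the threshold, usePageRank, and continuous arguments documented in ?lexRank; verify the names against your installed version:

```r
# degree centrality: threshold the similarity graph and skip PageRank
topSentsDegree <- lexRank(docs, threshold = 0.2, usePageRank = FALSE)

# PageRank on the thresholded graph (the default behavior)
topSentsPageRank <- lexRank(docs, threshold = 0.2, usePageRank = TRUE)

# Continuous LexRank: PageRank on the fully weighted graph, no threshold needed
topSentsContinuous <- lexRank(docs, usePageRank = TRUE, continuous = TRUE)
```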

#### Using helper functions for lexRank

```r
library(lexRankr)

# note 1: the LexRank algorithm is designed to work on a cluster of documents.
#         LexRank is built on the idea that a cluster of docs will focus on similar topics.

# note 2: pairwise sentence similarity is calculated for the entire set of documents
#         passed to the function. This can be a computationally intensive process
#         (especially with a large set of documents).

# parse documents into sentences and tokens
sentenceTokenList <- sentenceTokenParse(docs)
sentenceDf <- sentenceTokenList$sentences
tokenDf <- sentenceTokenList$tokens

# compute pairwise sentence similarity (modified idf cosine)
similDf <- sentenceSimil(tokenDf$sentenceId, tokenDf$token, tokenDf$docId)

# rank sentences by centrality and return the top sentence ids
topSentIdsDf <- lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal)

# join the ranked ids back to the sentence text
(topSentsDf <- dplyr::inner_join(sentenceDf, topSentIdsDf, by = c("sentenceId" = "sentenceId")))
```
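lexRankFromSimil exposes the same centrality choices at this lower level. A minimal sketch, again assuming the threshold, usePageRank, and continuous arguments documented in ?lexRankFromSimil:

```r
# degree centrality: drop sentence pairs below the similarity threshold, skip PageRank
topSentIdsDegree <- lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal,
                                     threshold = 0.2, usePageRank = FALSE)

# Continuous LexRank: PageRank over the fully weighted similarity graph
topSentIdsContinuous <- lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal,
                                         continuous = TRUE)
```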
 

## Example with Twitter

```r
library(jsonlite)
library(httr)

library(lexRankr)

#########################################################
#FUNCTION TO GET TEXT OF USER'S LAST N TWEETS ON TIMELINE#
#########################################################
getUserTweets <- function(user, n, consKey, consSecret, token, tokenSecret) {

  #SET UP FOR API
  auth <- oauth_app("twitter", key=consKey, secret=consSecret)
  sig  <- sign_oauth1.0(auth, token=token, token_secret=tokenSecret)

  #INITIALIZE COUNTERS AND STORAGE
  nLeft  <- n
  i <- 0
  tweetText <- character(0)
  #LOOP UNTIL n IS MET
  while (nLeft > 0) {
    nToGet <- min(200, nLeft)
    i <- i+1

    #SET MAX ID IF i > 1 (MAX ID WILL KEEP TWEETS FROM BEING DUPLICATED IN GET)
    if (i==1) {
      GETurl <- paste0("https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=", user, "&count=", nToGet)
    } else {
      GETurl <- paste0("https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=", user, "&count=", nToGet, "&max_id=", max_id)
    }

    #GET TWEETS
    timelineRaw <- GET(GETurl, sig)

    #EXTRACT TWEET TEXT FROM GET
    timelineContent <- content(timelineRaw)
    jsonRaw <- toJSON(timelineContent)
    jsonConv <- iconv(jsonRaw, "UTF-8", "ASCII", sub = "") #clean out emoji and other special chars
    jsonConv <- gsub("\003", "", jsonConv) #special character (^C) not caught by above clean
    jsonClean <- fromJSON(jsonConv)

    #APPEND TWEET TEXT
    tweetText <- c(tweetText, unlist(jsonClean$text))

    #STORE MAX ID FOR USE IN NEXT GETurl
    suppressWarnings(max_id <- min(unlist(jsonClean$id)))

    #UPDATE NUMBER OF TWEETS LEFT TO MEET n PARAMETER
    nLeft <- nLeft-nToGet
  }

  return(unique(tweetText))
}


##################################################################################
#FUNCTION USING LEXRANKR TO FIND THE MOST REPRESENTATIVE TWEETS OF A USER'S TIMELINE#
##################################################################################
tweetRankr <- function(tweetText, dropMentions=TRUE, dropHashtags=TRUE, n=5, returnTies=TRUE, printTweets = TRUE) {

  #store original tweet text
  tweetTextOg <- tweetText

  #remove instances of @username from text
  if (dropMentions) tweetText <- gsub("\\@\\w+","",tweetText)
  #remove instances of #topic from text
  if (dropHashtags) tweetText <- gsub("\\#\\w+","",tweetText)
  #clean up any multiple spaces introduced by modifying text
  tweetText <- trimws(gsub("\\s+", " ", tweetText))

  #parse text and create doc/sent Ids for each token
  sentenceTokenList <- lexRankr::sentenceTokenParse(tweetText)
  #store token data.frame from list output
    #token data.frame has columns for the token, document Id, and sentence Id
  tweetTokenDf <- sentenceTokenList$tokens

  #compute pairwise tweet similarity
    #using document id from token df instead of sentence id
    #using docId will find most central tweets as opposed to most central sentences within tweets
  similDf <- lexRankr::sentenceSimil(tweetTokenDf$docId, tweetTokenDf$token, tweetTokenDf$docId)

  #apply lexRank algorithm to return top n tweet ids
  topTweetIdsDf <- lexRankr::lexRankFromSimil(similDf$sent1, similDf$sent2, similDf$similVal, n = n, returnTies = returnTies)

  #subset tweet text variable to extract most central tweets according to lexrank
  topTweets <- tweetTextOg[as.numeric(topTweetIdsDf$sentenceId)]

  if (printTweets) cat(paste(paste0(seq_along(topTweets), ") ", topTweets), collapse="\n\n"))
  invisible(topTweets)
}

consKey     <- 'consumer key'
consSecret  <- 'consumer secret'
token       <- 'access token'
tokenSecret <- 'access token secret'

###################
# Hillary Clinton #
###################
tweetTextHill <- getUserTweets(user="HillaryClinton", n=5000, consKey, consSecret, token, tokenSecret)
topTweetsHill <- tweetRankr(tweetTextHill)
```
  1. "Hillary Clinton must become the next president of the United States. @BernieSanders #DemsInPhilly

  2. "In this election, Im with her!" @FLOTUS on Hillary #DemsInPhilly

  3. We are better than this.

  4. "As your president, I will always have your back." Hillary

  5. She knows that love trumps hate. @POTUS on Hillary

```r
################
# Donald Trump #
################
tweetTextTrump <- getUserTweets(user="realDonaldTrump", n=5000, consKey, consSecret, token, tokenSecret)
topTweetsTrump <- tweetRankr(tweetTextTrump)
```

```
1) "@tcloer11: @realDonaldTrump Great job! Make America Great Again!"

2) Wisconsin, we will MAKE AMERICA GREAT AGAIN!

3) MAKE AMERICA GREAT AGAIN!

4) MAKE AMERICA GREAT AGAIN! MAKE AMERICA SAFE AGAIN!

5) AMERICA FIRST!
```

```r
##################
# Bernie Sanders #
##################
tweetTextBern <- getUserTweets(user="SenSanders", n=5000, consKey, consSecret, token, tokenSecret)
topTweetsBern <- tweetRankr(tweetTextBern)
```

```
1) Unless Congress stands up for the middle class that's getting stepped on by the billionaire class, soon there won't be a middle class left.

2) The current federal minimum wage of $7.25 an hour is a starvation wage and must be raised. The minimum wage must become a living wage.

3) Weve got to stand up to the fossil fuel industry and fight for legislation that transforms our energy system away from fossil fuels.

4) There is no justice when so few have so much and so many have so little.

5) Health care is a right, not a privilege. Everyone in America should be able to access the health care they need regardless of their income.
```

```r
#############
# Rbloggers #
#############
tweetTextRblog <- getUserTweets(user="Rbloggers", n=5000, consKey, consSecret, token, tokenSecret)
topTweetsRblog <- tweetRankr(tweetTextRblog)
```

```
1) New R job: Data Scientist Machine Learning https://t.co/YiWwXkmxmc #rstats #DataScience #jobs

2) New R job: Principal Analysts x2, Senior Analyst, Analyst (@ Wellington ) http://t.co/5OLIDl51tw #rstats #jobs

3) A Few Days of Python: Using R in Python http://t.co/28j8CAYThn #rstats

4) Network visualization part 4: 3D networks https://t.co/U6U53xG679 #rstats #DataScience

5) Network visualization part 4: 3D networks https://t.co/Y625xNNr03 #rstats #DataScience
```

## News

#### lexRankr 0.3.0

* sentence similarity in sentenceSimil is now calculated using Rcpp; improves speed by ~25%-30% over the old implementation using the proxy package

#### lexRankr 0.2.0

* Added logic to avoid naming conflicts in proxy::pr_DB in sentenceSimil (#1, @AdamSpannbauer)

* Added a check and error for cases where no sentences are above the threshold in lexRankFromSimil (#2, @AdamSpannbauer)

* tokenize now has stricter punctuation removal: it removes all non-alphanumeric characters as opposed to removing only [:punct:]
