This package aims to make Chinese text mining easier, faster, and more robust to errors. A document-term matrix can be generated with a single line of code; encoding detection, word segmentation, and stop word removal are done automatically. Some convenience tools are also supplied.
chinese.misc NEWS and CHANGELOG
[NEW FEATURE] #1: Add a new function that conveniently converts objects among matrix, dgCMatrix, simple_triplet_matrix, DocumentTermMatrix, and TermDocumentMatrix.
[BUGFIX] #1: Only slight changes, with no user-visible effect.
[BUGFIX] #2: scancn now removes Unicode replacement characters from texts.
[BUGFIX] #1: The $v of a dtm created by corp_or_dtm no longer has names, and the dtm is compatible with topicmodels::LDA.
[NEW FEATURE] #3: A new function dictionary_dtm is added to count term frequencies in groups.
[NEW FEATURE] #2: sort_tf and word_cor now do their computation on sparse matrices to save memory, rather than first converting the object into a dense matrix.
[NEW FEATURE] #1: The function word_cor can now compute correlations for up to 200 words; the previous limit was 30.
[NEW FEATURE] #2: Users can now set their own locale in options() and view it with get_tmp_chi_locale().
[NEW FEATURE] #1: Add a new function create_ttm to generate a term-term matrix.
[BUGFIX] #1: Modify the function scancn, with no user-visible change.
[NEW FEATURE] #3: Some functions temporarily modify locale values internally.
[NEW FEATURE] #2: Add a new function topic_trend to compute the increase or decrease of topics over the years.
[NEW FEATURE] #1: Add a new function word_cor to compute word correlation.
[BUGFIX] #2: This version is compatible with package tm (>= 0.7), whereas the function corp_or_dtm in the previous version sometimes raised errors due to the update of tm.
[BUGFIX] #1: The function as.character2 is slightly modified with no user-visible change.
[NEW FEATURE] #4: The argument control in the function corp_or_dtm has a new default value "auto", which uses the control list named DEFAULT_control1 in the previous version. "auto1" also points to this list. "auto2" points to the list named DEFAULT_control2 in the previous version. DEFAULT_control1 and DEFAULT_control2 can still be used directly by users.
[NEW FEATURE] #3: The argument control in the function corp_or_dtm now differs significantly from that used by DocumentTermMatrix in package tm. Please see details in the help page of corp_or_dtm.
[NEW FEATURE] #2: The functions scancn and make_stoplist are now better able to deal with unrecognizable characters.
[NEW FEATURE] #1: User-visible changes: make_stoplist and slim_text have new arguments. The new arguments are backward compatible with the previous version.
[BUGFIX] #2: The function as.character2(x) is changed to as.character2(...), so as to coerce multiple objects at once. The same is done for as.numeric2(...). Accordingly, some other functions of the package and their documentation are also modified.
[NEW FEATURE] #1: The URL of a Chinese manual is added to "chinese.misc-package" in the English manual.
[BUGFIX] #1: The automatically created objects DEFAULT_cutter, DEFAULT_control1, and DEFAULT_control2 can now be used or modified directly by users.
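
The sparse-matrix change to sort_tf and word_cor listed above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the package's R implementation; the triplet data, the term names, and the helpers term_totals and term_correlation are all hypothetical. It assumes the matrix is stored as (document, term, count) triplets, and shows that term totals and a Pearson correlation can be computed from the nonzero cells alone, without building a dense matrix.

```python
from math import sqrt

# Hypothetical document-term matrix in triplet form: (doc index, term index, count).
# Only nonzero cells are stored, similar in spirit to a simple_triplet_matrix.
triplets = [
    (0, 0, 2), (0, 1, 1),
    (1, 0, 1), (1, 2, 3),
    (2, 1, 4), (2, 2, 1),
]
terms = ["text", "mining", "chinese"]
n_docs = 3


def term_totals(triplets, n_terms):
    """Sum each term's counts by walking the nonzero cells only."""
    totals = [0] * n_terms
    for _, j, v in triplets:
        totals[j] += v
    return totals


def term_correlation(triplets, n_docs, a, b):
    """Pearson correlation of term columns a and b, using nonzero cells only."""
    col_a, col_b = {}, {}
    for i, j, v in triplets:
        if j == a:
            col_a[i] = col_a.get(i, 0) + v
        if j == b:
            col_b[i] = col_b.get(i, 0) + v
    sum_a, sum_b = sum(col_a.values()), sum(col_b.values())
    sum_aa = sum(v * v for v in col_a.values())
    sum_bb = sum(v * v for v in col_b.values())
    # Cross products are nonzero only where both terms occur in the same document.
    sum_ab = sum(v * col_b[i] for i, v in col_a.items() if i in col_b)
    cov = sum_ab - sum_a * sum_b / n_docs
    var_a = sum_aa - sum_a ** 2 / n_docs
    var_b = sum_bb - sum_b ** 2 / n_docs
    # A term with constant counts has zero variance; this sketch does not handle that case.
    return cov / sqrt(var_a * var_b)


totals = term_totals(triplets, len(terms))
# Terms ordered by descending total frequency.
ranked = sorted(zip(terms, totals), key=lambda pair: -pair[1])
r_text_mining = term_correlation(triplets, n_docs, 0, 1)
```

The memory use of both helpers grows with the number of nonzero cells rather than with documents times terms, which is the saving the changelog entry describes.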