Character String Processing Facilities

Allows for fast, correct, consistent, portable, as well as convenient character string/text processing in every locale and any native encoding. Owing to the use of the ICU library, the package provides R users with platform-independent functions known to Java, Perl, Python, PHP, and Ruby programmers. Among available features there are: pattern searching (e.g., with ICU Java-like regular expressions or the Unicode Collation Algorithm), random string generation, case mapping, string transliteration, concatenation, Unicode normalization, date-time formatting and parsing, etc.


               stringi package NEWS and CHANGELOG


  • [REGEX] #147: regex look-behind assertions may fail to find a multibyte Unicode search pattern [solved in ICU4C 57m1, see]

  • [BUGFIX] round() and snprintf() are not part of C++98.

  • [BUGFIX] #214: allow a regex pattern like .* to match an empty string.

  • [BUGFIX] #210: stri_replace_all_fixed(c("1", "NULL"), "NULL", NA) now results in c("1", NA).

  • [NEW FEATURE] #199: stri_sub<- now allows for ignoring NA locations (a new omit_na argument added).

  • [NEW FEATURE] #207: stri_sub<- now allows for substring insertions (via length=0).

  • [NEW FUNCTION] #124: stri_subset<- functions added.

  • [NEW FEATURE] #216: stri_detect, stri_subset, stri_subset<- gained a negate argument.

  • [NEW FUNCTION] #175: stri_join_list concatenates all strings in a list of character vectors. Useful with, e.g., stri_extract_all_regex, stri_extract_all_words etc.

  • [GENERAL] #88: C API is now available for use in, e.g., Rcpp packages, see for an example.

  • [BUGFIX] #183: Floating point exception raised in stri_sub() and stri_sub<-() when to or length was a zero-length numeric vector.

  • [BUGFIX] #180: stri_c() warned incorrectly (recycling rule) when using more than 2 elements.

  • [BACKWARD INCOMPATIBILITY] stri_install_check() and stri_install_icudt() are now deprecated. From now on they are supposed to be used only by the stringi installer.

  • [BUGFIX] #176: a patch for sys/feature_tests.h is no longer included (the original file was copyrighted by Sun Microsystems); fixed the "Compiler or options invalid for pre-UNIX 03 X/Open applications and pre-2001 POSIX applications" error by forcing (conditionally) _XPG6 conformance.

  • [BUGFIX] #174: stri_paste() did not generate any warning when the recycling rule was violated and sep=="".

  • [BUGFIX] #170: icu::setDataDirectory no longer called if our ICU source bundle is not used (this used to cause build problems on openSUSE).

  • [BUILD TIME] #169: ./configure now tries to switch to the "standard" C++ compiler if a C++11 one is not properly configured.

  • [BUILD TIME] (Biarch: TRUE) now mimics autoconf's AC_SUBST and AC_CONFIG_FILES so that the build process is now more similar across different platforms.

  • [NEW FEATURE] stri_info() now also gives information about which version of ICU4C is in use (system or bundle).

  • [BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.

  • [GENERAL] #69: stringi is now bundled with ICU4C 55.1.

  • [NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.

  • [NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.

  • [NEW FUNCTION] #8: stri_width() approximates the width of a string in a more Unicodish fashion than nchar(..., "width").

  • [NEW FEATURE] #149: stri_pad() and stri_wrap() are now by default based on code point widths instead of the number of code points. Moreover, by default stri_wrap() no longer removes non-breaking, zero-width, and similar spaces.

  • [NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).

  • [NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.

  • [NEW FUNCTIONS] #137: date-time formatting/parsing:

    • stri_timezone_list() - lists all known time zone identifiers
    • stri_timezone_set(), stri_timezone_get() - manage current default time zone
    • stri_timezone_info() - basic information on a given time zone
    • stri_datetime_symbols() - localizable date-time formatting data
    • stri_datetime_fstr() - convert a strptime-like format string to an ICU date/time format string
    • stri_datetime_format() - convert date/time to string
    • stri_datetime_parse() - convert string to date/time object
    • stri_datetime_create() - construct date-time objects from numeric representations
    • stri_datetime_now() - return current date-time
    • stri_datetime_fields() - get values for date-time fields
    • stri_datetime_add() - add specific number of date-time units to a date-time object
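
As a rough illustration of how the date-time functions listed above fit together, here is a minimal sketch. It assumes a stringi build with ICU data available; the specific date and the pattern strings are made up for the example and are not taken from this file.

```r
library(stringi)

# Construct a date-time object from numeric fields
# (interpreted in the current default time zone).
t <- stri_datetime_create(2015, 1, 2)

# Format it using an ICU date/time pattern.
s <- stri_datetime_format(t, "yyyy-MM-dd")
print(s)

# Convert a strptime-like format string to an ICU pattern,
# then use that pattern for parsing.
fmt <- stri_datetime_fstr("%Y-%m-%d")
p <- stri_datetime_parse("2015-01-02", fmt)
print(stri_datetime_fields(p)$Year)

# Add a number of date-time units (here: one month).
t2 <- stri_datetime_add(t, 1, units = "months")
print(stri_datetime_format(t2, "yyyy-MM-dd"))
```

Formatting and parsing both use the default time zone here, so the calendar date should round-trip; pass tz explicitly if that assumption does not hold.
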
  • [GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations)

  • [GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC's implementation of strchr() and strstr(). This is very fast e.g. on glibc utilizing the SSE2/3/4 instruction set.

  • [BUILD TIME] #141: a local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.

  • [BUILD TIME] #165: the ./configure option --disable-icu-bundle forces the use of system ICU when building the package.

  • [BUGFIX] locale specifiers are now normalized in a more intelligent way: e.g. @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.

  • [BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.

  • [BUGFIX] #132: incorrect behavior in stri_locate_regex() for zero-length matches.

  • [BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with simplify=FALSE argument.

  • [BUGFIX] #164: libicu-dev usage used to fail on Ubuntu (LIBS shall be passed after LDFLAGS and the list of .o files).

  • [BUGFIX] #168: Build now fails if icudt is not available.

  • [BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type which is not part of the C++98 standard.

  • [BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.

  • [BUGFIX] Force ICU u_init() call on stringi dynlib load.

  • [BUGFIX] #157: many overfull hboxes in the package PDF manual have been corrected.

  • [IMPORTANT CHANGE] n_max argument in stri_split_*() has been renamed n.

  • [IMPORTANT CHANGE] simplify=FALSE in stri_extract_all_*() and stri_split_*() now calls stri_list2matrix() with fill="". fill=NA_character_ may be obtained by using simplify=NA.

  • [IMPORTANT CHANGE, NEW FUNCTIONS] #120: stri_extract_words has been renamed stri_extract_all_words, stri_locate_boundaries has been renamed stri_locate_all_boundaries, and stri_locate_words has been renamed stri_locate_all_words. New functions are now available: stri_locate_first_boundaries, stri_locate_last_boundaries, stri_locate_first_words, stri_locate_last_words, stri_extract_first_words, stri_extract_last_words.

  • [IMPORTANT CHANGE] #111: opts_regex, opts_collator, opts_fixed, and opts_brkiter can now be supplied individually via .... In other words, you may now simply call e.g. stri_detect_regex(str, pattern, case_insensitive=TRUE) instead of stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE)).

  • [NEW FEATURE] #110: Fixed pattern search engine's settings can now be supplied via opts_fixed argument in stri_*_fixed(), see stri_opts_fixed(). A simple (not suitable for natural language processing) yet very fast case_insensitive pattern matching can be performed now. stri_extract_*_fixed is again available.

  • [NEW FEATURE] #23: stri_extract_all_fixed, stri_count, and stri_locate_all_fixed may now also look for overlapping pattern matches, see ?stri_opts_fixed.

  • [NEW FEATURE] #129: stri_match_*_regex gained a cg_missing argument.

  • [NEW FEATURE] #117: stri_extract_all_*(), stri_locate_all_*(), stri_match_all_*() gained a new argument: omit_no_match. Setting it to TRUE makes these functions compatible with their stringr equivalents.

  • [NEW FEATURE] #118: stri_wrap() gained indent, exdent, initial, and prefix arguments. Moreover Knuth's dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.

  • [NEW FEATURE] #122: stri_subset() gained an omit_na argument.

  • [NEW FEATURE] stri_list2matrix() gained an n_min argument.

  • [NEW FEATURE] #126: stri_split() now is also able to act just like stringr::str_split_fixed().

  • [NEW FEATURE] #119: stri_split_boundaries() now has n, tokens_only, and simplify arguments. Additionally, stri_extract_all_words() is now equipped with a simplify argument.

  • [NEW FEATURE] #116: stri_paste() gained a new argument: ignore_null. Setting it to TRUE makes this function more compatible with paste().

  • [OTHER] #123: useDynLib is used to speed up symbol look-up in the compiled dynamic library.

  • [BUGFIX] #114: stri_paste() could return results in an incorrect order.

  • [BUGFIX] #94: Run-time errors on Solaris caused by setting -DU_DISABLE_RENAMING=1 -- memory allocation errors in i.a. ICU's UnicodeString. This setting also caused some ABSan sanity check failures within ICU code.

  • [IMPORTANT CHANGE] #87: %>% overlapped with the pipe operator from the magrittr package; now each operator like %>% has been renamed %s>%.

  • [IMPORTANT CHANGE] #108: Now the BreakIterator (for text boundary analysis) may be better controlled via stri_opts_brkiter() (see options type and locale which aim to replace now-removed boundary and locale parameters to stri_locate_boundaries, stri_split_boundaries, stri_trans_totitle, stri_extract_words, stri_locate_words).

  • [NEW FUNCTIONS] #109: stri_count_boundaries() and stri_count_words() count the number of text boundaries in a string.

  • [NEW FUNCTIONS] #41: stri_startswith_*() and stri_endswith_*() determine whether a string starts or ends with a given pattern.

  • [NEW FEATURE] #102: stri_replace_all_*() gained a vectorize_all parameter, which defaults to TRUE for backward compatibility.

  • [NEW FUNCTION] #91: stri_subset_*(), a convenient and more efficient substitute for str[stri_detect_*(str, ...)], added.

  • [NEW FEATURE] #100: stri_split_fixed(), stri_split_charclass(), stri_split_regex(), stri_split_coll() gained a tokens_only parameter, which defaults to FALSE for backward compatibility.

  • [NEW FUNCTION] #105: stri_list2matrix() converts lists of atomic vectors to character matrices, useful in connection with stri_split and stri_extract.

  • [NEW FEATURE] #107: stri_split_*() now allow setting an omit_empty=NA argument.

  • [NEW FEATURE] #106: stri_split() and stri_extract_all() gained a simplify argument (if TRUE, then stri_list2matrix(..., byrow=TRUE) is called on the resulting list).

  • [NEW FUNCTION] #77: stri_rand_lipsum() generates (pseudo)random dummy lorem ipsum text.

  • [NEW FEATURE] #98: stri_trans_totitle() gained an opts_brkiter parameter; it indicates which ICU BreakIterator should be used when performing case mapping.

  • [NEW FEATURE] stri_wrap() gained a new parameter: normalize.

  • [BUGFIX] #86: stri_*_fixed(), stri_*_coll(), and stri_*_regex() could give incorrect results if one of the search strings was of length 0.

  • [BUGFIX] #99: stri_replace_all() did not use the replacement arg.

  • [BUGFIX] #112: Some of the objects were not PROTECTed from being garbage collected, which might have caused spontaneous SEGFAULTS.

  • [BUGFIX] Some collator options were not passed correctly to ICU services.

  • [BUGFIX] Causes of memory leaks detected by valgrind --tool=memcheck --leak-check=full have been removed.

  • [DOCUMENTATION] Significant extensions/clean ups in the stringi manual.

  • icudt-dependent examples are no longer run if icudt is not available.

  • [BUGFIX] Issues with loading of misaligned addresses in stri_*_fixed().

  • [IMPORTANT CHANGE] stri_cmp*() no longer allow passing opts_collator=NA. From now on, stri_cmp_eq, stri_cmp_neq, and the new operators %===%, %!==%, %stri===%, and %stri!==% are locale-independent operations, which are based on code point comparisons. New functions stri_cmp_equiv and stri_cmp_nequiv (and from now on also %==%, %!=%, %stri==%, and %stri!=%) test for canonical equivalence.

  • [IMPORTANT CHANGE] stri_*_fixed() search functions now perform a locale-independent exact (byte-wise, of course after conversion to UTF-8) pattern search. All the Collator-based, locale-dependent search routines are now available via stri_*_coll(). The reason for this is that ICU USearch has currently very poor performance and in many search tasks in fact it is sufficient to do exact pattern matching.

  • [GENERAL] stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm, which improves the search performance drastically.

  • [IMPORTANT CHANGE] stri_enc_nf*() and stri_enc_isnf*() function families have been renamed to stri_trans_nf*() and stri_trans_isnf*(), respectively. This is because they deal with text transforming, and not with character encoding. Moreover, all such operations may also be performed by ICU's Transliterator (see below).

  • [NEW FUNCTION] stri_trans_general() and stri_trans_list() give access to ICU's Transliterator: may be used to perform very general text transforms.

  • [NEW FUNCTION] stri_split_boundaries() utilizes ICU's BreakIterator to split strings at specific text boundaries. Moreover, stri_locate_boundaries indicates positions of these boundaries.

  • [NEW FUNCTION] stri_extract_words() uses ICU's BreakIterator to extract all words from a text. Additionally, stri_locate_words locates start and end positions of words in a text.

  • [NEW FUNCTION] stri_pad(), stri_pad_left(), stri_pad_right(), and stri_pad_both() pad a string with a specific code point.

  • [NEW FUNCTION] stri_wrap() breaks paragraphs of text into lines. Two algorithms (greedy and minimal raggedness) are available.

  • [IMPORTANT CHANGE] stri_*_charclass() search functions now rely solely on ICU's UnicodeSet patterns. All previously accepted charclass identifiers became invalid. However, new patterns should now be more familiar to the users (they are regex-like). Moreover, we observe a very nice performance gain.

  • [IMPORTANT CHANGE] stri_sort() now does not include NAs in output vectors by default, for compatibility with sort(). Moreover, currently none of the input vector's attributes are preserved.

  • [NEW FUNCTION] stri_unique() extracts unique elements from a character vector.

  • [NEW FUNCTIONS] stri_duplicated() and stri_duplicated_any() determine duplicate elements in a character vector.

  • [NEW FUNCTION] stri_replace_na() replaces NAs in a character vector with a given string, useful for emulating e.g. R's paste() behavior.

  • [NEW FUNCTION] stri_rand_shuffle() generates a random permutation of code points in a string.

  • [NEW FUNCTION] stri_rand_strings() generates random strings.

  • [NEW FUNCTIONS] New functions and binary operators for string comparison: stri_cmp_eq(), stri_cmp_neq(), stri_cmp_lt(), stri_cmp_le(), stri_cmp_gt(), stri_cmp_ge(), %==%, %!=%, %<%, %<=%, %>%, %>=%.

  • [NEW FUNCTION] stri_enc_mark() reads declared encodings of character strings as seen by stringi.

  • [NEW FUNCTION] stri_enc_tonative(str) is an alias to stri_encode(str, NULL, NULL).

  • [NEW FEATURE] stri_order() and stri_sort() now have an additional argument na_last (defaults to TRUE and NA, respectively).

  • [NEW FEATURE] stri_replace_all_charclass(), stri_extract_all_charclass(), and stri_locate_all_charclass() now have a new arg, merge (defaults to FALSE for backward-compatibility). It may be used to e.g. replace sequences of white spaces with a single space.

  • [NEW FEATURE] stri_enc_toutf8() now has a new validate arg (defaults to FALSE for backward-compatibility). It may be used in a (rare) case in which a user wants to fix an invalid UTF-8 byte sequence. stri_length() (among others) now detects invalid UTF-8 byte sequences.

  • [NEW FEATURE] All binary operators %???% now also have aliases %stri???%.

  • [GENERAL] Performance improvements in StriContainerUTF8 and StriContainerUTF16 (they affect most other functions).

  • [GENERAL] Significant performance improvements in stri_join(), stri_flatten(), stri_cmp(), stri_trans_to*(), and others.

  • [GENERAL] Added 3rd mirror site for our icudt binary distribution.

  • U_MISSING_RESOURCE_ERROR message in StriException now suggests calling stri_install_check().

  • [BUGFIX] UTF-8 BOMs are now silently removed from input strings.

  • [BUGFIX] no more attempts to re-encode UTF-8 encoded strings if native encoding=UTF-8 in StriContainerUTF8.

  • [BUGFIX] possible memory leaks when throwing errors via Rf_error().

  • [BUGFIX] stri_order() and stri_cmp() could return incorrect results for opts_collator=NA.

  • [BUGFIX] stri_sort() did not guarantee to return strings in UTF-8.

  • LICENSE tweaks.

  • Initial CRAN release.

  • Fixed bugs detected with ASan and UBSan, e.g. fixed CharClass::gcmask type (enum -> uint32_t) (reported by UBSan).

  • Fixed array over-runs detected with valgrind in string8.h.

  • Fixed uninitialized class fields in StriContainerUTF8 (reported by valgrind).

  • License changed to BSD-3-clause, COPYRIGHTS updated.

  • icudt is not shipped with stringi anymore; it is now downloaded in install.libs.R from one of our servers.

  • New functions: stri_install_check(), stri_install_icudt().

  • System ICU is used on systems which do have one (version >= 50 needed). ICU is autodetected with pkg-config in ./configure. Pass '--disable-pkg-config' to ./configure to force building ICU from sources.

  • icudt52b (custom subset) is now shipped with stringi (for big-endian, ASCII systems).

  • Fixed some Solaris-related issues while preparing stringi for CRAN submission.

  • ICU4C 52.1 sources included (common, i18n, stubdata + icu52dt.dat loaded dynamically). Compilation via Makevars.

  • stringi now does not depend on any external libraries.

  • ICU4C is now statically linked on Windows.

  • First OS X binary build.

  • The package is being intensively tested by our students @ FMIS WUT.

  • Using pkg-config via ./configure to look for ICU4C libs.

  • First Windows binary build.

  • Compilation passed on Oracle Sun Studio compiler collection.

  • By now we have implemented most of the functionality scheduled for milestone 0.1.

  • The stringi project has been established on GitHub.
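
A few of the user-facing changes recorded in this file can be tried out with the sketch below. It assumes a stringi version that includes all of the features mentioned (issue numbers refer to the entries above); the input strings are made up for illustration only.

```r
library(stringi)

# #111: search engine options may be passed directly via `...`
a <- stri_detect_regex("ABC", "abc", case_insensitive = TRUE)
b <- stri_detect_regex("ABC", "abc",
                       opts_regex = stri_opts_regex(case_insensitive = TRUE))
print(identical(a, b))

# #216: negate inverts the result of stri_detect_*()
neg <- stri_detect_fixed(c("apple", "kiwi"), "a", negate = TRUE)
print(neg)

# #207: length = 0 in stri_sub<- inserts text instead of replacing it
x <- "abcde"
stri_sub(x, 3, length = 0) <- "XX"
print(x)

# #175: stri_join_list() concatenates the strings within each list element
joined <- stri_join_list(list(c("a", "b"), "c"), sep = "-")
print(joined)

# #87: comparison operators carry the %s...% prefix
lt <- "a" %s<% "b"
print(lt)
```

The two stri_detect_regex() calls are meant to be interchangeable; the shorter form is simply more convenient when only one or two options are set.
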

