A multitude of character string/text/natural language processing tools: pattern searching (e.g., with 'Java'-like regular expressions or the 'Unicode' collation algorithm), random string generation, case mapping, string transliteration, concatenation, sorting, padding, wrapping, Unicode normalisation, date-time formatting and parsing, and many more. They are fast, consistent, convenient, and - owing to the use of the 'ICU' (International Components for Unicode) library - portable across all locales and platforms.
[NEW FEATURE] #30: New function stri_sub_all() - a version of stri_sub() accepting list from/to/length arguments for extracting multiple substrings from each string in a character vector.
[NEW FEATURE] #30: New function stri_sub_all<-() (and its %>%-friendly version, stri_sub_replace_all()) - for replacing multiple substrings with corresponding replacement strings.
[NEW FEATURE] In stri_sub_replace(), the value parameter has a new alias, replacement.
[NEW FEATURE] New convenience functions based on stri_remove_empty(): stri_omit_empty_na(), stri_remove_empty_na(), stri_omit_empty(), and also stri_remove_na(), stri_omit_na().
[BUGFIX] #343: stri_trans_char() did not yield correct results for overlapping pattern and replacement strings.
[WARNFIX] #205: configure.ac is now included in the source bundle.
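The multi-substring functions and the new empty/NA helpers listed above can be sketched as follows. This is only an illustration, assuming a stringi version that provides these functions is installed; the sample strings and positions are made up.

```r
library(stringi)

x <- c("spam, bacon, eggs", "hello world")

# One list element per input string; each element may hold several
# start/end positions for that string.
parts <- stri_sub_all(x, from = list(c(1, 7), 1), to = list(c(4, 11), 5))
print(parts)

# Pipe-friendly replacement of several substrings of one string at once.
y <- stri_sub_replace_all("abcdef",
                          from = list(c(1, 5)), to = list(c(2, 6)),
                          replacement = list(c("X", "Y")))
print(y)

# Drop both empty and missing strings in a single call.
kept <- stri_omit_empty_na(c("a", "", NA, "b"))
print(kept)
```

Because stri_sub_replace_all() returns the modified vector instead of assigning in place, it can be used as a step in a pipeline.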
[BACKWARD INCOMPATIBILITY] #335: A fix to #314 prevented (by design) the use of the system ICU if the library had been compiled with U_CHARSET_IS_UTF8=1. However, this is the default setting in libicu >= 61. From now on, in such cases the system ICU is used more eagerly, but stri_enc_set() issues a warning stating that the default (UTF-8) encoding cannot be changed.
[NEW FEATURE] #232: All stri_detect_* functions now have the max_count argument that allows for, e.g., stopping at the first pattern occurrence.
[NEW FEATURE] #338: stri_sub_replace() is now an alias for stri_sub<-(), which makes it much more easily pipable (@yutannihilation, @BastienFR).
[NEW FEATURE] #334: Added missing icudt61b.dat to support big-endian platforms (thanks to Dimitri John Ledkov @xnox).
[BUGFIX] #296: The out-of-the-box build used to fail on CentOS 6; ./configure has been upgraded to apply --disable-cxx11 more eagerly at an early stage.
[BUGFIX] #341: Fixed possible buffer overflows when calling strncpy() from within ICU 61.
[BUGFIX] #325: Made ./configure more portable so that it now works under /bin/dash.
[BUGFIX] #319: Fixed overflow in stri_rand_shuffle().
[BUGFIX] #337: Search functions (e.g., stri_split_regex() and stri_count_fixed()) used to raise too many warnings on empty search patterns.
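A short sketch of the max_count argument and the pipe-friendly stri_sub_replace() mentioned above, assuming a stringi version that includes them; the input vector is made up.

```r
library(stringi)

fruits <- c("apple", "banana", "avocado", "cherry")

# Default behaviour: every element is inspected.
all_hits <- stri_detect_regex(fruits, "^a")
print(all_hits)

# With max_count = 1 the search stops once one match has been found;
# according to the stringi manual, the elements that were not inspected
# are reported as NA.
first_hit <- stri_detect_regex(fruits, "^a", max_count = 1)
print(first_hit)

# stri_sub_replace() does the job of `stri_sub<-`() without an assignment,
# so it can sit in the middle of a pipeline.
replaced <- stri_sub_replace("abcde", from = 1, to = 3, value = "XY")
print(replaced)
```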
[BUGFIX] #314: Testing U_CHARSET_IS_UTF8 in ./configure when using pkg-build.
[BUILD TIME] #317: Included the icudt61l.zip in the source bundle to solve the frequent "icudt download failed" error (also on CRAN's windows-release and windows-oldrel). This was reverted in version 1.3.1; the winbuilder errors were caused by a build chain bug.
[BUGFIX] #296: Fixed the behavior of the ./configure script on CentOS 6.
[BUGFIX] Fixed broken Windows build by updating the icudt mirror list.
[GENERAL] #193: stringi is now bundled with ICU4C 61.1, which is used on most Windows and OS X builds as well as on *nix systems not equipped with ICU. However, if the C++11 support is disabled, stringi will be built against ICU4C 55.1. The update to ICU brings Unicode 10.0 support, including new emoji characters.
[BUGFIX] #288: stri_match() did not return the correct number of columns when input was empty.
[NEW FEATURE] #188: stri_enc_detect() now returns a list of data frames.
[NEW FEATURE] #289: stri_flatten() now has na_empty and omit_empty arguments.
[NEW FEATURE] New functions: stri_remove_empty(), stri_na2empty().
[NEW FEATURE] #285: Coercion from a non-trivial list (one that consists of atomic vectors, each of length 1) to an atomic vector now issues a warning.
[WARN] Removed -Wparentheses warnings in icu55/common/cstring.h:38:63 and icu55/i18n/windtfmt.cpp in the ICU4C 55.1 bundle.
icu55/i18n/winnmfmt.cpp) and suppressing important diagnostics (src/icu55/i18n/decNumber.c).
[WINDOWS SPECIFIC] #270: Strings marked with latin1 encoding are now converted internally to UTF-8 using the WINDOWS-1252 codec. This fixes problems with - among others - displaying the Euro sign.
[NEW FEATURE] #263: Added support for custom rule-based break iteration, see ?stri_opts_brkiter.
[NEW FEATURE] #267: omit_na=TRUE in stri_sub<-() now ignores missing values in any of the arguments provided.
[BUGFIX] Fixed unPROTECTed variable names and stack imbalances as reported by rchk.
[GENERAL] stringi now requires ICU4C >= 52.
[BUGFIX] Fixed errors pointed out by clang-UBSAN in stri_brkiter.h.
[GENERAL] stringi now requires R >= 2.14.
[BUILD TIME] #238, #220: Now trying standard ICU4C build flags if a call to pkg-config fails.
[BUILD TIME] #258: Use CXX11 instead of CXX1X on R >= 3.4.
[BUILD TIME, BUGFIX] #254: dir.exists() is available only in R >= 3.2.
[REMOVE DEPRECATED] stri_install_check() and stri_install_icudt() marked as deprecated in stringi 0.5-5 are no longer being exported.
[BUGFIX] #227: Incorrect behavior of stri_sub() and stri_sub<-() if the empty string was the result.
[BUILD TIME] #231: The ./configure (*NIX only) script now reads the following environment variables: STRINGI_CFLAGS, STRINGI_CPPFLAGS, STRINGI_CXXFLAGS, STRINGI_LDFLAGS, STRINGI_LIBS, STRINGI_DISABLE_CXX11, STRINGI_DISABLE_ICU_BUNDLE, STRINGI_DISABLE_PKG_CONFIG, PKG_CONFIG; see INSTALL for more information.
[BUILD TIME] #253: Call to R_useDynamicSymbols() added.
[BUILD TIME] #230: icudt is now being downloaded by ./configure (*NIX only) before building.
[BUILD TIME] #242: _COUNT/_LIMIT enum constants have been deprecated as of ICU 58.2; stringi code has been upgraded accordingly.
round(), snprintf() is not C++98.
[BUGFIX] #214: Allow a regex pattern like .* to match an empty string.
[BUGFIX] #210: stri_replace_all_fixed(c("1", "NULL"), "NULL", NA) now results in c("1", NA).
[NEW FEATURE] #199: stri_sub<-() now allows for ignoring NA locations (a new omit_na argument added).
[NEW FEATURE] #207: stri_sub<-() now allows for substring insertions (via length=0).
[NEW FUNCTION] #124: stri_subset<-() functions added.
[NEW FEATURE] #216: stri_detect(), stri_subset(), stri_subset<-() now all have the negate argument.
[NEW FUNCTION] #175: stri_join_list() concatenates all strings in a list of character vectors. Useful in conjunction with, e.g., stri_extract_all_regex(), stri_extract_all_words(), etc.
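The stri_sub<-() additions, the negate argument, and stri_join_list() from the entries above might be used like this. It is a minimal sketch with made-up inputs, assuming a stringi version that includes these features.

```r
library(stringi)

x <- "abcde"
# length = 0 selects an empty substring, so the assignment inserts text
# in front of the 3rd character instead of overwriting anything.
stri_sub(x, 3, length = 0) <- "--"
print(x)

y <- c("abc", "def")
# omit_na = TRUE leaves a string untouched when its location is missing.
stri_sub(y, c(1, NA), length = 1, omit_na = TRUE) <- "X"
print(y)

# negate = TRUE keeps the elements that do NOT contain the pattern.
no_ap <- stri_subset_fixed(c("apple", "kiwi", "grape"), "ap", negate = TRUE)
print(no_ap)

# Glue extracted words back together, one string per input element.
words <- stri_extract_all_words(c("Lorem ipsum", "dolor sit amet"))
joined <- stri_join_list(words, sep = "_")
print(joined)
```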
[GENERAL] #88: C API is now available for use in, e.g., Rcpp packages, see https://github.com/gagolews/ExampleRcppStringi for an example.
[BUGFIX] #183: Floating point exception raised in stri_sub() and stri_sub<-() when to or length was a zero-length numeric vector.
[BUGFIX] #180: stri_c() warned incorrectly (recycling rule) when using more than two elements.
[BACKWARD INCOMPATIBILITY] stri_install_check() and stri_install_icudt() are now deprecated. From now on they are supposed to be used only by the stringi installer.
[BUGFIX] #176: A patch for sys/feature_tests.h is no longer included (the original file was copyrighted by Sun Microsystems); fixed the "Compiler or options invalid for pre-UNIX 03 X/Open applications and pre-2001 POSIX applications" error by forcing (conditionally) _XPG6 conformance.
[BUGFIX] #174: stri_paste() did not generate any warning when the recycling rule was violated and sep=="".
[BUGFIX] #170: icu::setDataDirectory is no longer called if our ICU source bundle is not used (this used to cause build problems on openSUSE).
[BUILD TIME] #169: ./configure now tries to switch to the standard C++ compiler if a C++11 one is not configured correctly.
[BUILD TIME] configure.win (Biarch: TRUE) now mimics autoconf's AC_SUBST and AC_CONFIG_FILES so that the build process is now more similar across different platforms.
[NEW FEATURE] stri_info() now also gives information about which version of ICU4C is in use (system or bundle).
[BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.
[GENERAL] #69: stringi is now bundled with ICU4C 55.1.
[NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.
[NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.
[NEW FUNCTION] #8: stri_width() approximates the width of a string in a more Unicode-ish fashion than nchar(..., "width").
[NEW FEATURE] #149: stri_pad() and stri_wrap() are now (by default) based on code point widths instead of the number of code points. Moreover, the default behavior of stri_wrap() is now such that it does not get rid of non-breaking, zero width, etc., spaces.
[NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).
[NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.
[NEW FUNCTIONS] #137: Date-time formatting/parsing:
stri_timezone_list() - lists all known time zone identifiers;
stri_timezone_set(), stri_timezone_get() - manage the current default time zone;
stri_timezone_info() - basic information on a given time zone;
stri_datetime_symbols() - gives localizable date-time formatting data;
stri_datetime_fstr() - converts a strptime-like format string to an ICU date/time format string;
stri_datetime_format() - converts date/time to string;
stri_datetime_parse() - converts string to date/time object;
stri_datetime_create() - constructs date-time objects from numeric representations;
stri_datetime_now() - returns current date-time;
stri_datetime_fields() - returns date-time fields' values;
stri_datetime_add() - adds specific number of date-time units to a date-time object.
[GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations).
[GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC's implementation of strchr() and strstr(). This is very fast, e.g., on glibc using the SSE2/3/4 instruction set.
[BUILD TIME] #141: A local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.
[BUILD TIME] #165: The ./configure option --disable-icu-bundle forces the use of system ICU when building the package.
[BUGFIX] Locale specifiers are now normalized in a more intelligent way: e.g., @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.
[BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.
[BUGFIX] #132: Incorrect behavior in stri_locate_regex() for matches of zero lengths.
[BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with simplify=FALSE argument.
[BUGFIX] #164: Using libicu-dev failed on Ubuntu (LIBS shall be passed after LDFLAGS and the list of .o files).
[BUGFIX] #168: Build now fails if icudt is not available.
[BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type which is not part of the C++98 standard.
[BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.
[BUGFIX] Force ICU u_init() call on stringi dynlib load.
[BUGFIX] #157: Many overfull hboxes in the package PDF manual have been corrected.
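A few of the date-time, width, and character translation functions introduced above, as a hedged sketch. The dates and strings are made up, and the name of the Hour column is taken from the data frame returned by stri_datetime_fields() in current stringi releases.

```r
library(stringi)

# Build a date-time from numeric fields, then format it with an ICU pattern.
d <- stri_datetime_create(2015, 4, 22, tz = "UTC")
s <- stri_datetime_format(d, "yyyy-MM-dd", tz = "UTC")
print(s)

# Parse a string into a date-time object and inspect its fields.
p <- stri_datetime_parse("2015-04-22 13:45:00", "yyyy-MM-dd HH:mm:ss", tz = "UTC")
hour <- stri_datetime_fields(p, tz = "UTC")$Hour
print(hour)

# Translate a strptime-like format string into an ICU one.
print(stri_datetime_fstr("%Y-%m-%d"))

# Width-aware helpers: CJK ideographs typically occupy two columns each.
w <- stri_width("\u6f22\u5b57")
print(w)
padded <- stri_pad_left("ab", width = 5, pad = ".")
print(padded)

# chartr()-like translation of single characters.
tr <- stri_trans_char("stringi", "ri", "RI")
print(tr)
```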
[IMPORTANT CHANGE] n_max argument in stri_split_*() has been renamed n.
[IMPORTANT CHANGE] simplify=TRUE in stri_extract_all_*() and stri_split_*() now calls stri_list2matrix() with fill="". fill=NA_character_ may be obtained by using simplify=NA.
[IMPORTANT CHANGE, NEW FUNCTIONS] #120: stri_extract_words() has been renamed stri_extract_all_words(), stri_locate_boundaries() has been renamed stri_locate_all_boundaries(), and stri_locate_words() has been renamed stri_locate_all_words(). New functions are now available: stri_locate_first_boundaries(), stri_locate_last_boundaries(), stri_locate_first_words(), stri_locate_last_words(), stri_extract_first_words(), stri_extract_last_words().
[IMPORTANT CHANGE] #111: opts_regex, opts_collator, opts_fixed, and opts_brkiter can now be supplied individually via `...`. In other words, you may now simply call, e.g., stri_detect_regex(str, pattern, case_insensitive=TRUE) instead of stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE)).
[NEW FEATURE] #110: Fixed pattern search engine's settings can now be supplied via the opts_fixed argument in stri_*_fixed(), see stri_opts_fixed(). A simple (not suitable for natural language processing) yet very fast case_insensitive pattern matching can be performed now. stri_extract_*_fixed() is again available.
[NEW FEATURE] #23: stri_extract_all_fixed(), stri_count(), and stri_locate_all_fixed() may now also look for overlapping pattern matches, see ?stri_opts_fixed.
[NEW FEATURE] #129: stri_match_*_regex() gained a cg_missing argument.
[NEW FEATURE] #117: stri_extract_all_*(), stri_locate_all_*(), stri_match_all_*() gained a new argument: omit_no_match. Setting it to TRUE makes these functions compatible with their stringr equivalents.
[NEW FEATURE] #118: stri_wrap() gained indent, exdent, initial, and prefix arguments. Moreover, Knuth's dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.
[NEW FEATURE] #122: stri_subset() gained an omit_na argument.
[NEW FEATURE] stri_list2matrix() gained an n_min argument.
[NEW FEATURE] #126: stri_split() is now also able to act just like stringr::str_split_fixed().
[NEW FEATURE] #119: stri_split_boundaries() now has n, tokens_only, and simplify arguments. Additionally, stri_extract_all_words() is now equipped with simplify arg.
[NEW FEATURE] #116: stri_paste() gained a new argument: ignore_null. Setting it to TRUE makes this function more compatible with paste().
[OTHER] #123: useDynLib is used to speed up symbol look-up in the compiled dynamic library.
[BUGFIX] #114: stri_paste(): could return result in an incorrect order.
[BUGFIX] #94: Run-time errors on Solaris caused by setting -DU_DISABLE_RENAMING=1 - memory allocation errors in, among others, the ICU UnicodeString. This setting also caused some ASAN sanity check failures within ICU code.
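Several of the changes in this group (options passed via `...`, overlapping fixed-pattern matches, omit_no_match, simplify, ignore_null) can be tried out as follows. This is a minimal sketch with made-up inputs, assuming a reasonably recent stringi.

```r
library(stringi)

# Search engine options can be passed directly through `...`
short_form <- stri_detect_regex("ABC", "b", case_insensitive = TRUE)
# ... which is equivalent to the longer form:
long_form <- stri_detect_regex("ABC", "b",
                               opts_regex = stri_opts_regex(case_insensitive = TRUE))
print(c(short_form, long_form))

# Fixed-pattern search may also report overlapping matches.
n_plain   <- stri_count_fixed("abababa", "aba")
n_overlap <- stri_count_fixed("abababa", "aba", overlap = TRUE)
print(c(n_plain, n_overlap))

# omit_no_match = TRUE yields an empty vector instead of NA when nothing matches.
no_match <- stri_extract_all_regex("abc", "[0-9]", omit_no_match = TRUE)
print(no_match)

# simplify = TRUE returns a character matrix padded with "";
# simplify = NA pads with NA instead.
m_empty <- stri_split_fixed(c("a,b", "c,d,e"), ",", simplify = TRUE)
m_na    <- stri_split_fixed(c("a,b", "c,d,e"), ",", simplify = NA)
print(m_empty)
print(m_na)

# ignore_null = TRUE makes stri_paste() skip NULL arguments, as paste() does.
pasted <- stri_paste("a", NULL, "b", ignore_null = TRUE)
print(pasted)
```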
[IMPORTANT CHANGE] #87: %>% overlapped with the pipe operator from the magrittr package; now each operator like %>% has been renamed %s>%.
[IMPORTANT CHANGE] #108: Now the BreakIterator (for text boundary analysis) may be more easily controlled via stri_opts_brkiter() (see options type and locale which aim to replace now-removed boundary and locale parameters to stri_locate_boundaries(), stri_split_boundaries(), stri_trans_totitle(), stri_extract_words(), and stri_locate_words()).
[NEW FUNCTIONS] #109: stri_count_boundaries() and stri_count_words() count the number of text boundaries in a string.
[NEW FUNCTIONS] #41: stri_startswith_*() and stri_endswith_*() determine whether a string starts or ends with a given pattern.
[NEW FEATURE] #102: stri_replace_all_*() now all have the vectorize_all parameter, which defaults to TRUE for backward compatibility.
[NEW FUNCTION] #91: Added stri_subset_*() - a convenient and more efficient substitute for str[stri_detect_*(str, ...)].
[NEW FEATURE] #100: stri_split_fixed(), stri_split_charclass(), stri_split_regex(), stri_split_coll() gained a tokens_only parameter, which defaults to FALSE for backward compatibility.
[NEW FUNCTION] #105: stri_list2matrix() converts lists of atomic vectors to character matrices, useful in conjunction with stri_split() and stri_extract().
[NEW FEATURE] #107: stri_split_*() now allow setting an omit_empty=NA argument.
[NEW FEATURE] #106: stri_split() and stri_extract_all() gained a simplify argument (if TRUE, then stri_list2matrix(..., byrow=TRUE) is called on the resulting list).
[NEW FUNCTION] #77: stri_rand_lipsum() generates a (pseudo)random dummy lorem ipsum text.
[NEW FEATURE] #98: stri_trans_totitle() gained an opts_brkiter parameter; it indicates which ICU BreakIterator should be used when case mapping.
[NEW FEATURE] stri_wrap() gained a new parameter: normalize.
[BUGFIX] #86: stri_*_fixed(), stri_*_coll(), and stri_*_regex() could give incorrect results if one of the search strings was of length 0.
[BUGFIX] #99: stri_replace_all() did not use the replacement arg.
[BUGFIX] #112: Some of the objects were not PROTECTed from garbage collection - this could have led to spontaneous SEGFAULTS.
[BUGFIX] Some collator's options were not passed correctly to ICU services.
[BUGFIX] Memory leaks as detected by valgrind --tool=memcheck --leak-check=full have been removed.
[DOCUMENTATION] Significant extensions/clean ups in the stringi manual.
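The new search helpers from this group can be exercised roughly like this. It is a sketch only; the inputs are made up and the calls assume a current stringi installation.

```r
library(stringi)

x <- c("stringi", "stringr", "base")

starts <- stri_startswith_fixed(x, "string")
ends   <- stri_endswith_fixed(x, "i")
print(starts)
print(ends)

# A more convenient spelling of x[stri_detect_regex(x, "^str")].
subset <- stri_subset_regex(x, "^str")
print(subset)

n_words <- stri_count_words("The quick brown fox")
print(n_words)

# vectorize_all = FALSE applies every pattern/replacement pair,
# one after another, to each input string.
repl <- stri_replace_all_fixed("abc", c("a", "b"), c("x", "y"),
                               vectorize_all = FALSE)
print(repl)

# Turn the list produced by a split into a character matrix (one row per string).
parts <- stri_split_fixed(c("a,b", "c,d,e"), ",")
m <- stri_list2matrix(parts, byrow = TRUE)
print(m)
```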
icudt is not available.
stri_*_fixed().
[IMPORTANT CHANGE] stri_cmp*() now do not allow for passing opts_collator=NA. From now on, stri_cmp_eq(), stri_cmp_neq(), and the new operators %===%, %!==%, %stri===%, and %stri!==% are locale-independent operations, which are based on code point comparisons. New functions stri_cmp_equiv() and stri_cmp_nequiv() (and from now on also %==%, %!=%, %stri==%, and %stri!=%) test for canonical equivalence.
[IMPORTANT CHANGE] stri_*_fixed() search functions now perform a locale-independent exact (byte-wise, of course after conversion to UTF-8) pattern search. All the Collator-based, locale-dependent search routines are now available via stri_*_coll(). The reason behind this is that ICU's USearch currently has very poor performance. What is more, in many search tasks exact pattern matching is sufficient anyway.
[GENERAL] stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm which improves the search performance drastically.
[IMPORTANT CHANGE] stri_enc_nf*() and stri_enc_isnf*() function families have been renamed stri_trans_nf*() and stri_trans_isnf*(), respectively -- they deal with text transforming, and not with character encoding. Note that all of these may be performed by ICU's Transliterator too (see below).
[NEW FUNCTION] stri_trans_general() and stri_trans_list() give access to ICU's Transliterator: they may be used to perform some generic text transforms, like Unicode normalization, case folding, etc.
[NEW FUNCTION] stri_split_boundaries() uses ICU's BreakIterator to split strings at specific text boundaries. Moreover, stri_locate_boundaries() indicates positions of these boundaries.
[NEW FUNCTION] stri_extract_words() uses ICU's BreakIterator to extract all words from a text. Additionally, stri_locate_words() locates start and end positions of words in a text.
[NEW FUNCTION] stri_pad(), stri_pad_left(), stri_pad_right(), and stri_pad_both() pad a string with a specific code point.
[NEW FUNCTION] stri_wrap() breaks paragraphs of text into lines. Two algorithms (greedy and minimal raggedness) are available.
[IMPORTANT CHANGE] stri_*_charclass() search functions now rely solely on ICU's UnicodeSet patterns. All the previously accepted charclass identifiers became invalid. However, new patterns should now be more familiar to the users (they are regex-like). Moreover, we observe a very nice performance gain.
[IMPORTANT CHANGE] stri_sort() now does not include NAs in output vectors by default, for compatibility with sort(). Moreover, currently none of the input vector's attributes are preserved.
[NEW FUNCTION] stri_unique() extracts unique elements from a character vector.
[NEW FUNCTIONS] stri_duplicated() and stri_duplicated_any() determine duplicate elements in a character vector.
[NEW FUNCTION] stri_replace_na() replaces NAs in a character vector with a given string, useful for emulating, e.g., R's paste() behavior.
[NEW FUNCTION] stri_rand_shuffle() generates a random permutation of code points in a string.
[NEW FUNCTION] stri_rand_strings() generates random strings.
[NEW FUNCTIONS] New functions and binary operators for string comparison: stri_cmp_eq(), stri_cmp_neq(), stri_cmp_lt(), stri_cmp_le(), stri_cmp_gt(), stri_cmp_ge(), %==%, %!=%, %<%, %<=%, %>%, %>=%.
[NEW FUNCTION] stri_enc_mark() reads declared encodings of character strings as seen by stringi.
[NEW FUNCTION] stri_enc_tonative(str) is an alias to stri_encode(str, NULL, NULL).
[NEW FEATURE] stri_order() and stri_sort() now have an additional argument na_last (defaults to TRUE and NA, respectively).
[NEW FEATURE] stri_replace_all_charclass(), stri_extract_all_charclass(), and stri_locate_all_charclass() now have a new argument, merge (defaults to FALSE for backward-compatibility). It may be used to, e.g., replace sequences of white spaces with a single space.
[NEW FEATURE] stri_enc_toutf8() now has a new validate arg (defaults to FALSE for backward-compatibility). It may be used in a (rare) case where a user wants to fix an invalid UTF-8 byte sequence. stri_length() (among others) now detects invalid UTF-8 byte sequences.
[NEW FEATURE] All binary operators %???% now also have aliases %stri???%.
[GENERAL] Performance improvements in StriContainerUTF8 and StriContainerUTF16 (they affect most other functions).
[GENERAL] Significant performance improvements in stri_join(), stri_flatten(), stri_cmp(), stri_trans_to*(), and others.
[GENERAL] Added 3rd mirror site for our icudt binary distribution.
U_MISSING_RESOURCE_ERROR message in StriException now suggests calling stri_install_check().
[BUGFIX] UTF-8 BOMs are now silently removed from input strings.
[BUGFIX] No more attempts to re-encode UTF-8 encoded strings if native encoding is UTF-8 in StriContainerUTF8.
[BUGFIX] Possible memory leaks when throwing errors via Rf_error().
[BUGFIX] stri_order() and stri_cmp() could return incorrect results for opts_collator=NA.
[BUGFIX] stri_sort()
did not guarantee to return strings in UTF-8.
LICENSE tweaks.
Initial CRAN release.
Fixed bugs detected with ASAN and UBSAN, e.g., fixed CharClass::gcmask type (enum -> uint32_t) (reported by UBSAN).
Fixed array over-runs detected with valgrind in string8.h.
Fixed uninitialized class fields in StriContainerUTF8 (reported by valgrind).
License changed to BSD-3-clause, COPYRIGHTS updated.
icudt is not shipped with stringi anymore; it is now downloaded in install.libs.R from one of our servers.
New functions: stri_install_check(), stri_install_icudt().
System ICU is used on systems which do have one (version >= 50 needed).
ICU is autodetected with pkg-config in ./configure.
Pass '--disable-pkg-config' to ./configure to force building ICU from sources.
icudt52b (custom subset) is now shipped with stringi (for big-endian, ASCII systems).
ICU4C 52.1 sources included (common, i18n, stubdata + icu52dt.dat loaded dynamically). Compilation via Makevars.
stringi now does not depend on any external libraries.
ICU4C is now statically linked on Windows.
First OS X binary build.
The package is being intensively tested by our students @ FMIS WUT.
pkg-config via ./configure to look for ICU4C libs.
First Windows binary build.
Compilation passed on Oracle Sun Studio compiler collection.
By now we have implemented most of the functionality scheduled for milestone 0.1.