A consistent, simple and easy-to-use set of wrappers around the fantastic 'stringi' package. All function and argument names (and positions) are consistent, all functions deal with NAs and zero-length vectors in the same way, and the output from one function is easy to feed into the input of another.
Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible. If you're not familiar with strings, the best place to start is the chapter on strings in R for Data Science.
stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr focusses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine. If you find that stringr is missing a function that you need, try looking in stringi. Both packages share similar conventions, so once you've mastered stringr, you should find stringi similarly easy to use.
install.packages("stringr")

# Install the cutting edge development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/stringr")
All functions in stringr start with str_ and take a vector of strings as the first argument.
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x)
#> [1] 3 5 5 5 4 9
str_c(x, collapse = ", ")
#> [1] "why, video, cross, extra, deal, authority"
str_sub(x, 1, 2)
#> [1] "wh" "vi" "cr" "ex" "de" "au"
Most string functions work with regular expressions, a concise language for describing patterns of text. For example, the regular expression "[aeiou]" matches any single character that is a vowel:
str_subset(x, "[aeiou]")
#> [1] "video"     "cross"     "extra"     "deal"      "authority"
str_count(x, "[aeiou]")
#> [1] 0 3 1 2 2 4
There are eight main verbs that work with patterns:
str_detect(x, pattern) tells you if there's any match to the pattern.
str_detect(x, "[aeiou]")
#> [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
str_count(x, pattern) counts the number of matches.
str_count(x, "[aeiou]")
#> [1] 0 3 1 2 2 4
str_subset(x, pattern) extracts the matching components.
str_subset(x, "[aeiou]")
#> [1] "video"     "cross"     "extra"     "deal"      "authority"
str_locate(x, pattern) gives the position of the match.
str_locate(x, "[aeiou]")
#>      start end
#> [1,]    NA  NA
#> [2,]     2   2
#> [3,]     3   3
#> [4,]     1   1
#> [5,]     2   2
#> [6,]     1   1
str_extract(x, pattern) extracts the text of the match.
str_extract(x, "[aeiou]")
#> [1] NA  "i" "o" "e" "e" "a"
str_match(x, pattern) extracts parts of the match defined by parentheses.
# extract the characters on either side of the vowel
str_match(x, "(.)[aeiou](.)")
#>      [,1]  [,2] [,3]
#> [1,] NA    NA   NA
#> [2,] "vid" "v"  "d"
#> [3,] "ros" "r"  "s"
#> [4,] NA    NA   NA
#> [5,] "dea" "d"  "a"
#> [6,] "aut" "a"  "t"
str_replace(x, pattern, replacement) replaces the matches with new text.
str_replace(x, "[aeiou]", "?")
#> [1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"
str_split(x, pattern) splits up a string into multiple pieces.
str_split(c("a,b", "c,d,e"), ",")
#> [[1]]
#> [1] "a" "b"
#>
#> [[2]]
#> [1] "c" "d" "e"
As well as regular expressions (the default), there are three other pattern matching engines:
fixed(): match exact bytes
coll(): match human letters
boundary(): match boundaries
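The three alternative engines can be compared side by side. The following is a minimal sketch, not taken from the original text; the input strings and variable names are made up for illustration.

```r
library(stringr)

# Regex engine (the default): "." is a wildcard that matches any character
dot_regex <- str_detect(c("a.b", "axb"), ".")

# fixed(): the pattern is compared literally, so only a real "." matches
dot_fixed <- str_detect(c("a.b", "axb"), fixed("."))

# coll(): comparison follows locale-aware collation rules;
# ignore_case = TRUE makes the comparison case-insensitive
case_coll <- str_detect(c("Apple", "apple"), coll("APPLE", ignore_case = TRUE))

# boundary(): match on boundaries, here the boundaries between words
words <- str_split("The quick brown fox", boundary("word"))[[1]]

dot_regex
dot_fixed
case_coll
words
```

Wrapping the pattern in one of these helpers is the only change needed; the verb (str_detect(), str_split(), and so on) stays the same.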
R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R.
Uses consistent function and argument names. The first argument is always the vector of strings to modify, which makes stringr work particularly well in conjunction with the pipe:
letters %>%
  .[1:10] %>%
  str_pad(3, "right") %>%
  str_c(letters[2:11])
#>  [1] "a  b" "b  c" "c  d" "d  e" "e  f" "f  g" "g  h" "h  i" "i  j" "j  k"
Simplifies string operations by eliminating options that you don't need 95% of the time.
Produces outputs that can easily be used as inputs. This includes ensuring that missing inputs result in missing outputs, and zero length inputs result in zero length outputs.
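The handling of missing and zero length inputs can be checked directly. This is a small sketch with made-up inputs; the variable names are only for illustration.

```r
library(stringr)

x <- c("apple", NA)

# Missing inputs give missing outputs, element by element
len_na    <- str_length(x)        # length of "apple", then NA
detect_na <- str_detect(x, "a")   # TRUE for "apple", then NA
sub_na    <- str_sub(x, 1, 2)     # "ap", then NA

# Zero length inputs give zero length outputs
len_empty    <- str_length(character())
detect_empty <- str_detect(character(), "a")

len_na
detect_na
sub_na
length(len_empty)
length(detect_empty)
```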
str_match_all() now returns NA if an optional group doesn't match (previously it returned ""). This is more consistent with str_match() and other match failures (#134).
replacement can now be a function that is called once for each match and whose return value is used to replace the match.
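As a rough illustration of a function-valued replacement, the sketch below passes functions to str_replace(). The helper `wrap_match` is hypothetical and exists only for this example.

```r
library(stringr)

x <- c("video", "cross")

# The function receives the matched text and returns its replacement
upper_first_vowel <- str_replace(x, "[aeiou]", toupper)

# Hypothetical helper: surround each match with angle brackets
wrap_match <- function(m) str_c("<", m, ">")
wrapped <- str_replace(x, "[aeiou]", wrap_match)

upper_first_vowel
wrapped
```

str_replace() only touches the first match in each string, so one call to the function is made per input element here.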
A new vignette (vignette("regular-expressions")) describes the details of the regular expressions supported by stringr.
The main vignette (vignette("stringr")) has been updated to give a high-level overview of the package.
str_sort() gains an explicit numeric argument for sorting mixed numbers and strings.
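The effect of the numeric argument is easiest to see on strings that embed numbers. A minimal sketch, with file names invented for the example:

```r
library(stringr)

files <- c("file10", "file2", "file1")

# Default: digits are compared character by character, so "10" sorts before "2"
sorted_default <- str_sort(files)

# numeric = TRUE: runs of digits are compared as numbers
sorted_numeric <- str_sort(files, numeric = TRUE)

sorted_default
sorted_numeric
```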
str_replace_all() now throws an error if replacement is not a character vector. If replacement is NA_character_, it replaces the complete string with NA.
All functions that take a locale default to "en" (English) to ensure that the default is consistent across platforms.
Add sample datasets:
coll() now throw an error if you use them with
anything other than a plain string (#60). I've clarified that the replacement
boundary() has improved defaults when splitting on non-word boundaries (#58, @lmullen).
str_detect() can now detect boundaries (by checking for a str_count() > 0). str_subset() works similarly.
str_extract_all() now work with
boundary(). This is
particularly useful if you want to extract logical constructs like words
str_extract_all() respects the
when used with
str_subset() now respects custom options for
str_replace_all() now behave correctly when a
replacement string contains
\\\\1, etc. (#83, #99).
str_split() gains a
simplify argument to match
str_view_all() create HTML widgets that display regular
expression matches (#96).
NA for indexes greater than number of words (#112).
stringr is now powered by stringi instead of base R regular expressions. This improves Unicode support and makes most operations considerably faster. If you find stringr inadequate for your string processing needs, I highly recommend looking at stringi in more detail.
stringr gains a vignette, currently a straightforward update of the article that appeared in the R Journal.
str_c() now returns a zero length vector if any of its inputs are zero length vectors. This is consistent with all other functions, and standard R recycling rules. Similarly, using str_c("x", NA) now yields NA. If you want "xNA", use str_replace_na() on the inputs.
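The str_c() behaviour described here can be sketched as follows. The variable names are illustrative only.

```r
library(stringr)

# A missing input makes the combined result missing
joined_na <- str_c("x", NA_character_)

# str_replace_na() turns NA into the string "NA" before joining
joined_text <- str_c("x", str_replace_na(NA_character_))

# A zero length input gives a zero length result
joined_empty <- str_c("x", character())

joined_na
joined_text
length(joined_empty)
```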
str_replace_all() gains a convenient syntax for applying multiple pairs of
pattern and replacement to the same vector:
input <- c("abc", "def")
str_replace_all(input, c("[ad]" = "!", "[cf]" = "?"))
str_match() now returns NA if an optional group doesn't match
(previously it returned ""). This is more consistent with
and other match failures.
str_subset() keeps values that match a pattern. It's a convenient wrapper for x[str_detect(x)] (#21, @jiho).
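For inputs without missing values, the two spellings give the same result. A short sketch, using an example vector chosen for illustration:

```r
library(stringr)

x <- c("why", "video", "cross")

# Keep the elements that contain a vowel
via_subset <- str_subset(x, "[aeiou]")

# The longer spelling: build a logical index, then subset
via_detect <- x[str_detect(x, "[aeiou]")]

via_subset
identical(via_subset, via_detect)
```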
str_sort() allows you to sort and order strings in a specified locale.
str_conv() to convert strings from specified encoding to UTF-8.
boundary() allows you to count, locate and split by
character, word, line and sentence boundaries.
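Counting by boundary type might look like the sketch below. The sample sentence is invented for the example.

```r
library(stringr)

txt <- "Hello world. How are you?"

# Number of words and sentences in the text
n_words     <- str_count(txt, boundary("word"))
n_sentences <- str_count(txt, boundary("sentence"))

# Number of characters in a short string
n_chars <- str_count("abc", boundary("character"))

n_words
n_sentences
n_chars
```

The same boundary() patterns can be passed to str_locate_all() or str_split() to find or cut at those positions instead of counting them.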
The documentation got a lot of love, and very similar functions (e.g. first and all variants) are now documented together. This should hopefully make it easier to locate the function you need.
ignore.case(x) has been deprecated in favour of
fixed|regex|coll(x, ignore.case = TRUE),
perl(x) has been deprecated in
str_join() is deprecated, please use
fixed path in
str_wrap example so works for more R installations.
remove dependency on plyr
Zero input to
str_split_fixed returns 0 row matrix with
perl that switches to Perl regular expressions
str_match now uses new base function
regmatches to extract matches -
this should hopefully be faster than my previous pure R algorithm
str_wrap function which gives
strwrap output in a more convenient
word function extract words from a string given user defined
separator (thanks to suggestion by David Cooper)
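A brief sketch of word() with the default and a user-defined separator. The input strings are made up for illustration.

```r
library(stringr)

sentence <- "The quick brown fox"

# Second word, using the default separator (a single space)
second <- word(sentence, 2)

# First through second word
first_two <- word(sentence, 1, 2)

# A user-defined separator, matched literally via fixed()
middle <- word("a-b-c", 2, sep = fixed("-"))

second
first_two
middle
```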
str_locate now returns consistent type when matching empty string (thanks
to Stavros Macrakis)
str_count counts number of matches in a string.
str_trim receive performance tweaks - for large vectors this
should give at least a two order of magnitude speed up
str_length returns NA for invalid multibyte strings
fix small bug in internal