Unicode Text Processing

Process and print 'UTF-8' encoded international text (Unicode). Input, validate, normalize, encode, format, and display.


utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R's UTF-8 handling.

Installation

utf8 is available on CRAN. To install the latest released version, run the following command in R:

install.packages("utf8")

Development version

To install the latest development version, run the following:

tmp <- tempfile()
system2("git", c("clone", "--recursive", shQuote("https://github.com/patperry/r-utf8.git"), shQuote(tmp)))
devtools::install(tmp)

Note that utf8 uses a git submodule, so you cannot use devtools::install_github.

Usage

Validate character data and convert to UTF-8

Use as_utf8 to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:

# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails
#> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4
 
# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
#> [1] "façile" "façile" "façile"
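
When the input comes from files of uncertain origin, it can help to find the bad entries before as_utf8 raises an error. The following is a minimal sketch using utf8_valid, which the package exports alongside as_utf8. The decision to re-mark every invalid entry as latin-1 is an assumption made purely for illustration; in practice the real source encoding has to be known from elsewhere.

```r
library(utf8)

# Same example vector as above: entry 2 holds latin-1 bytes but is
# declared as UTF-8.
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# utf8_valid() returns one logical value per element instead of
# stopping with an error.
ok <- utf8_valid(x)

# Assumption for illustration only: treat every invalid entry as latin-1.
Encoding(x[!ok]) <- "latin1"

# With the declared encodings corrected, the conversion goes through.
clean <- as_utf8(x)
```

Checking with utf8_valid first keeps the error handling in your own code, which is convenient when only a few elements of a long vector are mislabeled.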

Normalize data

Use utf8_normalize to convert to Unicode composed normal form (NFC). Optionally, apply compatibility maps for NFKC normal form, or apply case folding.

# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
#> [1] "Å" "Å" "Å"
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE
 
# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)
#> [1] "grösse"
 
# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",
               map_compat = TRUE)
#> [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."
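
One common use of normalization is comparing strings that may differ only in case or in compatibility characters. The sketch below combines the map_case and map_compat options for that purpose. The helper same_text is a hypothetical name introduced here and is not part of utf8.

```r
library(utf8)

# Hypothetical helper (not part of utf8): compare two character vectors
# after case folding and compatibility mapping.
same_text <- function(a, b) {
  fold <- function(s) utf8_normalize(s, map_case = TRUE, map_compat = TRUE)
  fold(a) == fold(b)
}

# "fi" ligature (U+FB01) followed by "le", against upper-case ASCII
same_text("\ufb01le", "FILE")

# angstrom sign (U+212B) against lower-case a with ring above (U+00E5)
same_text("\u212b", "\u00e5")
```

Both comparisons should come out equal, because the compatibility map expands the ligature to "fi" and case folding removes the case difference.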

Print emoji

On some platforms (including macOS), the R implementation of print uses an outdated version of the Unicode standard to determine which characters are printable. Use utf8_print for an updated print function:

print(intToUtf8(0x1F600 + 0:79)) # with default R print function
#> [1] "\U0001f600\U0001f601\U0001f602\U0001f603\U0001f604\U0001f605\U0001f606\U0001f607\U0001f608\U0001f609\U0001f60a\U0001f60b\U0001f60c\U0001f60d\U0001f60e\U0001f60f\U0001f610\U0001f611\U0001f612\U0001f613\U0001f614\U0001f615\U0001f616\U0001f617\U0001f618\U0001f619\U0001f61a\U0001f61b\U0001f61c\U0001f61d\U0001f61e\U0001f61f\U0001f620\U0001f621\U0001f622\U0001f623\U0001f624\U0001f625\U0001f626\U0001f627\U0001f628\U0001f629\U0001f62a\U0001f62b\U0001f62c\U0001f62d\U0001f62e\U0001f62f\U0001f630\U0001f631\U0001f632\U0001f633\U0001f634\U0001f635\U0001f636\U0001f637\U0001f638\U0001f639\U0001f63a\U0001f63b\U0001f63c\U0001f63d\U0001f63e\U0001f63f\U0001f640\U0001f641\U0001f642\U0001f643\U0001f644\U0001f645\U0001f646\U0001f647\U0001f648\U0001f649\U0001f64a\U0001f64b\U0001f64c\U0001f64d\U0001f64e\U0001f64f"
 
utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​…"
 
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​😬​😭​😮​😯​😰​😱​😲​😳​😴​😵​😶​😷​😸​😹​😺​😻​😼​😽​😾​😿​🙀​🙁​🙂​🙃​🙄​🙅​🙆​🙇​🙈​🙉​🙊​🙋​🙌​🙍​🙎​🙏​"
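
The width and escaping logic behind utf8_print is also available through utf8_width, utf8_format, and utf8_encode. The sketch below shows one way these might be used when building output by hand. Passing utf8 = TRUE or utf8 = FALSE explicitly is a choice made here so the result does not depend on the current terminal; by default the argument falls back to the result of output_utf8().

```r
library(utf8)

x <- c("abc", "fa\u00E7ile", intToUtf8(0x1F600))

# Display width in terminal columns, assuming the output can render UTF-8
w <- utf8_width(x, utf8 = TRUE)

# Pad the entries to a common width so they line up in a column
padded <- utf8_format(x, utf8 = TRUE)

# Escape characters that an ASCII-only output could not display
escaped <- utf8_encode(x, utf8 = FALSE)
```

Measuring with utf8_width rather than nchar matters for text such as emoji, which usually occupy more than one column.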

Citation

Cite utf8 with the following BibTeX entry:

@Manual{,
  title = {utf8: Unicode Text Processing},
  author = {Patrick O. Perry},
  year = {2018},
  note = {R package version 1.1.4},
  url = {https://github.com/patperry/r-utf8},
}

Contributing

The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. Bug reports can be filed at https://github.com/patperry/r-utf8/issues.

This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.

News

utf8 1.1.4 (2018-05-24)

BUG FIXES

  • Fix build on Solaris (#7, reported by @krlmlr).

  • Fix rendering of emoji ZWJ sequences like "\U1F469\U200D\U2764\UFE0F\U200D\U1F48B\U200D\U1F469".

utf8 1.1.3 (2018-01-03)

MINOR IMPROVEMENTS

  • Make output_utf8() always return TRUE on Windows, so that characters in the user's native locale don't get escaped by utf8_encode(). The downside of this change is that on Windows, utf8_width() reports the wrong values for characters outside the user's locale when stdout() is redirected by knitr or another process.

  • When truncating long strings via utf8_format(), use an ellipsis that is printable in the user's native locale ("\u2026" or "...").

utf8 1.1.2 (2017-12-14)

BUG FIXES

  • Fix bug in utf8_format() with non-NULL width argument.

utf8 1.1.1 (2017-11-28)

BUG FIXES

  • Fix PROTECT bug in as_utf8().

utf8 1.1.0 (2017-11-20)

NEW FEATURES

  • Added output_ansi() and output_utf8() functions to test for output capabilities.

MINOR IMPROVEMENTS

  • Add utf8 argument to utf8_encode(), utf8_format(), utf8_print(), and utf8_width() for precise control over assumed output capabilities; defaults to the result of output_utf8().

  • Add ability to style backslash escapes with the escapes arguments to utf8_encode() and utf8_print(). Switch from "faint" styling to no styling by default.

  • Slightly reword error messages for as_utf8().

  • Fix (spurious) rchk warnings.

BUG FIXES

  • Fix bug in utf8_width() determining width of non-ASCII strings when LC_CTYPE=C.

DEPRECATED AND DEFUNCT

  • No longer export the C version of as_utf8() (the R version is still present).

utf8 1.0.0 (2017-11-06)

NEW FEATURES

  • Split off functions as_utf8(), utf8_valid(), utf8_normalize(), utf8_encode(), utf8_format(), utf8_print(), and utf8_width() from corpus package.

  • Added special handling for Unicode grapheme clusters in formatting and width measurement functions.

  • Added ANSI styling to escape sequences.

  • Added ability to style row and column names in utf8_print().

1.1.4 by Patrick O. Perry, 7 months ago


https://github.com/patperry/r-utf8


Report a bug at https://github.com/patperry/r-utf8/issues


Browse source code at https://github.com/cran/utf8


Authors: Patrick O. Perry [aut, cph, cre], Unicode, Inc. [cph, dtc] (Unicode Character Database)



License: Apache License (== 2.0) | file LICENSE


Suggests knitr, rmarkdown, testthat


Imported by corpus, deeplr, pillar.

