R: Character Encodings and 'stringi' (2024)

about_encoding {stringi}    R Documentation

Character Encodings and stringi

Description

This manual page explains how stringi deals with character strings in various encodings.

In particular we should note that:

  • R lets strings in ASCII, UTF-8, and your platform's native encoding coexist. A character vector printed on the console by calling print or cat is silently re-encoded to the native encoding.

  • Functions in stringi process each string internally in Unicode, the most universal character encoding ever. Even if a string is given in the native encoding, i.e., your platform's default one, it will be converted to Unicode (precisely: UTF-8 or UTF-16).

  • Most stringi functions always return UTF-8 encoded strings, regardless of the input encoding. What is more, the functions have been optimized for UTF-8/ASCII input (they have competitive, if not better, performance, especially when performing more complex operations like string comparison, sorting, and even concatenation). Thus, it is best to rely solely on cascading calls to stringi operations.

Details

Quoting the ICU User Guide, 'Hundreds of encodings have been developed over the years, each for small groups of languages and for special purposes. As a result, the interpretation of text, input, sorting, display, and storage depends on the knowledge of all the different types of character sets and their encodings. Programs have been written to handle either one single encoding at a time and switch between them, or to convert between external and internal encodings.'

'Unicode provides a single character set that covers the major languages of the world, and a small number of machine-friendly encoding forms and schemes to fit the needs of existing applications and protocols. It is designed for best interoperability with both ASCII and ISO-8859-1 (the most widely used character sets) to make it easier for Unicode to be used in almost all applications and protocols' (see the ICU User Guide).

The Unicode Standard determines the way to map any possible character to a numeric value – a so-called code point. Such code points, however, have to be stored somehow in the computer's memory. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit integer (compare the ICU FAQ).
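As a rough illustration of the three encoding forms, the sketch below counts how many bytes a single character occupies in each of them. It relies on stri_encode with to_raw=TRUE; the helper name bytes_per_form is hypothetical and not part of stringi.

```r
library(stringi)

# Hypothetical helper: number of bytes one character needs in each
# Unicode encoding form (big-endian variants, so no BOM is emitted).
bytes_per_form <- function(ch) {
  sapply(c("UTF-8", "UTF-16BE", "UTF-32BE"), function(enc) {
    length(stri_encode(ch, from = "UTF-8", to = enc, to_raw = TRUE)[[1]])
  })
}

bytes_per_form("a")           # ASCII letter:          1, 2, 4 bytes
bytes_per_form("\u0105")      # a with ogonek, U+0105: 2, 2, 4 bytes
bytes_per_form("\U0001F600")  # U+1F600, outside BMP:  4, 4, 4 bytes
```

The last call shows the "one or two 16-bit code units" case: a code point above U+FFFF needs a surrogate pair in UTF-16, hence four bytes.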

Unicode can be thought of as a superset of the spectrum of characters supported by any given code page.

UTF-8 and UTF-16

For portability reasons, the UTF-8 encoding is the most natural choice for representing Unicode character strings in R. UTF-8 has ASCII as its subset (code points 1–127 represent the same characters in both of them). Code points larger than 127 are represented by multi-byte sequences (from 2 to 4 bytes). Please note that not all sequences of bytes are valid UTF-8; compare stri_enc_isutf8.
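A minimal sketch of the validity check mentioned above, using raw vectors so that the exact bytes are visible. The variable names are illustrative only.

```r
library(stringi)

ascii_only <- charToRaw("abc")       # 61 62 63: ASCII is also valid UTF-8
valid_utf8 <- as.raw(c(0xC4, 0x85))  # U+0105 encoded as two UTF-8 bytes
bad_utf8   <- as.raw(c(0xC4, 0x20))  # lead byte followed by a space, not a continuation byte

inputs   <- list(ascii_only, valid_utf8, bad_utf8)
is_utf8  <- stri_enc_isutf8(inputs)   # TRUE  TRUE  FALSE
is_ascii <- stri_enc_isascii(inputs)  # TRUE  FALSE FALSE
```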

Most of the computations in stringi are performed internally using either UTF-8 or UTF-16 encodings (this depends on the type of service you request: some ICU services are designed only to work with UTF-16). Due to such a choice, with stringi you get the same result on each platform, which is – unfortunately – not the case with base R's functions (for instance, it is known that performing a regular expression search under Linux on some texts may give you a different result from the one obtained under Windows). We really had portability in mind while developing our package!

We have observed that R correctly handles UTF-8 strings regardless of your platform's native encoding (see below). Therefore, we decided that most functions in stringi will output their results in UTF-8. This speeds up computations on cascading calls to our functions: the strings do not have to be re-encoded each time.
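The following sketch illustrates the point about UTF-8 output, assuming a latin1-marked input string. The sample word is arbitrary.

```r
library(stringi)

x <- "caf\xe9"            # byte 0xE9 is 'e with acute' in ISO-8859-1
Encoding(x) <- "latin1"   # declare the encoding explicitly

y <- stri_trans_toupper(x)  # processed in Unicode, returned as UTF-8

stri_enc_mark(x)  # "latin1"
stri_enc_mark(y)  # "UTF-8"
```

Any further stringi call applied to y then starts from UTF-8 and needs no conversion step.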

Note that some Unicode characters may have an ambiguous representation. For example, “a with ogonek” (one code point) and “a”+“ogonek” (two code points) are semantically the same. stringi provides functions to normalize character sequences, see stri_trans_nfc for discussion. However, denormalized strings appear very rarely in typical string processing activities.
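A short sketch of the ambiguity described above: the same visible character written as one code point and as a base letter followed by a combining mark, then brought to a common form with stri_trans_nfc and stri_trans_nfd.

```r
library(stringi)

composed   <- "\u0105"   # LATIN SMALL LETTER A WITH OGONEK
decomposed <- "a\u0328"  # 'a' followed by COMBINING OGONEK

composed == decomposed                  # FALSE: the code points differ
stri_length(c(composed, decomposed))    # 1 2 (code point counts)

stri_trans_nfc(decomposed) == composed  # TRUE after composition (NFC)
stri_trans_nfd(composed) == decomposed  # TRUE after decomposition (NFD)
```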

Additionally, do note that stringi silently removes byte order marks (BOMs; they may incidentally appear in a string read from a text file) from UTF-8-encoded strings, see stri_enc_toutf8.

Character Encodings in R

Data in memory are just bytes (small integer values). An encoding is a way to represent characters with such numbers; it is a semantic 'key' to understand a given byte sequence. For example, in ISO-8859-2 (Central European), the value 177 represents Polish “a with ogonek”, and in ISO-8859-1 (Western European), the same value denotes the “plus-minus” sign. Thus, a character encoding is a translation scheme: we need to communicate with R somehow, relying on how it represents strings.
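The byte-value example above can be reproduced with stri_encode, which accepts a raw vector as input. This is only a sketch; the variable names are illustrative.

```r
library(stringi)

b <- as.raw(177)  # a single byte, 0xB1

# The same byte interpreted under two different 8-bit encodings
as_latin2 <- stri_encode(b, from = "ISO-8859-2", to = "UTF-8")  # a with ogonek, U+0105
as_latin1 <- stri_encode(b, from = "ISO-8859-1", to = "UTF-8")  # plus-minus sign, U+00B1

stri_enc_toutf32(c(as_latin2, as_latin1))  # code points 261 and 177
```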

Overall, R has a very simple encoding marking mechanism, see stri_enc_mark. There is an implicit assumption that your platform's default (native) encoding always extends ASCII. stringi checks that whenever your native encoding is being detected automatically on ICU's initialization and each time you change it manually by calling stri_enc_set.

Character strings in R (internally) can be declared to be in:

  • UTF-8;

  • latin1, i.e., either ISO-8859-1 (Western European on Linux, OS X, and other Unixes) or WINDOWS-1252 (Windows);

  • bytes – for strings that should be manipulated as sequences of bytes.

Moreover, there are two other cases:

  • ASCII – for strings consisting only of byte codes not greater than 127;

  • native (a.k.a. unknown in Encoding; quite a misleading name: no explicit encoding mark) – for strings that are assumed to be in your platform's native (default) encoding. This can represent UTF-8 if you are an OS X user, or some 8-bit Windows code page, for example. The native encoding used by R may be determined by examining the LC_CTYPE category, see Sys.getlocale.
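The declarations listed above can be inspected with base R's Encoding and with stri_enc_mark. The sketch below builds one string per explicit mark; the variable names are illustrative, and the native case is left out because its outcome depends on the platform.

```r
library(stringi)

ascii_str  <- "abc"     # only bytes <= 127
utf8_str   <- "\u0105"  # \u escapes yield UTF-8-marked strings
latin1_str <- "\xb1"; Encoding(latin1_str) <- "latin1"
bytes_str  <- "\xb1"; Encoding(bytes_str)  <- "bytes"

x <- c(ascii_str, utf8_str, latin1_str, bytes_str)

Encoding(x)       # "unknown" "UTF-8" "latin1" "bytes"
stri_enc_mark(x)  # "ASCII"   "UTF-8" "latin1" "bytes"
```

Note how Encoding reports "unknown" for the ASCII string, whereas stri_enc_mark distinguishes the ASCII case.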

Intuitively, “native” strings result from reading a string from stdin (e.g., keyboard input). This makes sense: your operating system works in some encoding and provides R with some data.

Each time a stringi function encounters a string declared in native encoding, it assumes that the input data should be translated from the default encoding, i.e., the one returned by stri_enc_get (unless you know what you are doing, the default encoding should only be changed if the automatic encoding detection process fails on stringi load).

Functions which allow 'bytes' encoding markings are very rare in stringi, and were carefully selected. These are: stri_enc_toutf8 (with argument is_unknown_8bit=TRUE), stri_enc_toascii, and stri_encode.

Finally, note that R lets strings in ASCII, UTF-8, and your platform's native encoding coexist. A character vector printed with print, cat, etc., is silently re-encoded so that it can be properly shown, e.g., on the console.

Encoding Conversion

Apart from automatic conversion from the native encoding, you may re-encode a string manually, for example when you read it from a file created on a different platform. Call stri_enc_list for the list of encodings supported by ICU. Note that converter names are case-insensitive and ICU tries to normalize the encoding specifiers. Leading zeroes are ignored in sequences of digits (if further digits follow), and all non-alphanumeric characters are ignored. Thus the strings 'UTF-8', 'utf_8', 'u*Tf08' and 'Utf 8' are equivalent.
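A quick way to see this normalization at work is to decode the same bytes with each of the four specifiers quoted above. This is a sketch under the assumption that all four resolve to the same ICU converter, as the text states.

```r
library(stringi)

specs      <- c("UTF-8", "utf_8", "u*Tf08", "Utf 8")
utf8_bytes <- as.raw(c(0xC4, 0x85))  # U+0105 in UTF-8

# Decode the same two bytes once per spelling of the converter name
decoded <- vapply(
  specs,
  function(s) stri_encode(utf8_bytes, from = s, to = "UTF-8"),
  character(1)
)

all(decoded == "\u0105")  # every spelling selects the UTF-8 converter
```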

The stri_encode function allows you to convert between any given encodings (in some cases you will obtain bytes-marked strings, or even lists of raw vectors, e.g., for UTF-16). There are also some useful more specialized functions, like stri_enc_toutf32 (converts a character vector to a list of integers, where one code point is exactly one numeric value) or stri_enc_toascii (substitutes all non-ASCII bytes with the SUBSTITUTE CHARACTER, which plays a similar role as R's NA value).
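The conversion functions named above can be tried on a short string as follows. This is a minimal sketch; the sample string is arbitrary.

```r
library(stringi)

x <- "a\u0105b"  # 'a', a with ogonek, 'b'

# One integer per code point
cp   <- stri_enc_toutf32(x)[[1]]      # 97 261 98
back <- stri_enc_fromutf32(list(cp))  # round trip to a UTF-8 string

# UTF-16 output is returned as raw bytes when to_raw = TRUE
utf16 <- stri_encode(x, from = "UTF-8", to = "UTF-16BE", to_raw = TRUE)[[1]]
length(utf16)                         # 6: three 16-bit code units

# Non-ASCII content is replaced by the SUBSTITUTE character (0x1A)
ascii <- stri_enc_toascii(x)
stri_enc_isascii(ascii)               # TRUE
```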

There are also some routines for automated encoding detection, see, e.g., stri_enc_detect.

Encoding Detection

Given a text file, one has to know how to interpret (encode) raw data in order to obtain meaningful information.

Encoding detection is always an imprecise operation and needs a considerable amount of data. However, in the case of some encodings (like UTF-8, ASCII, or UTF-32) a “false positive” byte sequence is quite rare (statistically speaking).

Check out stri_enc_detect (among others) for a useful function in this category.
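As a sketch of how detection might be tried, the code below encodes a short Polish sentence in ISO-8859-2 and asks stri_enc_detect for candidates. The sample is deliberately small, so the ranking it produces should not be trusted; no particular result is assumed here.

```r
library(stringi)

# A Polish pangram fragment, written with \u escapes for portability
txt <- "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144."

# Pretend these bytes came from a file of unknown encoding
raw_bytes <- stri_encode(txt, from = "UTF-8", to = "ISO-8859-2", to_raw = TRUE)[[1]]

# A cheap, reliable first check: these bytes are not valid UTF-8
stri_enc_isutf8(raw_bytes)  # FALSE

# Candidate encodings with confidence scores, best guess first
guess <- stri_enc_detect(raw_bytes)[[1]]
head(guess)
```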

Author(s)

Marek Gagolewski and other contributors

References

Unicode Basics – ICU User Guide, https://unicode-org.github.io/icu/userguide/icu/unicode.html

Conversion – ICU User Guide, https://unicode-org.github.io/icu/userguide/conversion/

Converters – ICU User Guide, https://unicode-org.github.io/icu/userguide/conversion/converters.html (technical details)

UTF-8, UTF-16, UTF-32 & BOM – ICU FAQ, https://www.unicode.org/faq/utf_bom.html

See Also

The official online manual of stringi at https://stringi.gagolewski.com/

Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02

Other stringi_general_topics: about_arguments, about_locale, about_search_boundaries, about_search_charclass, about_search_coll, about_search_fixed, about_search_regex, about_search, about_stringi

Other encoding_management: stri_enc_info(), stri_enc_list(), stri_enc_mark(), stri_enc_set()

Other encoding_detection: stri_enc_detect2(), stri_enc_detect(), stri_enc_isascii(), stri_enc_isutf16be(), stri_enc_isutf8()

Other encoding_conversion: stri_enc_fromutf32(), stri_enc_toascii(), stri_enc_tonative(), stri_enc_toutf32(), stri_enc_toutf8(), stri_encode()

[Package stringi version 1.8.3 Index]
