A third fundamental type of data is character data (also called string data). In R, character vectors may be used as data for analysis, and also as names of data objects or elements. (R is its own macro language, and we use the same functions to manipulate language elements as we use to manipulate data values for analysis. In R, it is all “data”.)
As data for analysis, character values can signify categories (see Chapter 9). An example might be a variable that classifies people as “Democrat”, “Green”, “Independent”, “Libertarian”, or “Republican” (American political affiliations).
affiliations <- c("Dem", "Dem", "Rep", "Rep", "Ind", "Lib")
table(affiliations)
affiliations
Dem Ind Lib Rep
  2   1   1   2
A single character value might also represent multiple categorical variables. The FIPS code “55025” is a combination of state (“55” for Wisconsin) and county (“025” for Dane) codes. And the date “2020-12-23” is a combination of a year, a month, and a day code (see Chapter 8).
You will want to be able to combine character values into a single value, and to separate a single value into parts.
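As a small preview of both directions, the sketch below builds the FIPS code from its two parts and then takes it apart again. It uses paste0() and substr(), which are covered in the sections that follow. The variable names are made up for illustration.

```r
# Illustrative names; the code values come from the FIPS example in the text
state  <- "55"    # Wisconsin
county <- "025"   # Dane

# Combine two values into one
fips <- paste0(state, county)
fips

# Separate one value into parts, by position
state_part  <- substr(fips, start = 1, stop = 2)
county_part <- substr(fips, start = 3, stop = 5)
state_part
county_part
```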
Another aspect of working with character data is that they may represent the raw input from multiple people. If you have ever used social media you will appreciate that people’s views of acceptable capitalization, spelling, and punctuation vary enormously. Cleaning raw data is another important part of working with character data.
7.1 Combining Character Values
One basic task when working with character data is to combine elements from two or more vectors. This is useful whenever you need to construct a single variable to represent a value identified by multiple other variables. For example, you might have data about calendar dates given as separate month, day, and year variables. To combine these into a single vector, use the paste() function (see help(paste)).
month <- c("Apr", "Dec", "Jan")
day <- c(3, 13, 23)
year <- c(2001, 2009, 1997)
date_str <- paste(year, month, day, sep="-")
date_str
[1] "2001-Apr-3" "2009-Dec-13" "1997-Jan-23"
The paste() operation is vectorized in much the same way that numeric operations are. Notice that the results are character values.
The sep argument specifies a character value to place between the data elements being combined. The default separator is a space. To have nothing added between the elements being combined, we can either specify a null string, sep="" (quotes with NO space between), or we can use the paste0() function.
You might also use this if you were constructing a set of variable names with a common prefix. Notice the recycling in this example.
paste("Q", 1:4, sep="")
[1] "Q1" "Q2" "Q3" "Q4"
paste0("Q", 3, c("a", "b", "c"))
[1] "Q3a" "Q3b" "Q3c"
paste() and paste0() recycle each argument so that it matches the length of the longest argument, and then concatenate element-wise. In the paste() statement above, the longest argument (1:4) is four elements long, so all others (here, just "Q") are recycled to length four (c("Q", "Q", "Q", "Q")). In the paste0() statement, the longest argument (c("a", "b", "c")) has three elements, so the others ("Q" and 3) are recycled until they are three elements long (c("Q", "Q", "Q") and c(3, 3, 3)). Then, they are concatenated element-by-element (the first element of each vector, the second element of each, and so on). Note that paste() will recycle an argument a non-whole number of times without a warning. Try paste0(c("a", "b", "c"), 1:2, "z") and notice how 1:2 is recycled to c(1, 2, 1) to have a length of three.
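The recycling described above can be made explicit with rep_len(), which repeats a vector out to a requested length. This is only a sketch for comparison. paste0() does the recycling internally, and the variable names here are illustrative.

```r
# Implicit recycling: 1:2 and "z" are stretched to length three
implicit <- paste0(c("a", "b", "c"), 1:2, "z")
implicit

# The same result with the recycling written out by hand
nums <- rep_len(1:2, 3)    # 1 2 1
tail <- rep_len("z", 3)    # "z" "z" "z"
explicit <- paste0(c("a", "b", "c"), nums, tail)

identical(implicit, explicit)
```

Because no warning is issued, it is worth checking argument lengths yourself when a partial recycle would be a mistake.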
7.2 Working Within Character Values
A character value is an ordered collection of characters drawn from some alphabet. R is capable of working within a “local” alphabet, converting locales, or working in Unicode (a universal alphabet). The details of switching alphabets get complicated quickly, so we will skip that here.
The most basic manipulations of character data values are selecting specific characters in a value (matching), removing selected characters, or adding characters.
Matching can be done either by position within a value, or by character.
In the character value “12:08pm” we could operate on the fourth and fifth characters (to find “08”), or we can specify that we want to operate on the character pair “08” (finding the fourth position). We are either looking for an arbitrary character that occupies a specific position, or we are looking for an arbitrary position occupied by a specific character.
7.3 Position Indexing
The substr() and substring() functions use positions and return character values. The regexpr() function matches characters and returns starting positions (it is an index function).
x <- c("12:08pm", "12:10pm")
substr(x, start=4, stop=5)
[1] "08" "10"
regexpr("08", x)
[1]  4 -1
attr(,"match.length")
[1]  2 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Here the character string sought is found at the fourth position in the first value, and not at all (-1) in the second value. (Everything after the first line of output is metadata, which makes this output hard to read.)
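If the metadata gets in the way, one option is to strip the attributes and keep only the positions. The sketch below uses as.integer() for that and pulls out the match lengths separately with attr(). The variable names are illustrative.

```r
x <- c("12:08pm", "12:10pm")
pos <- regexpr("08", x)

# Starting positions only, without the attributes
start <- as.integer(pos)
start

# The match lengths are still available when needed
len <- attr(pos, "match.length")
len

# A logical indicator of which values contained a match
found <- start > 0
found
```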
Although this example of regexpr() is very simple, a word of warning: character matching functions in R rely on regular expressions, a system of specifying character patterns that includes literal characters (as in the example above), wildcards, and position anchors. We’ll come back to this below.
To work with positions, it is often useful to know the length of a character value, for which we have the nchar() function.
If we wanted the last two characters of each character value, we could specify
x <- c("12:08pm", "12:10pm", "1:08pm")
x_len <- nchar(x)
substr(x, start=x_len-1, stop=x_len)
[1] "pm" "pm" "pm"
By using substr() on the left-hand side of the assignment operator, we can substitute in new substrings by position.
substr(x, start=x_len-1, stop=x_len) <- "am"
x
[1] "12:08am" "12:10am" "1:08am"
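The substring() function mentioned at the start of this section works the same way, except that its last argument has a very large default, so "from here to the end" needs only a starting position. A brief sketch, with illustrative variable names:

```r
x <- c("12:08pm", "12:10pm", "1:08pm")

# Last two characters: start two from the end, run to the end of each value
suffix <- substring(x, nchar(x) - 1)
suffix

# Everything except the last two characters
clock <- substring(x, 1, nchar(x) - 2)
clock
```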
7.4 Character Matching
While some character value manipulations are easily handled by position indexing, many others are handled more gracefully through character matching.
7.4.1 Global Wildcards
You may already be familiar with the concept of wildcards to specify patterns: computer operating systems all allow wildcards in searching for file names. These are sometimes referred to as global or glob wildcards.
For example, on a Windows computer you could open a Command window and type the following command
dir /b 0*.rmd
to get a list of all the Rmarkdown (“rmd”) files in your current folder beginning with the character “0” (if there are any).
02_data_types.rmd
03_data_structures.rmd
04_data_class.Rmd
05_numeric.Rmd
06_logical.rmd
07_character.RMD
08_dates.RMD
09map_Data_Wrangling_roadmap.RMD
09_categorical.rmd
The asterisk (*) wildcard matches any characters (zero or more) after the literal “0” at the beginning of file names, while the “.rmd” literal matches only files which end with that file extension.
Similarly, a question mark matches a single arbitrary character.
(In RStudio, you could open the Terminal tab, next to the Console, for a Unix-like shell. Drop the “/b”.)
With global wildcards, the pattern to match always specifies a string from beginning to end. So
- "02*" matches any character string beginning with “02”
- "*02" matches any string ending with “02”
- "*02*" matches any string containing “02”.
R does not use global wildcards directly, but the glob2rx function can translate this type of wildcard into a regular expression for you.
glob2rx("02*")
[1] "^02"
glob2rx("*02")
[1] "^.*02$"
glob2rx("*02*")
[1] "^.*02"
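As a sketch of how a translated pattern might be put to work, the example below applies the glob from the dir command to a character vector of file names. The files vector is a hypothetical subset of the listing above plus one extra name. Note that the Windows listing matched extensions regardless of case, while regular expression matching in R is case-sensitive unless ignore.case=TRUE is given.

```r
# Hypothetical file names, some taken from the directory listing above
files <- c("02_data_types.rmd", "04_data_class.Rmd",
           "07_character.RMD", "index.rmd")

pattern <- glob2rx("0*.rmd")
pattern

# Case-sensitive by default: only the lower-case extension matches
strict <- grep(pattern, files, value = TRUE)
strict

# Ignoring case picks up all files that begin with "0"
loose <- grep(pattern, files, value = TRUE, ignore.case = TRUE)
loose
```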
7.4.2 Regular Expression Wildcards
Regular expressions expand on the concept of wildcards, and allow us to match elements of arbitrary character vectors with more precise patterns than global wildcards. We expand the concept of a wildcard by separating
- what characters to match
- how many characters to match
- where to match (what position)
A single arbitrary character is specified as a period, “.”, much like the global question mark, “?”. For example, one way to get a vector of column names that are at least four letters long using a regular expression would be
cars <- mtcars
grep("....", names(cars), value=TRUE)
[1] "disp" "drat" "qsec" "gear" "carb"
The grep() function searches a character vector for elements that match a pattern. It returns position indexes by default, or values that contain a match with the value=TRUE argument. The grepl() (grep logical) function returns a logical vector indicating which elements matched. These two functions give us all three methods of specifying indexes along a vector.
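To make the three indexing methods concrete, the sketch below produces each kind of index from the same pattern and uses it to select the same columns of mtcars. The variable names are illustrative.

```r
cars <- mtcars
nm <- names(cars)

by_position <- grep("....", nm)                # integer positions
by_name     <- grep("....", nm, value = TRUE)  # matching values (column names)
by_logical  <- grepl("....", nm)               # TRUE/FALSE for every element

by_position
by_name
by_logical

# All three select the same columns
head(cars[, by_position], 3)
same_by_name    <- identical(cars[, by_position], cars[, by_name])
same_by_logical <- identical(cars[, by_position], cars[, by_logical])
```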
In addition to wildcard characters, we can also match literal characters, and literal substrings.
grep("a", names(cars), value=TRUE)
[1] "drat" "am" "gear" "carb"
grep("ar", names(cars), value=TRUE)
[1] "gear" "carb"
7.4.2.1 Position
In contrast to global wildcards, these patterns match anywhere within a character value; they are position-less. To specify positions, we have two regular expression anchors, which tie a pattern to the beginning (“^”) or the end (“$”) of a string.
grep("m", names(cars), value=TRUE) # any m
[1] "mpg" "am"
grep("^m", names(cars), value=TRUE) # begins with m
[1] "mpg"
grep("m$", names(cars), value=TRUE) # ends with m
[1] "am"
Although we have added position qualifiers to our patterns, notice that we are still specifying partial strings, not whole strings. To specify a complete string, we use both anchors! One way to find column names that are exactly two characters long would be
grep("^..$", names(cars), value=TRUE)
[1] "hp" "wt" "vs" "am"
Without both anchors, this example would find all column names at least two characters long, including those with three and four characters.
7.4.2.2 Repetition
So far we have been specifying one character at a time, but the regular expression syntax also includes the concept of repetition. There are six ways to specify how many matches are required:
- a question mark, “?”, matches zero or one time, making a character specification optional
- an asterisk, “*”, matches zero or more times; a character is optional but also may be repeated
- a plus, “+”, matches one or more times; a character is required and may be repeated
- braces with a number, “{n}”, matches exactly n times
- braces with two numbers, “{n,m}”, matches at least n times and no more than m times
- no repetition qualifier means match exactly once
So another way to get two-letter column names would be to specify
grep("^.{2}$", names(cars))
[1] 4 6 8 9
While the global wildcard “?” is replaced by the dot in regular expressions, the global wildcard “*” is replaced with the regular expression “.*”.
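The remaining repetition qualifiers can be tried out in the same way. The sketch below uses a range of lengths on the mtcars column names, and then a small made-up vector of spellings to contrast “?”, “*”, and “+” applied to a single literal character.

```r
nm <- names(mtcars)

# Names that are three or four characters long
three_or_four <- grep("^.{3,4}$", nm, value = TRUE)
three_or_four

# Made-up test values: zero, one, and two copies of "u"
spellings <- c("color", "colour", "colouur")

optional_u  <- grepl("^colou?r$", spellings)  # zero or one "u"
any_u       <- grepl("^colou*r$", spellings)  # zero or more "u"
at_least_u  <- grepl("^colou+r$", spellings)  # one or more "u"

optional_u
any_u
at_least_u
```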
7.4.2.3 Character Class
So far we have introduced arbitrary matches and literal matches, but regular expressions are able to work between these two extremes as well. We can specify classes (sets) of characters to match, and we can do this by itemizing the whole class, or using a shortcut name for some classes.
Square brackets, “[ ]”, are used to itemize classes, and to specify shortcut names. As an arbitrary example, column names that begin with “a” or “b” or “c” could be specified
grep("^[abc]", names(cars), value=TRUE)
[1] "cyl" "am" "carb"
Notice that this interacts with the repetition qualifiers. To require the first two characters to belong to the same character class we would specify
grep("^[abc]{2}", names(cars), value=TRUE)
[1] "carb"
The twelve shortcut names that are predefined in R are documented on the help("regex") page, and they include
- [:alpha:], alphabetic characters
- [:digit:], numerals
- [:punct:], punctuation marks
These shortcuts are POSIX character class names rather than something unique to R, and must themselves be used within class brackets. Contrast the first (correct) example with the second (incorrect) example. Why does the second line pick out “def”?
grep("[[:digit:]]", c("abc", "123", "def"), value=TRUE)
[1] "123"
grep("[:digit:]", c("abc", "123", "def"), value=TRUE)
[1] "def"
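Shortcut classes combine with anchors and repetition qualifiers just like itemized classes. The sketch below, using a made-up vector, checks for values made up entirely of letters, entirely of digits, or containing any punctuation mark.

```r
# Made-up test values
x <- c("abc", "123", "a-1", "wow!")

all_letters <- grepl("^[[:alpha:]]+$", x)   # letters from start to end
all_digits  <- grepl("^[[:digit:]]+$", x)   # numerals from start to end
has_punct   <- grepl("[[:punct:]]", x)      # punctuation anywhere

all_letters
all_digits
has_punct
```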
7.4.2.4 Regular Expression Metacharacters
We can match arbitrary characters, specified classes of characters, and most literal characters. However, we are using some characters as metacharacters with special meaning in regular expression patterns. The dot (period, .), asterisk (*), question mark (?), plus sign (+), caret (^), dollar sign ($), square brackets ([, ]), braces ({, }), dash (-), and a few more we haven’t discussed, all have a non-literal meaning. What if we want to use these as literal characters?
There are generally two ways to take a metacharacter and use it as a literal. We can specify it within a square bracket class, or we can escape it. Either method comes with caveats.
To “escape” a character (ignore its special meaning and use it as a literal) we typically think of preceding it with a backslash, “\”. However, it turns out that a backslash is also the escape character in R character strings (as well as a regular expression metacharacter that we have not discussed so far), so to use it as an escape character in regular expressions we double it. That is, to use an escape character, we need to first escape the escape character!
For example, to find a literal dollar sign contrast the correct specification with two mistakes.
grep("\\$", c("$12.95", "40502"), value=TRUE) # correct
grep("$", c("$12.95", "40502"), value=TRUE) # wrong: no slash = end of string
grep("\$", c("$12.95", "40502"), value=TRUE) # wrong: error
Error: '\$' is an unrecognized escape in character string starting ""\$"
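One way to see what the doubled backslash does is to look at the string R actually stores. The error above comes from R reading the string, before any pattern matching happens. In the sketch below, nchar() and writeLines() show that the pattern handed to grep() contains a single backslash followed by a dollar sign.

```r
pattern <- "\\$"

# Two characters are stored: one backslash and one dollar sign
n_chars <- nchar(pattern)
n_chars

# writeLines() shows the string as the regular expression engine sees it
writeLines(pattern)

matched <- grep(pattern, c("$12.95", "40502"), value = TRUE)
matched
```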
An alternative is to write (most) metacharacters within a class.
grep("[$]", c("$12.95", "40502"), value=TRUE) # correct
[1] "$12.95"
The caveat here is that the caret, dash, and backslash all have special meaning within character classes.
(A third approach is to turn off regular expression matching and use only literal matching. Use the fixed=TRUE argument.)
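A short sketch of the fixed=TRUE approach, reusing the values from the dollar sign example. With literal matching turned on, neither the dollar sign nor the dot needs escaping.

```r
x <- c("$12.95", "40502")

# Literal dollar sign, no escaping needed
dollar_fixed <- grep("$", x, value = TRUE, fixed = TRUE)
dollar_fixed

# As a regular expression, "." matches any character, so both values match
dot_regex <- grep(".", x, value = TRUE)
dot_regex

# As a fixed string, "." matches only a literal period
dot_fixed <- grep(".", x, value = TRUE, fixed = TRUE)
dot_fixed
```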
7.4.2.5 More
There is more to regular expression specification: negation, alternation, substrings, and other less used elements. Whole books have been written about regular expressions. To learn more, a useful reference is
7.5 Substitution
Simply identifying matching values is useful for creating indicator variables (grepl) or for creating sets of variable names (grep). But for data values, we often want to manipulate the values we identify. One of the main tools we will use for this is substitution (sub) and repeated substitution (gsub).
Substitution where a regular expression pattern appears at most once in each value is straightforward.
Returning to the time example, which we previously solved by positional substitution, we can use the very simple regular expression “pm” to identify matching characters to replace with “am”.
x <- c("12:08pm", "12:10pm", "1:08pm")
sub("pm", "am", x)
[1] "12:08am" "12:10am" "1:08am"
Substitution also works as a method of deleting matched characters, when the replacement is the null string (quotes with no space).
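The sketch below shows deletion with an empty replacement, and also the difference between sub() and gsub() when a pattern appears more than once in a value. The phone numbers are made-up values for illustration.

```r
# Made-up values with one or two dashes each
phone <- c("608-555-0123", "555-0199")

# sub() deletes only the first match in each value
first_only <- sub("-", "", phone)
first_only

# gsub() deletes every match in each value
all_matches <- gsub("-", "", phone)
all_matches
```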
7.6 Character Vector Exercises
Percent to fraction
Given a character vector with values in percent form, convert these to numerical proportions, values between 0 and 1.
x <- sprintf("%4.2f%%", runif(5)*100)
x
[1] "39.62%" "80.94%" "70.43%" "79.63%" "26.59%"
Currency
Currency is sometimes denoted with both a currency symbol and commas. Convert these to numeric values.
x <- c("$10", "$11.99", "$1,011.01")
Grouping values
There are really just three colors in this vector.
colors <- c("red", "rot", "rouge", "blue", "blau", "bleue", "bleu", "green")
Recode these data into three color categories.
Capitalization
Inconsistent capitalization is a problem with some alphabets.
colors2 <- c("Red", "blue", "red", "blue", "RED")
A quick solution is often to make everything either lower case or upper case. R has dedicated functions for this; see help("tolower") and fix this vector.
Translating wildcard patterns
The glob2rx() function translates character strings with wildcards (* for any string, ? for a single character) into regular expressions. We can translate "Merc *" (a string starting with “Merc” and a space, followed by anything) into “^Merc ”. Combining this with grep() allows us to select rows from our mtcars dataframe. (Note that value = TRUE returns values, while the default value = FALSE returns positions.)
glob2rx("Merc *")
[1] "^Merc "
grep(glob2rx("Merc *"), row.names(mtcars), value=TRUE)
[1] "Merc 240D"   "Merc 230"    "Merc 280"    "Merc 280C"   "Merc 450SE"  "Merc 450SL"
[7] "Merc 450SLC"
grep(glob2rx("Merc *"), row.names(mtcars))
[1] 8 9 10 11 12 13 14
mtcars[grep(glob2rx("Merc *"), row.names(mtcars)), ]
             mpg cyl  disp  hp drat   wt qsec vs am gear carb
Merc 240D   24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
Merc 230    22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
Merc 280    19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
Merc 280C   17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
Merc 450SE  16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
Merc 450SL  17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3
Now, try selecting rows from mtcars where the row name…
- starts with “Toyota” and a space
- starts with any four characters and then a space
- ends with 0
- ends with a space and then any three characters