R for Data Science | Introduction to Chapter 10

庄闪闪

Published 2021-04-09 11:30:32

Part of the column: 庄闪闪的R语言手册

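Working with strings using stringr

This chapter covers string manipulation with the stringr package and then uses regular expressions for pattern matching.

String basics

Creating strings

You can create a string with either single or double quotes: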

string1 <- "This is a string"
string2 <- 'To put a "quote" inside a string, use single quotes'

To include a literal single or double quote in a string, use \ to escape it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
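As a small sketch of the difference between how a string is typed and what it actually contains (the variable n_chars is illustrative and not part of the original), print() shows the escaped representation while writeLines() shows the raw contents:

```r
# Illustrative sketch: the backslash is part of the R source, not of the string.
double_quote <- "\""
single_quote <- '\''

# print() shows the escaped representation; writeLines() shows the raw contents.
print(double_quote)
writeLines(double_quote)

# Each string holds exactly one character.
n_chars <- nchar(c(double_quote, single_quote))
n_chars
```

This is why a printed string can look different from the text that a function such as writeLines() writes out.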

Multiple strings are usually stored in a character vector, which you can create with c():

c("one", "two", "three")
#> [1] "one" "two" "three"

String length

str_length() returns the number of characters in a string:

str_length("abc")
#> [1] 3

It also works on character vectors, and a missing value stays NA:

str_length(c("a", "R for data science", NA))
#> [1] 1 18 NA

Combining strings

To combine two or more strings, use str_c():

str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"

Use the sep argument to control how the strings are separated:

str_c("x", "y", sep = ", ")
#> [1] "x, y"
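A minimal sketch of two related behaviours of str_c() that the examples above do not show: recycling over vectors and the collapse argument. The input values here are made up for illustration.

```r
library(stringr)

# str_c() is vectorised: the length-one argument is recycled.
prefixed <- str_c("prefix-", c("a", "b", "c"))
prefixed
#> [1] "prefix-a" "prefix-b" "prefix-c"

# collapse joins the elements of a single vector into one string.
joined <- str_c(c("x", "y", "z"), collapse = ", ")
joined
#> [1] "x, y, z"
```

In short, sep goes between the arguments, while collapse goes between the elements of the result.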

Subsetting strings

You can extract part of a string with str_sub(). In addition to the string, str_sub() takes start and end arguments that give the inclusive positions of the substring:

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"

# Negative numbers count backwards from the end
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"
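str_sub() also has an assignment form, which can be sketched as follows. The vector fruits is an illustrative lower-case variant of the example above, and the out-of-range call shows that str_sub() returns as much as it can instead of failing.

```r
library(stringr)

fruits <- c("apple", "banana", "pear")

# Assignment form: replace the first character with its upper-case version.
str_sub(fruits, 1, 1) <- str_to_upper(str_sub(fruits, 1, 1))
fruits
#> [1] "Apple"  "Banana" "Pear"

# Asking for more characters than exist does not raise an error.
short <- str_sub("a", 1, 5)
short
#> [1] "a"
```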

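Matching patterns with regular expressions

We will learn regular expressions with str_view() and str_view_all(). Both functions take a character vector and a regular expression and show how they match.

Basic matches

str_view() shows whether each string matches the pattern and highlights the match if there is one: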

x <- c("apple", "banana", "pear")
str_view(x, "an")

A slightly more complex pattern uses ., which matches any character (except a newline):

str_view(x, ".a.")
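Because str_view() renders its result visually, the matches are not visible in plain text. As a rough sketch, the same two patterns can be checked with str_extract() (covered later in this article), which returns the first matched text or NA:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# First match of the literal pattern "an" in each string.
an_match <- str_extract(x, "an")
an_match
#> [1] NA   "an" NA

# ".a." needs one character on each side of an "a",
# so the leading "a" in "apple" does not match.
dot_match <- str_extract(x, ".a.")
dot_match
#> [1] NA    "ban" "ear"
```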

Anchors

^ matches the start of the string.
$ matches the end of the string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")

str_view(x, "a$")
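Using both anchors together forces a pattern to match the complete string. A minimal sketch, with an illustrative input vector that is not from the original:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Without anchors, the pattern matches anywhere in the string.
contains_apple <- str_detect(x, "apple")
contains_apple
#> [1] TRUE TRUE TRUE

# With ^ and $, only a string that is exactly "apple" matches.
only_apple <- str_detect(x, "^apple$")
only_apple
#> [1] FALSE  TRUE FALSE
```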

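Character classes and alternatives

Many special patterns match more than one character:

.: matches any character except a newline
\d: matches any digit
\s: matches any whitespace character (such as a space, tab, or newline)
[abc]: matches a, b, or c
[^abc]: matches any character except a, b, or c

Note: to create a regular expression containing \d or \s, you need to escape the \ in the string, so you type "\\d" or "\\s".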
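The classes listed above can be tried out with a short sketch like the following. The input strings are invented for illustration, and str_extract(), str_extract_all(), and str_replace_all() are stringr functions introduced later in this article.

```r
library(stringr)

# The R string "\\d" contains two characters: a backslash and a "d".
writeLines("\\d")
#> \d

# \d: the first digit in each string (NA when there is none).
first_digit <- str_extract(c("room 101", "no digits"), "\\d")
first_digit
#> [1] "1" NA

# \s: replace every whitespace character, including the tab.
underscored <- str_replace_all("a b\tc", "\\s", "_")
underscored
#> [1] "a_b_c"

# [^an]: every character that is neither "a" nor "n".
not_an <- str_extract_all("banana", "[^an]")[[1]]
not_an
#> [1] "b"
```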

Repetition

Another powerful feature of regular expressions is control over how many times a pattern matches.

?: 0 or 1 time.
+: 1 or more times.
*: 0 or more times.

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")

str_view(x, "CC+")

str_view(x, 'C[LX]+')

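You can also specify the number of matches precisely:

{n}: exactly n times
{n,}: n or more times
{,m}: at most m times (if your regex engine rejects this form, {0,m} is equivalent)
{n,m}: between n and m times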

str_view(x, "C{2}")

str_view(x, "C{2,}")

str_view(x, "C{2,3}")
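Since str_view() output is visual, here is a sketch that prints what each of the patterns above matches first, using str_extract() (introduced later in this article). The last pattern, "C{2,3}?", is an addition for comparison: appending ? makes a quantifier lazy, so it matches as few characters as possible.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# Patterns taken from the text above, plus one lazy variant for comparison.
patterns <- c("CC?", "CC+", "C[LX]+", "C{2}", "C{2,}", "C{2,3}", "C{2,3}?")

# str_extract() is vectorised over the pattern, so this returns the
# first match of each pattern in turn.
first_match <- str_extract(x, patterns)
names(first_match) <- patterns
first_match
```

By default the quantifiers are greedy, which is why "CC+" and "C{2,3}" both return "CCC" while "C{2,3}?" stops at "CC".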

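Grouping and backreferences

The following regular expression finds all fruits whose names contain a doubled letter, that is, the same character twice in a row: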

str_view(fruit, "(.)\\1", match = TRUE)

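.: matches any character.
(.): the parentheses capture the matched character as group 1, which is referenced as \\1; with two pairs of parentheses, the groups are \\1 and \\2.
\\1: a backreference to the text captured by group 1.

So (.)\\1 matches any character that is immediately followed by the same character. To match an abab pattern: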

str_view(fruit, "(..)\\1", match = TRUE)

To match an abba pattern:

str_view(fruit, "(.)(.)\\2\\1", match = TRUE)
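To see the three backreference patterns side by side without relying on rendered str_view() output, here is a minimal sketch on a small, made-up vector (candidates is hypothetical; the original uses stringr's fruit data). str_subset() keeps only the strings that match.

```r
library(stringr)

candidates <- c("apple", "banana", "pepper", "kiwi")

# aa: a character followed by itself ("pp" in apple and pepper).
doubled <- str_subset(candidates, "(.)\\1")
doubled
#> [1] "apple"  "pepper"

# abab: a pair of characters repeated ("anan" in banana).
abab <- str_subset(candidates, "(..)\\1")
abab
#> [1] "banana"

# abba: two characters followed by the same two reversed ("eppe" in pepper).
abba <- str_subset(candidates, "(.)(.)\\2\\1")
abba
#> [1] "pepper"
```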

Detecting matches

To determine whether each string in a character vector matches a pattern, use str_detect().

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE

A variation on str_detect() is str_count(): rather than a simple yes or no, it returns the number of matches in each string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels are there per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99
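Because str_detect() returns a logical vector, it combines naturally with sum(), mean(), and logical subsetting. A small sketch on the three-element vector used above (the variable names are illustrative):

```r
library(stringr)

x <- c("apple", "banana", "pear")

# TRUE counts as 1 and FALSE as 0, so sum() counts matching strings
# and mean() gives the proportion of strings that match.
starts_with_vowel <- str_detect(x, "^[aeiou]")
n_vowel_start <- sum(starts_with_vowel)
share_vowel_start <- mean(starts_with_vowel)

# Logical subsetting keeps the matching strings, like str_subset(x, "e").
with_e <- x[str_detect(x, "e")]
with_e
#> [1] "apple" "pear"
```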

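Extracting matches

To extract the actual text of a match, use str_extract(). We will use the Harvard sentences dataset (described on Wikipedia), which stringr provides as sentences: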

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks."
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."
#> [4] "These days a chicken leg is a rare dish."
#> [5] "Rice is often served in round bowls."
#> [6] "The juice of lemons makes fine punch."

Suppose we want to find all sentences that contain a color. We first create a vector of color names and then turn it into a single regular expression:

colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(colors, collapse = "|")
color_match
#> [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a color and then extract the color from each one to see which colors appear:

has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"
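Two details are worth a quick sketch: str_extract() returns only the first match per string, and a bare alternation also matches inside longer words. The sentences and the shortened pattern below are invented for illustration.

```r
library(stringr)

color_match <- "red|blue"
s <- c("The red and blue flags", "The lamp flickered")

# Only the first match per string; note "red" is found inside "flickered".
first_only <- str_extract(s, color_match)
first_only
#> [1] "red" "red"

# str_extract_all() returns every match as a list, one element per string.
all_matches <- str_extract_all(s, color_match)

# Word boundaries (\b) restrict the match to whole words.
whole_words <- str_extract(s, "\\b(red|blue)\\b")
whole_words
#> [1] "red" NA
```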

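Grouped matches

str_match() gives each individual group. Instead of a character vector, it returns a matrix with one column for the complete match followed by one column for each group: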

noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_match(noun)
#>       [,1]         [,2]  [,3]
#>  [1,] "the smooth" "the" "smooth"
#>  [2,] "the sheet"  "the" "sheet"
#>  [3,] "the depth"  "the" "depth"
#>  [4,] "a chicken"  "a"   "chicken"
#>  [5,] "the parked" "the" "parked"
#>  [6,] "the sun"    "the" "sun"
#>  [7,] "the huge"   "the" "huge"
#>  [8,] "the ball"   "the" "ball"
#>  [9,] "the woman"  "the" "woman"
#> [10,] "a helps"    "a"   "helps"
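A self-contained sketch of the same idea on two made-up strings shows how the result matrix is indexed and what happens when a string has no match (the whole row is NA):

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"

# One row per input string: full match, then group 1, then group 2.
m <- str_match(c("the cat sat on a mat", "hello"), noun)
m

# Column 2 holds the article and column 3 the word that follows it.
article <- m[1, 2]
word <- m[1, 3]
```

Only the first match in each string is reported, which is why "a mat" does not appear in the first row.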

Replacing matches

str_replace() replaces the first match in each string with a new string; str_replace_all() replaces every match.

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"

By supplying a named vector, str_replace_all() can perform multiple replacements at once:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"

Instead of replacing with a fixed string, you can use backreferences to insert components of the match. The following code swaps the second and third words:

sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."
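The same backreference technique works for simple reformatting tasks. As a minimal sketch, assuming names of the form "first last" (the vector full_names is hypothetical), the two captured words can be swapped and separated by a comma:

```r
library(stringr)

full_names <- c("Ada Lovelace", "Alan Turing")

# Group 1 is the first word and group 2 the second;
# the replacement string refers to them as \\1 and \\2.
swapped <- str_replace(full_names, "(\\w+) (\\w+)", "\\2, \\1")
swapped
#> [1] "Lovelace, Ada" "Turing, Alan"
```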

Splitting

str_split() splits a string into pieces. For example, we can split sentences into words:

sentences %>%
  head(5) %>%
  str_split(" ")
#> [[1]]
#> [1] "The" "birch" "canoe" "slid" "on" "the"
#> [7] "smooth" "planks."
#>
#> [[2]]
#> [1] "Glue" "the" "sheet" "to"
#> [5] "the" "dark" "blue" "background."
#>
#> [[3]]
#> [1] "It's" "easy" "to" "tell" "the" "depth" "of"
#> [8] "a" "well."
#>
#> [[4]]
#> [1] "These" "days" "a" "chicken" "leg" "is"
#> [7] "a" "rare" "dish."
#>
#> [[5]]
#> [1] "Rice" "is" "often" "served" "in" "round"
#> [7] "bowls."
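Splitting on a space leaves punctuation attached to the last word ("planks.", "bowls."). A short sketch of two alternatives that stringr provides: splitting on word boundaries with boundary("word"), and limiting the number of pieces with n together with simplify = TRUE to get a matrix. The fields vector is invented for illustration.

```r
library(stringr)

# Splitting on a space keeps the trailing full stop.
by_space <- str_split("The birch canoe.", " ")[[1]]
by_space
#> [1] "The"    "birch"  "canoe."

# Splitting on word boundaries drops the punctuation.
by_word <- str_split("The birch canoe.", boundary("word"))[[1]]
by_word
#> [1] "The"   "birch" "canoe"

# n limits the number of pieces; simplify = TRUE returns a character matrix.
fields <- c("Name: Hadley", "Country: NZ")
parts <- str_split(fields, ": ", n = 2, simplify = TRUE)
parts
```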

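Locating matches

str_locate() gives the start and end position of the first match in each string; str_locate_all() gives the positions of every match.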

x <- c("apple", "banana", "pear")
str_locate(x,"a")
#>      start end
#> [1,]     1   1
#> [2,]     2   2
#> [3,]     3   3
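A brief sketch of str_locate_all(), and of feeding located positions back into str_sub() to pull out the matched text (the variable names are illustrative):

```r
library(stringr)

# Every "a" in "banana": a list with one matrix per input string.
all_a <- str_locate_all("banana", "a")[[1]]
starts <- all_a[, "start"]
starts
#> [1] 2 4 6

# Use the located positions to extract the first "an".
loc <- str_locate("banana", "an")
piece <- str_sub("banana", loc[, "start"], loc[, "end"])
piece
#> [1] "an"
```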


Originally published: 2021-02-06
