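I have been trying to write two regular expressions to do two things: one to extract the number from each string, and one to extract the words that follow the last hyphen.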
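I would like to store the numbers in a column called "category" and the words in a column called "diagnosis".
The strings are in the column named "GROUPER_NAME".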
df <- structure(list(GROUPER_ID = structure(c("9001742130", "9001742138",
"9001742058", "9001742062", "9001742102", "9001742247", "9001742055",
"9001742158", "9001742036", "9001742053"), label = "GROUPER_ID", format.sas = "$"),
GROUPER_NAME = structure(c("EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
"EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS",
"EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS",
"EDG ICD HCUP CCS 62 (PREDICTIVE MODELS-VERSION 1.0)-COAGULATION AND HEMORRHAGIC DISORDERS",
"EDG ICD HCUP CCS 102 (PREDICTIVE MODELS-VERSION 1.0)-NONSPECIFIC CHEST PAIN",
"EDG ICD HCUP CCS 247 (PREDICTIVE MODELS-VERSION 1.0)-LYMPHADENITIS",
"EDG ICD HCUP CCS 55 (PREDICTIVE MODELS-VERSION 1.0)-FLUID AND ELECTROLYTE DISORDERS",
"EDG ICD HCUP CCS 158 (PREDICTIVE MODELS-VERSION 1.0)-CHRONIC KIDNEY DISEASE",
"EDG ICD HCUP CCS 36 (PREDICTIVE MODELS-VERSION 1.0)-CANCER OF THYROID",
"EDG ICD HCUP CCS 53 (PREDICTIVE MODELS-VERSION 1.0)-DISORDERS OF LIPID METABOLISM"
), label = "GROUPER_NAME", format.sas = "$")), row.names = c(NA,
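-10L), class = c("tbl_df", "tbl", "data.frame"))
For the first example, I would like "159" and "urinary tract infection" to go into the "category" and "diagnosis" columns, respectively. I tried to adapt some of the solutions posted here to my case, but I am very bad at regular expressions and could not get them to work. Any help would be greatly appreciated!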
Posted on 2021-06-17 17:44:53
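We can use sub from base R. Capture the digits (\\d+) that follow the prefix substring, and the characters that come after the ")" and "-". In the replacement, refer to the capture groups with backreferences (\\1, \\2) joined by ":", and read the result into a two-column data.frame with read.csv.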
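As a small illustration of what the two capture groups hold, the same pattern can be tried on a single string with regexec and regmatches. This is only a sketch: the variable names x, pattern, m, and joined are made up for the example, and the string is the first GROUPER_NAME value from the question.

```r
# First GROUPER_NAME value from the question's data
x <- "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE"

# Four space-separated words, then the number (group 1), then everything
# after the ")-" (group 2)
pattern <- "\\w+ \\w+ \\w+ \\w+ (\\d+)\\s.*\\)-(.*)"

# regexec() returns the full match followed by one element per capture group
m <- regmatches(x, regexec(pattern, x))[[1]]
category  <- m[2]
diagnosis <- m[3]

# The same groups joined by ":" through backreferences in sub()
joined <- sub(pattern, "\\1:\\2", x)

print(category)
print(diagnosis)
print(joined)
```

The answer's code that follows applies the same pattern to the whole column with sub.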
# Rewrite each string as "number:diagnosis", then parse the result as ":"-separated text
read.csv(text = sub("\\w+ \\w+ \\w+ \\w+ (\\d+)\\s.*\\)-(.*)",
    "\\1:\\2", df$GROUPER_NAME), sep = ":", header = FALSE,
    col.names = c("category", "diagnosis"))
Output:
   category diagnosis
1       130 PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
2       138 ESOPHAGEAL DISORDERS
3        58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
4        62 COAGULATION AND HEMORRHAGIC DISORDERS
5       102 NONSPECIFIC CHEST PAIN
6       247 LYMPHADENITIS
7        55 FLUID AND ELECTROLYTE DISORDERS
8       158 CHRONIC KIDNEY DISEASE
9        36 CANCER OF THYROID
10       53 DISORDERS OF LIPID METABOLISM
Posted on 2021-06-17 18:03:41
Now it is complete; I had missed the second part.
You can use parse_number from readr to extract the number, and sub to get the substring after the "-".
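One point about the sub step is worth illustrating on its own: ".*" is greedy, so the pattern ".*-" removes everything through the last hyphen in the string. That gives the intended result for the sample data, but it would also cut into a diagnosis that itself contains a hyphen. The sketch below uses base R only. The string y is hypothetical and does not appear in the question's data, and the stricter pattern is an assumption about the format (no ")" before the one that closes the version text), not part of the answer.

```r
# x is the first GROUPER_NAME value from the question; y is a made-up value
# whose diagnosis contains a hyphen
x <- "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE"
y <- "EDG ICD HCUP CCS 999 (PREDICTIVE MODELS-VERSION 1.0)-NON-SPECIFIC EXAMPLE"

# Greedy pattern: strips everything up to and including the LAST "-"
greedy_x <- sub(".*-", "", x)
greedy_y <- sub(".*-", "", y)

# Stricter pattern: strips everything up to the first ")" when it is
# immediately followed by "-"
strict_x <- sub("^[^)]*\\)-", "", x)
strict_y <- sub("^[^)]*\\)-", "", y)

print(c(greedy_x, strict_x))
print(c(greedy_y, strict_y))
```

If diagnoses with hyphens can occur, the stricter pattern could be substituted for ".*-" in the mutate call that follows.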
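library(dplyr)
library(readr)
df %>%
  # parse_number() picks up the first number in each string
  mutate(category = parse_number(GROUPER_NAME), .before = GROUPER_NAME) %>%
  # drop everything up to and including the last "-"; .keep = "unused" then removes GROUPER_NAME
  mutate(diagnosis = sub(".*-", "", GROUPER_NAME), .keep = "unused")
Output: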
   GROUPER_ID category diagnosis
   <chr>         <dbl> <chr>
 1 9001742130      130 PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
 2 9001742138      138 ESOPHAGEAL DISORDERS
 3 9001742058       58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
 4 9001742062       62 COAGULATION AND HEMORRHAGIC DISORDERS
 5 9001742102      102 NONSPECIFIC CHEST PAIN
 6 9001742247      247 LYMPHADENITIS
 7 9001742055       55 FLUID AND ELECTROLYTE DISORDERS
 8 9001742158      158 CHRONIC KIDNEY DISEASE
 9 9001742036       36 CANCER OF THYROID
10 9001742053       53 DISORDERS OF LIPID METABOLISM
https://stackoverflow.com/questions/68024208