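I have a two-column data frame. The first column is an identification number and the second is a compound name. However, the compounds in the second column are often duplicated (different forms of the same compound). I want to remove all of the duplicates and keep only the simplest form of each compound.
Here is the data frame: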
>NISTSpecR
NIST NAME
366620 Formic acid, TMS derivative
366765 2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative
342340 Acetic acid, TMS derivative
352374 Propanoic acid, TMS derivative
333858 Butyric Acid, TMS derivative
352377 Pentanoic acid, TMS derivative
24239 Hexanoic acid, TMS derivative
333733 Heptanoic acid, TMS derivative
352455 Oxalic acid, 2TMS derivative
414056 Succinic acid, monoethyl ester-, (TMS)
332809 Adipic acid, TMS derivative
30799 Pimelic acid, 2TMS derivative
292699 Suberic acid, 2TMS derivative
333874 Citric acid, 4TMS derivative
366657 Citric acid, 3TMS derivative
333513 (-)-Epinephrine, 3TMS derivative
16985 Epinephrine, (.beta.)-, 3TMS derivative
24795 Norepinephrine, (R)-, 5TMS derivative
332935 DL-Norepinephrine, 4TMS derivative
Here is its structure:
> str(NISTSpecR)
'data.frame': 154 obs. of 3 variables:
$ Spec: Factor w/ 239429 levels "1 0; 13 2; 14 27; 15 239; 16 3; 18 2; 26 3; 27 36; 28 32; 29 113; 30 9; 31 64; 32 9; 33 17; 34 17; 35 20; 36 1; 37 1; 41 8; 42 "| __truncated__,..: 23720 32791 3011 32175 12349 29069 193166 26108 28713 73845 ...
$ NIST: chr "366620" "366765" "342340" "352374" ...
$ NAME: Factor w/ 239430 levels "-4'-Dimethylamino-2'-(trimethylsilyl)acetanilide",..: 157152 39442 108436 210392 133148 199151 169386 168243 195800 229235 ...
I would like the final result to look like this:
>NISTSpecR
NIST NAME
366620 Formic acid, TMS derivative
342340 Acetic acid, TMS derivative
352374 Propanoic acid, TMS derivative
333858 Butyric Acid, TMS derivative
352377 Pentanoic acid, TMS derivative
24239 Hexanoic acid, TMS derivative
333733 Heptanoic acid, TMS derivative
352455 Oxalic acid, 2TMS derivative
414056 Succinic acid, monoethyl ester-, (TMS)
332809 Adipic acid, TMS derivative
30799 Pimelic acid, 2TMS derivative
292699 Suberic acid, 2TMS derivative
366657 Citric acid, 3TMS derivative
333513 (-)-Epinephrine, 3TMS derivative
24795 Norepinephrine, (R)-, 5TMS derivative
That is, only one entry per parent compound (formic acid, ...), and it needs to be the simplest version (the one with the fewest characters).
> dput(as.character(NISTSpecR$NAME))
c("Formic acid, TMS derivative", "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
"Acetic acid, TMS derivative", "Propanoic acid, TMS derivative",
"Butyric Acid, TMS derivative", "Pentanoic acid, TMS derivative",
"Hexanoic acid, TMS derivative", "Heptanoic acid, TMS derivative",
"Oxalic acid, 2TMS derivative", "Succinic acid, monoethyl ester-, (TMS)",
"Adipic acid, TMS derivative", "Pimelic acid, 2TMS derivative",
"Suberic acid, 2TMS derivative", "Citric acid, 4TMS derivative",
"Citric acid, 3TMS derivative", "Citric acid 3TMS", "Citric acid, ethyl ester, tri-TMS",
"Isocitric acid lactone, 2TMS derivative", "Glyoxylic acid, di-TMS",
"Pyruvic acid, TMS derivative", "Malic acid, 2TMS derivative",
"Malic acid 1-ethyl ester, 2TMS", "Malic acid, 4-ethyl ester, 2TMS",
"Malic acid, 3TMS derivative", "4-Hydroxybutanoic acid, 2TMS derivative",
"Prostaglandin A1, 2TMS derivative", "Prostaglandin A2, 2TMS derivative",
"Prostaglandin E2, 3TMS", "D-Arabinose, 4TMS derivative", "D-Xylose, 4TMS derivative",
"D-Lyxose, 4TMS derivative", "D-Ribose, 4TMS derivative", "D-Glucose, 5TMS derivative",
"D-Galactose, 5TMS derivative", "D-Mannose, 5TMS derivative",
"D-Allose, oxime (isomer 1), 6TMS derivative", "D-Allose, oxime (isomer 2), 6TMS derivative",
"D-Altrose, 5TMS derivative", "Dihydroxyacetone, 2TMS derivative",
"1,3-Dihydroxyacetone dimer, 4TMS derivative", "D-Fructose, 5TMS derivative",
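"D-Psicose, 5TMS derivative", "Sedoheptulose, 6TMS derivative",
"D-2-Deoxyribose, 3TMS derivative", "2-Deoxyribose, 3TMS derivative",
"L-Fucose, 4TMS derivative", "L-Rhamnose, (R,R,S,S)-, 4TMS derivative",
"L-Rhamnose, 4TMS derivative", "N-Acetyl-D-glucosamine, 4TMS derivative",
"D-Gluconic acid, 6TMS derivative", "Monoglyceride, 2TMS derivative",
"2-Lauroylglycerol, 2TMS derivative", "Glycerol, 3TMS derivative",
"Xylitol, 5TMS derivative", "D-Sorbitol, 6TMS derivative",
"D-Mannitol, 6TMS derivative", "Sucrose, 8TMS derivative",
"D-Lactose, (isomer 1), 8TMS derivative", "β-D-Lactose, (isomer 1), 8TMS derivative",
"D-Lactose, (isomer 2), 8TMS derivative", "β-D-Lactose, (isomer 2), 8TMS derivative",
"α-D-Lactose, 8TMS derivative", "α-D-Lactose, 8TMS derivative",
"β-Lactose, 8TMS derivative", "Lactose, 8TMS derivative",
"Maltose, 8TMS derivative, isomer 1", "Maltose, 8TMS derivative, isomer 2",
"Maltose, 8TMS derivative", "D-Trehalose, 7TMS derivative",
"Melibiose, 8TMS derivative", "L-Ornithine, 3TMS derivative",
"DL-Ornithine, 3TMS derivative", "DL-Ornithine, 4TMS derivative",
"L-Ornithine, 4TMS derivative", "L-Homoserine, 2TMS derivative",
"L-Citrulline, 3TMS derivative", "3-Iodo-L-tyrosine, 3TMS derivative",
"3-Aminoisobutyric acid, 3TMS derivative", "3-Aminoisobutyric acid, 3TMS derivative",
"3-Aminoisobutyric acid, 2TMS derivative", "D-Isoleucine, N-acetyl-, TMS derivative",
"L-Hydroxyproline, (E)-, 2TMS derivative", "L-Hydroxyproline, (E)-, 3TMS derivative",
"Hydroxyproline, 3TMS derivative", "3-Hydroxyproline, 3TMS derivative",
"L-Cystine, 4TMS derivative", "Ethanolamine, 3TMS derivative",
"Ethanolamine, 2TMS derivative", "3-Aminopropanol, TMS derivative",
"Putrescine, 4TMS derivative", "Histamine, 2TMS derivative",
"Histamine, 3TMS derivative", "Dopamine, 4TMS derivative",
"Dopamine, 3TMS derivative", "Serotonin, 4TMS derivative",
"Tyramine, 3TMS derivative", "Tyramine, TMS derivative",
"Tyramine, 2TMS derivative", "Phenethylamine, 2TMS derivative",
"1-Phenethylamine, TMS derivative", "Phenethylamine, TMS derivative",
"Biotin, 3TMS derivative", "16.beta,17.alpha-Estriol, 3TMS derivative",
"Estriol, 3TMS derivative", "16.alpha,17.alpha-Estriol, 3TMS derivative",
"16.beta,17.beta-Estriol, 3TMS derivative", "Estrone, TMS derivative",
"16-Estrone, TMS derivative", "Estrone, O-methyloxime, TMS derivative",
"Equilin, TMS derivative", "Equilenin, (14β)-, TMS derivative",
"Equilenin, TMS derivative", "2-Hydroxyestradiol, 3TMS derivative",
"Androsterone, (E)-, TMS derivative", "Dehydroepiandrosterone, (E)-, TMS derivative",
"5.beta-Dihydrotestosterone, TMS derivative", "5.α-Dihydrotestosterone, TMS derivative",
"Testosterone O-methyloxime, TMS derivative", "Testosterone, TMS derivative",
"Pregnenolone, TMS derivative", "Aldosterone, 2TMS derivative",
"Aldosterone, N-methoxy-tri-TMS derivative", "Corticosterone, di(O-methyloxime)",
"Deoxycholic acid, 2TMS derivative", "Deoxycholic acid, 3TMS derivative",
"Lithocholic acid, 2TMS derivative", "Cholesterol, TMS derivative",
"Desmosterol, TMS derivative", "Ergosterol, TMS derivative",
"Brassicasterol, TMS derivative", "Fucosterol, TMS derivative",
"Stigmasterol, TMS derivative", "Stigmasterol, TMS derivative",
"11-Deoxycortisol, di(O-methyloxime)", "Melatonin, 2TMS derivative",
"Epinephrine, 4TMS derivative", "L-Epinephrine, 4TMS derivative",
"Glycine, 3TMS derivative", "Glycine, TMS derivative",
"Glycine, 2TMS derivative", "Aspartic acid, 3TMS derivative",
"L-Aspartic acid, 3TMS derivative", "L-Aspartic acid, 2TMS derivative",
"L-Glutamic acid, 3TMS derivative", "(-)-Epinephrine, 3TMS derivative",
"Epinephrine, (.beta.)-, 3TMS derivative", "(-)-Epinephrine, 4TMS derivative",
"Norepinephrine, (R)-, 5TMS derivative", "DL-Norepinephrine, 4TMS derivative",
"Norepinephrine, (R)-, 4TMS derivative", "Cycloserine, 3TMS derivative",
"Cycloheximide, 2TMS derivative", "Chloramphenicol, 2TMS derivative",
"Chloramphenicol, 3TMS derivative")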
Thank you.
Posted on 2016-08-06 00:14:26
Based on your edit, I did the following. First, extract the words that end in one of the matching suffixes:
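library(stringr); library(magrittr)  # str_split()/str_extract() and the %>% pipe
nist <- as.character(NISTSpecR$NAME)  # assumed: the compound names from the question
parents <- str_split(nist, ",") %>%
  lapply(str_extract, "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)")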
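To see what that pattern picks up, here is a minimal sketch on four names taken from the question. It uses base R's regexpr() and regmatches() rather than stringr, and applies the pattern to the whole name instead of to each comma-separated piece, so it illustrates the regular expression only. The names nist_demo and parent_demo are made up for this example.

```r
# Four names copied from the question's dput() output.
nist_demo <- c("Formic acid, TMS derivative",
               "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
               "D-Glucose, 5TMS derivative",
               "Citric acid, 4TMS derivative")

# One letter, one or more lowercase letters, then one of the suffixes.
pattern <- "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)"

# regexpr() returns the position of the first match, or -1 if there is none.
m <- regexpr(pattern, nist_demo)

# Fill in the matched text; names without a match stay NA.
parent_demo <- rep(NA_character_, length(nist_demo))
parent_demo[m > 0] <- regmatches(nist_demo, m)

parent_demo
```

For the second name, the first word that ends in one of the listed suffixes is "Acetic", which is why the butoxyethoxy form ends up grouped with plain acetic acid.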
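Then, because some of these names contain more than one comma, record the positions of the non-NA matches in the list extract_indices, and concatenate those positions into the vector indvec: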
extract_indices <- parents %>%
lapply(function(x) which(!is.na(x)))
indvec <- do.call("c",extract_indices)
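Then loop over parents and, for each list element, pull out the piece that holds the parent compound. (Note that this indexing lines up only when each name yields exactly one match.)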
answer <- sapply(seq_along(parents),
function(i){
parents[[i]][indvec][i]
})
answer
[1] "Formic" "Acetic" "Acetic" "Propanoic" "Butyric"
[6] "Pentanoic" "Hexanoic" "Heptanoic" "Oxalic" "Succinic"
[11] "Adipic" "Pimelic" "Suberic" "Citric" "Citric"
[16] "Citric" "Citric" "Isocitric" "Glyoxylic" "Pyruvic"
[21] "Malic" "Malic" "Malic" "Malic" "Hydroxybutanoic"
[26] "Prostaglandin" "Prostaglandin" "Prostaglandin" "Arabinose" "Xylose"
[31] "Lyxose" "Ribose" "Glucose" "Galactose" "Mannose"
[36] "Allose" "Allose" "Altrose" "Dihydroxyacetone" "Dihydroxyacetone"
[41] "Fructose" "Psicose" "Sedoheptulose" "Deoxyribose" "Deoxyribose"
[46] "Fucose" "Rhamnose" "Rhamnose" "glucosamine" "Gluconic"
[51] "Glycerol" "Glycerol" "Glycerol" "Xylitol" "Sorbitol"
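and so on...
Now, given that you only want the shortest name for each match, first count the characters of every name in the original data. Then, for each extracted parent, pick the original name with the fewest characters among all names that share that parent.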
nchar_parent <- nchar(nist)
final <- vector(mode = "character", length(nist))
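for (i in seq_along(nist)) {
  # positions of all names that share the same parent as name i
  temp_matches <- which(answer == answer[i])
  # among those, the one with the fewest characters
  shortest <- temp_matches[which.min(nchar_parent[temp_matches])]
  final[i] <- nist[shortest]
}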
Your final answer looks like this:
[1] "Formic acid, TMS derivative" "Acetic acid, TMS derivative"
[3] "Acetic acid, TMS derivative" "Propanoic acid, TMS derivative"
[5] "Butyric Acid, TMS derivative" "Pentanoic acid, TMS derivative"
[7] "Hexanoic acid, TMS derivative" "Heptanoic acid, TMS derivative"
[9] "Oxalic acid, 2TMS derivative" "Succinic acid, monoethyl ester-, (TMS)"
[11] "Adipic acid, TMS derivative" "Pimelic acid, 2TMS derivative"
[13] "Suberic acid, 2TMS derivative" "Citric acid 3TMS"
[15] "Citric acid 3TMS" "Citric acid 3TMS"
[17] "Citric acid 3TMS" "Isocitric acid lactone, 2TMS derivative"
[19] "Glyoxylic acid, di-TMS" "Pyruvic acid, TMS derivative"
[21] "Malic acid, 2TMS derivative" "Malic acid, 2TMS derivative"
[23] "Malic acid, 2TMS derivative" "Malic acid, 2TMS derivative"
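The vector above still has one entry per original row. If the goal is the reduced data frame from the question, one possible follow-up is to sort by parent and name length and then drop the repeated parents. This is only a sketch on four rows from the question. The data frame demo and its parent column are hypothetical stand-ins for NISTSpecR and the answer vector.

```r
# Four rows copied from the question; `parent` stands in for `answer`.
demo <- data.frame(
  NIST = c("342340", "366765", "333874", "366657"),
  NAME = c("Acetic acid, TMS derivative",
           "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
           "Citric acid, 4TMS derivative",
           "Citric acid, 3TMS derivative"),
  parent = c("Acetic", "Acetic", "Citric", "Citric"),
  stringsAsFactors = FALSE
)

# Sort by parent, then by name length; order() leaves ties in their
# original row order, so equally long names keep the earlier row first.
ord <- order(demo$parent, nchar(demo$NAME))
sorted <- demo[ord, ]

# Keep the first (shortest) row of each parent.
result <- sorted[!duplicated(sorted$parent), c("NIST", "NAME")]
result
```

The two citric acid names here have the same length, so the earlier row (333874) is kept. If a different tie-break is wanted, for example the lowest TMS count, it would have to be added as another sort key.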
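Posted on 2016-08-06 00:02:27
If you only need the first part of the second column (the text before the first comma), you can use strsplit, which splits each name into pieces. Take the first piece of that result, then drop the rows of df that are duplicated on that computed column. The last statement is optional and removes the helper column again.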
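# Keep only the text before the first comma of each name
df$foo <- sapply(strsplit(as.character(df$NAME), ",", fixed = TRUE), `[`, 1)
df <- df[!duplicated(df$foo), ]  # keep the first row for each distinct prefix
df$foo <- NULL                   # optional: drop the helper column again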
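As a quick check of what this prefix-based approach does and does not merge, here is a small sketch on four rows from the question. It uses sub() to cut each name at its first comma, which should give the same prefix as the strsplit step. The names df_demo, first_part and kept are made up for this example.

```r
# Four rows copied from the question.
df_demo <- data.frame(
  NIST = c("333874", "366657", "333513", "16985"),
  NAME = c("Citric acid, 4TMS derivative",
           "Citric acid, 3TMS derivative",
           "(-)-Epinephrine, 3TMS derivative",
           "Epinephrine, (.beta.)-, 3TMS derivative"),
  stringsAsFactors = FALSE
)

# Everything from the first comma onward is removed.
first_part <- sub(",.*$", "", df_demo$NAME)

# Keep the first row seen for each distinct prefix.
kept <- df_demo[!duplicated(first_part), ]
kept
```

The two citric acid rows collapse into one, but "(-)-Epinephrine" and "Epinephrine" have different prefixes and both survive. The row that is kept is the first one encountered, not necessarily the shortest name.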
https://stackoverflow.com/questions/38792825