Removing near-duplicates from an R data frame

Stack Overflow user
Asked 2016-08-05 23:26:31
Answers: 2, Views: 103, Followers: 0, Votes: 0

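I have a data frame with two columns. The first column is an identification number and the second is a compound name. However, the compounds in column 2 are often repeated (different forms of the same compound). I want to remove all of the duplicates, keeping only the simplest form of each compound.

Here is the data frame: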

>NISTSpecR

     NIST                                                     NAME
   366620                              Formic acid, TMS derivative
   366765 2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative
   342340                              Acetic acid, TMS derivative
   352374                           Propanoic acid, TMS derivative
   333858                             Butyric Acid, TMS derivative
   352377                           Pentanoic acid, TMS derivative
    24239                            Hexanoic acid, TMS derivative
   333733                           Heptanoic acid, TMS derivative
   352455                             Oxalic acid, 2TMS derivative
   414056                   Succinic acid, monoethyl ester-, (TMS)
   332809                              Adipic acid, TMS derivative
    30799                            Pimelic acid, 2TMS derivative
   292699                            Suberic acid, 2TMS derivative
   333874                             Citric acid, 4TMS derivative
   366657                             Citric acid, 3TMS derivative
   333513                         (-)-Epinephrine, 3TMS derivative
    16985                  Epinephrine, (.beta.)-, 3TMS derivative
    24795                    Norepinephrine, (R)-, 5TMS derivative
   332935                       DL-Norepinephrine, 4TMS derivative

Here is its structure:

> str(NISTSpecR)

'data.frame':   154 obs. of  3 variables:
 $ Spec: Factor w/ 239429 levels "1 0; 13 2; 14 27; 15 239; 16 3; 18 2; 26 3; 27 36; 28 32; 29 113; 30 9; 31 64; 32 9; 33 17; 34 17; 35 20; 36 1; 37 1; 41 8; 42 "| __truncated__,..: 23720 32791 3011 32175 12349 29069 193166 26108 28713 73845 ...
 $ NIST: chr  "366620" "366765" "342340" "352374" ...
 $ NAME: Factor w/ 239430 levels "-4'-Dimethylamino-2'-(trimethylsilyl)acetanilide",..: 157152 39442 108436 210392 133148 199151 169386 168243 195800 229235 ...

I would like the final result to look like this:

>NISTSpecR

     NIST                                                     NAME
   366620                              Formic acid, TMS derivative
   342340                              Acetic acid, TMS derivative
   352374                           Propanoic acid, TMS derivative
   333858                             Butyric Acid, TMS derivative
   352377                           Pentanoic acid, TMS derivative
    24239                            Hexanoic acid, TMS derivative
   333733                           Heptanoic acid, TMS derivative
   352455                             Oxalic acid, 2TMS derivative
   414056                   Succinic acid, monoethyl ester-, (TMS)
   332809                              Adipic acid, TMS derivative
    30799                            Pimelic acid, 2TMS derivative
   292699                            Suberic acid, 2TMS derivative
   366657                             Citric acid, 3TMS derivative
   333513                         (-)-Epinephrine, 3TMS derivative
    24795                    Norepinephrine, (R)-, 5TMS derivative

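There should be only one entry for each parent compound (i.e. formic acid, ...), and it needs to be the simplest version (the one with the fewest characters).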

> dput(as.character(NISTSpecR$NAME))

c("Formic acid, TMS derivative", "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative", 
"Acetic acid, TMS derivative", "Propanoic acid, TMS derivative", 
"Butyric Acid, TMS derivative", "Pentanoic acid, TMS derivative", 
"Hexanoic acid, TMS derivative", "Heptanoic acid, TMS derivative", 
"Oxalic acid, 2TMS derivative", "Succinic acid, monoethyl ester-, (TMS)", 
"Adipic acid, TMS derivative", "Pimelic acid, 2TMS derivative", 
"Suberic acid, 2TMS derivative", "Citric acid, 4TMS derivative", 
"Citric acid, 3TMS derivative", "Citric acid 3TMS", "Citric acid, ethyl ester, tri-TMS", 
"Isocitric acid lactone, 2TMS derivative", "Glyoxylic acid, di-TMS", 
"Pyruvic acid, TMS derivative", "Malic acid, 2TMS derivative", 
"Malic acid 1-ethyl ester, 2TMS", "Malic acid, 4-ethyl ester, 2TMS", 
"Malic acid, 3TMS derivative", "4-Hydroxybutanoic acid, 2TMS derivative", 
"Prostaglandin A1, 2TMS derivative", "Prostaglandin A2, 2TMS derivative", 
"Prostaglandin E2, 3TMS", "D-Arabinose, 4TMS derivative", "D-Xylose, 4TMS derivative", 
"D-Lyxose, 4TMS derivative", "D-Ribose, 4TMS derivative", "D-Glucose, 5TMS derivative", 
"D-Galactose, 5TMS derivative", "D-Mannose, 5TMS derivative", 
"D-Allose, oxime (isomer 1), 6TMS derivative", "D-Allose, oxime (isomer 2), 6TMS derivative", 
"D-Altrose, 5TMS derivative", "Dihydroxyacetone, 2TMS derivative", 
"1,3-Dihydroxyacetone dimer, 4TMS derivative", "D-Fructose, 5TMS     derivative", 

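"D-Psicose, 5TMS derivative", "Sedoheptulose, 6TMS derivative", 
"D-2-Deoxyribose, 3TMS derivative", "2-Deoxyribose, 3TMS derivative", 
"L-Fucose, 4TMS derivative", "L-Rhamnose, (R,R,S,S)-, 4TMS derivative", 
"L-Rhamnose, 4TMS derivative", "N-Acetyl-D-glucosamine, 4TMS derivative", 
"D-Gluconic acid, 6TMS derivative", "Monoglyceride, 2TMS derivative", 
"Glycerol 2-laurate, 2TMS derivative", "Glycerol, 3TMS derivative", 
"Xylitol, 5TMS derivative", "D-Sorbitol, 6TMS derivative", "D-Mannitol, 6TMS derivative", 
"Sucrose, 8TMS derivative", "D-Lactose, (isomer 1), 8TMS derivative", 
"β-D-Lactose, (isomer 1), 8TMS derivative", "D-Lactose, (isomer 2), 8TMS derivative", 
"β-D-Lactose, (isomer 2), 8TMS derivative", "α-D-Lactose, 8TMS derivative", 
"α-D-Lactose, 8TMS derivative", "β-Lactose, 8TMS derivative", 
"Lactose, 8TMS derivative", "Maltose, 8TMS derivative, isomer 1", 
"Maltose, 8TMS derivative, isomer 2", "Maltose, 8TMS derivative", 
"D-Trehalose, 7TMS derivative", "Melibiose, 8TMS derivative", 
"L-Ornithine, 3TMS derivative", "DL-Ornithine, 3TMS derivative", 
"DL-Ornithine, 4TMS derivative", "L-Ornithine, 4TMS derivative", 
"L-Homoserine, 2TMS derivative", "L-Citrulline, 3TMS derivative", 
"3-Iodo-L-tyrosine, 3TMS derivative", "3-Aminoisobutyric acid, 3TMS derivative", 
"3-Aminoisobutyric acid, 3TMS derivative", "3-Aminoisobutyric acid, 2TMS derivative", 
"D-Isoleucine, N-acetyl-, TMS derivative", "L-Hydroxyproline, (E)-, 2TMS derivative", 
"L-Hydroxyproline, (E)-, 3TMS derivative", "Hydroxyproline, 3TMS derivative", 
"3-Hydroxyproline, 3TMS derivative", "L-Cystine, 4TMS derivative", 
"Ethanolamine, 3TMS derivative", "Ethanolamine, 2TMS derivative", 
"3-Aminopropanol, TMS derivative", "Putrescine, 4TMS derivative", 
"Histamine, 2TMS derivative", "Histamine, 3TMS derivative", "Dopamine, 4TMS derivative", 
"Dopamine, 3TMS derivative", "Serotonin, 4TMS derivative", "Tyramine, 3TMS derivative", 
"Tyramine, TMS derivative", "Tyramine, 2TMS derivative", "Phenethylamine, 2TMS derivative", 
"1-Phenylethylamine, TMS derivative", "Phenethylamine, TMS derivative", 
"Biotin, 3TMS derivative", "16.beta,17.alpha-Estriol, 3TMS derivative", 
"Estriol, 3TMS derivative", "16.alpha,17.alpha-Estriol, 3TMS derivative", 
"16.beta,17.beta-Estriol, 3TMS derivative", "Estrone, TMS derivative", 
"16-Estrone, TMS derivative", "Estrone, O-methyloxime, TMS derivative", 
"Equilin, TMS derivative", "Equilenin, (14β)-, TMS derivative", 
"Equilenin, TMS derivative", "2-Hydroxyestradiol, 3TMS derivative", 
"Androsterone, (E)-, TMS derivative", "Dehydroepiandrosterone, (E)-, TMS derivative", 
"5.beta-Dihydrotestosterone, TMS derivative", "5.alpha-Dihydrotestosterone, TMS derivative", 
"Testosterone O-methyloxime, TMS derivative", "Testosterone, TMS derivative", 
"Pregnenolone, TMS derivative", "Aldosterone, 2TMS derivative", 
"Aldosterone, N-methoxy-tri-TMS derivative", "Corticosterone, di(O-methyloxime)", 
"Deoxycholic acid, 2TMS derivative", "Deoxycholic acid, 3TMS derivative", 
"Lithocholic acid, 2TMS derivative", "Cholesterol, TMS derivative", 
"Desmosterol, TMS derivative", "Ergosterol, TMS derivative", "Brassicasterol, TMS derivative", 
"Fucosterol, TMS derivative", "Stigmasterol, TMS derivative", "Stigmasterol, TMS derivative", 
"11-Deoxycortisol, di(O-methyloxime)", "Melatonin, 2TMS derivative", 
"Epinephrine, 4TMS derivative", "L-Epinephrine, 4TMS derivative", 
"Glycine, 3TMS derivative", "Glycine, TMS derivative", "Glycine, 2TMS derivative", 
"Aspartic acid, 3TMS derivative", "L-Aspartic acid, 3TMS derivative", 
"L-Aspartic acid, 2TMS derivative", "L-Glutamic acid, 3TMS derivative", 
"(-)-Epinephrine, 3TMS derivative", "Epinephrine, (.beta.)-, 3TMS derivative", 
"(-)-Epinephrine, 4TMS derivative", "Norepinephrine, (R)-, 5TMS derivative", 
"DL-Norepinephrine, 4TMS derivative", "Norepinephrine, (R)-, 4TMS derivative", 
"Cycloserine, 3TMS derivative", "Cycloheximide, 2TMS derivative", 
"Chloramphenicol, 2TMS derivative", "Chloramphenicol, 3TMS derivative")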

Thanks.


2 Answers

Stack Overflow user

Accepted answer

Posted 2016-08-06 00:14:26

Based on your edit, I did the following. First, extract the words with matching suffixes:

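library(stringr); library(magrittr)   # str_split(), str_extract(), %>%
nist <- as.character(NISTSpecR$NAME)  # compound names from the question
parents <- str_split(nist, ",") %>% 
  lapply(str_extract, "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)")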

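Then, because some of the names contain more than one comma, find the positions of the non-NA matches in each list element, store them in the list extract_indices, and collect those positions into the vector indvec: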

extract_indices <- parents %>% 
  lapply(function(x) which(!is.na(x)))
indvec <- do.call("c",extract_indices)

Then loop over parents and, for each list element, pull out the parent compound name that was found:

answer <- sapply(seq_along(parents),
       function(i){
         parents[[i]][indvec][i]
       })

   answer

  [1] "Formic"                 "Acetic"                 "Acetic"                 "Propanoic"              "Butyric"               
  [6] "Pentanoic"              "Hexanoic"               "Heptanoic"              "Oxalic"                 "Succinic"              
 [11] "Adipic"                 "Pimelic"                "Suberic"                "Citric"                 "Citric"                
 [16] "Citric"                 "Citric"                 "Isocitric"              "Glyoxylic"              "Pyruvic"               
 [21] "Malic"                  "Malic"                  "Malic"                  "Malic"                  "Hydroxybutanoic"       
 [26] "Prostaglandin"          "Prostaglandin"          "Prostaglandin"          "Arabinose"              "Xylose"                
 [31] "Lyxose"                 "Ribose"                 "Glucose"                "Galactose"              "Mannose"               
 [36] "Allose"                 "Allose"                 "Altrose"                "Dihydroxyacetone"       "Dihydroxyacetone"      
 [41] "Fructose"               "Psicose"                "Sedoheptulose"          "Deoxyribose"            "Deoxyribose"           
 [46] "Fucose"                 "Rhamnose"               "Rhamnose"               "glucosamine"            "Gluconic"              
 [51] "Glycerol"               "Glycerol"               "Glycerol"               "Xylitol"                "Sorbitol"              

And so on...
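The indexing through `indvec` only lines up when every name yields exactly one match. A minimal sketch of a more defensive variant, assuming the same suffix regex and a small hypothetical subset of the names (`nist_demo`, `suffix_pattern`, and `first_match` are illustrative names, not part of the original answer), takes the first non-NA match per name instead:

```r
library(stringr)

# Hypothetical subset of the compound names from the question
nist_demo <- c("Formic acid, TMS derivative",
               "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
               "Citric acid, 4TMS derivative",
               "Norepinephrine, (R)-, 5TMS derivative")

# Same suffix list as the answer, with an explicit letter class
suffix_pattern <- "[A-Za-z][a-z]+(ine|ol|in|ose|ic|one|ide)"

# Extract a candidate from each comma-separated piece, keep the first hit
first_match <- function(pieces) {
  hits <- str_extract(pieces, suffix_pattern)
  hits[!is.na(hits)][1]   # NA when no piece matches
}

answer_demo <- vapply(str_split(nist_demo, ","), first_match, character(1))
print(answer_demo)
```

Names without any matching suffix come back as NA here rather than shifting the positions of the following entries.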

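Now, since you only want the shortest name in each group of matches, first count the characters of each name in the original data. Then, for each extracted parent name, pick the original name with the fewest characters among the rows that share it.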

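nchar_parent <- nchar(nist)   # length of each original name
final <- vector(mode = "character", length(nist))
for (i in seq_along(nist)) {
  # rows that share the same extracted parent name as row i
  temp_matches <- which(answer == answer[i])
  # among those rows, the one with the shortest original name
  shortest <- temp_matches[which.min(nchar_parent[temp_matches])]
  final[i] <- nist[shortest]
}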

Your final answer looks like this:

[1] "Formic acid, TMS derivative"                  "Acetic acid, TMS derivative"                 
  [3] "Acetic acid, TMS derivative"                  "Propanoic acid, TMS derivative"              
  [5] "Butyric Acid, TMS derivative"                 "Pentanoic acid, TMS derivative"              
  [7] "Hexanoic acid, TMS derivative"                "Heptanoic acid, TMS derivative"              
  [9] "Oxalic acid, 2TMS derivative"                 "Succinic acid, monoethyl ester-, (TMS)"      
 [11] "Adipic acid, TMS derivative"                  "Pimelic acid, 2TMS derivative"               
 [13] "Suberic acid, 2TMS derivative"                "Citric acid 3TMS"                            
 [15] "Citric acid 3TMS"                             "Citric acid 3TMS"                            
 [17] "Citric acid 3TMS"                             "Isocitric acid lactone, 2TMS derivative"     
 [19] "Glyoxylic acid, di-TMS"                       "Pyruvic acid, TMS derivative"                
 [21] "Malic acid, 2TMS derivative"                  "Malic acid, 2TMS derivative"                 
 [23] "Malic acid, 2TMS derivative"                  "Malic acid, 2TMS derivative"       
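`final` still has one entry per input row, so names from the same parent repeat. One way to get from there to the table the question asks for, sketched on a small hypothetical data frame (`demo`, `final_demo`, `keep`, and `result` are illustrative stand-ins, with `final_demo` written out by hand), is to keep only the rows whose name was selected:

```r
# Hypothetical stand-in for NISTSpecR, using rows shown in the question
demo <- data.frame(
  NIST = c("366620", "366765", "342340", "333874", "366657"),
  NAME = c("Formic acid, TMS derivative",
           "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
           "Acetic acid, TMS derivative",
           "Citric acid, 4TMS derivative",
           "Citric acid, 3TMS derivative"),
  stringsAsFactors = FALSE
)

# Hand-written stand-in for the `final` vector: the shortest name per parent,
# with ties resolved in favour of the first row (as which.min() does)
final_demo <- c("Formic acid, TMS derivative",
                "Acetic acid, TMS derivative",
                "Acetic acid, TMS derivative",
                "Citric acid, 4TMS derivative",
                "Citric acid, 4TMS derivative")

# Keep rows whose NAME was selected; !duplicated() guards against
# the same name appearing on several rows
keep <- demo$NAME %in% unique(final_demo) & !duplicated(demo$NAME)
result <- demo[keep, ]
print(result)
```

When two names of the same parent have equal length, this keeps whichever comes first in the data, which may differ from a hand-picked preference.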
Votes: 1

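Stack Overflow user

Posted 2016-08-06 00:02:27

If you only need the first part of the second column (before the comma), you can use strsplit() to split the second column into pieces and keep the first piece of each name. You can then drop the duplicated rows of df based on that computed column. The last statement (optional) removes the helper column again.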

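# first piece of each NAME, i.e. everything before the first comma
df$foo <- vapply(strsplit(as.character(df$NAME), ",", fixed = TRUE), `[`, character(1), 1)
df <- df[!duplicated(df$foo), ]   # keep the first row for each prefix
df$foo <- NULL                    # drop the helper column (optional)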
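Note that `duplicated()` keeps the first row seen for each prefix, not the shortest name. A minimal sketch of a variant that sorts by name length before dropping duplicates, assuming the same two-column layout (the data frame `df_demo`, its placeholder IDs, and the helper column `prefix` are hypothetical):

```r
# Hypothetical data: the longer Malic acid entry comes first on purpose
df_demo <- data.frame(
  NIST = c("id1", "id2", "id3"),   # placeholder IDs, not real NIST numbers
  NAME = c("Malic acid, 4-ethyl ester, 2TMS",
           "Malic acid, 2TMS derivative",
           "Formic acid, TMS derivative"),
  stringsAsFactors = FALSE
)

# Everything before the first comma serves as the grouping key
df_demo$prefix <- sub(",.*$", "", df_demo$NAME)

# Sort shortest-first so duplicated() keeps the shortest name per prefix
sorted <- df_demo[order(nchar(df_demo$NAME)), ]
shortest <- sorted[!duplicated(sorted$prefix), ]

# Restore the original row order and drop the helper column
shortest <- shortest[order(as.integer(rownames(shortest))), ]
shortest$prefix <- NULL
print(shortest)
```

Grouping on the text before the first comma is cruder than the suffix regex in the accepted answer: a name such as "Citric acid 3TMS", which has no comma, would form its own group.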
Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/38792825
