I have a multi-line .txt file that looks like this:
> X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033699.1|5|1),EXON(MODIFIER|||||FMR1||CODING|NR_033700.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|516|FMR1||CODING|NM_001185081.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|537|FMR1||CODING|NM_001185075.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|586|FMR1||CODING|NM_001185082.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|611|FMR1||CODING|NM_001185076.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|632|FMR1||CODING|NM_002024.5|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
> X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033699.1|5|1),EXON(MODIFIER|||||FMR1||CODING|NR_033700.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|516|FMR1||CODING|NM_001185081.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|537|FMR1||CODING|NM_001185075.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|586|FMR1||CODING|NM_001185082.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|611|FMR1||CODING|NM_001185076.1|5|1),NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|632|FMR1||CODING|NM_002024.5|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
Every field is tab-separated, and the fourth field contains several comma-separated entries. I know I can split it with tr , '\n' to get:
X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033699.1|5|1)
EXON(MODIFIER|||||FMR1||CODING|NR_033700.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|516|FMR1||CODING|NM_001185081.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|537|FMR1||CODING|NM_001185075.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|586|FMR1||CODING|NM_001185082.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|611|FMR1||CODING|NM_001185076.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|632|FMR1||CODING|NM_002024.5|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033699.1|5|1)
EXON(MODIFIER|||||FMR1||CODING|NR_033700.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|516|FMR1||CODING|NM_001185081.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|537|FMR1||CODING|NM_001185075.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|586|FMR1||CODING|NM_001185082.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|611|FMR1||CODING|NM_001185076.1|5|1)
NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|632|FMR1||CODING|NM_002024.5|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
But what I want is:
X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033699.1|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033700.1|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|516|FMR1||CODING|NM_001185081.1|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|537|FMR1||CODING|NM_001185075.1|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|586|FMR1||CODING|NM_001185082.1|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|611|FMR1||CODING|NM_001185076.1|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|632|FMR1||CODING|NM_002024.5|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033699.1|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033700.1|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|516|FMR1||CODING|NM_001185081.1|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|537|FMR1||CODING|NM_001185075.1|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|586|FMR1||CODING|NM_001185082.1|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|611|FMR1||CODING|NM_001185076.1|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
X 147010263 SNP NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aaA/aaC|K119N|632|FMR1||CODING|NM_002024.5|5|1) NA NA 13829.p1 A/A 46 A/A 83 A/C 17 A/C 0
Note that the start of the line (X 147010263, the chromosome and position) can also differ, for example 3 41278119 or 4 114275304.
How can I do this?
Thanks!
Posted on 2020-07-09 09:41:18
A pure bash solution could be:
#!/bin/bash
while IFS=$'\t' read -r f1 f2 f3 f4 rest; do
    IFS=, read -r -a items <<< "$f4"
    for item in "${items[@]}"; do
        printf "%s\t%s\t%s\t%s\t%s\n" "$f1" "$f2" "$f3" "$item" "$rest"
    done
done < input.txt
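Explanation:
The outer while loop reads lines until end of file is reached. IFS=$'\t' tells the read builtin to use the tab character as the field separator for the line being processed. The first four fields are assigned to the variables f1, f2, f3 and f4. The remaining fields, together with any tabs between them, are assigned to the variable rest (the variable names are not special; any valid name will do). The -r option is passed to read so that backslashes are not treated as escape characters.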
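As a minimal sketch of that splitting step on its own, the snippet below feeds a single line to read. The line content is a shortened, made-up stand-in for the real data, not part of the original answer.

```shell
#!/bin/bash
# Hypothetical sample line; annotation values are shortened for illustration.
line=$'X\t147010263\tSNP\tEXON(a),EXON(b)\tNA\t11161.p1\tNA'

# Split on tabs: the first four fields go to f1..f4, everything else to rest.
IFS=$'\t' read -r f1 f2 f3 f4 rest <<< "$line"

printf 'f4=%s\n' "$f4"       # the comma-separated fourth field, still intact
printf 'rest=%s\n' "$rest"   # remaining fields, with the tabs between them kept
```

One caveat worth keeping in mind: tab counts as IFS whitespace in bash, so consecutive tabs are treated as a single separator and empty fields would shift the columns. The sample data uses NA for missing values, so it should not be affected.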
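In the body of the while loop, the read builtin reads the contents of the variable f4, which holds the fourth field of the line being processed, splits it into fields using , as the separator, and assigns those fields to sequential indices of the array items (as indicated by the -a option). The construct command <<< string is called a here string (see Here Strings in the Bash Reference Manual).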
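The array split can be tried in isolation as follows. This is only an illustrative sketch; the value assigned to f4 is a hypothetical, shortened fourth field.

```shell
#!/bin/bash
# Hypothetical, shortened fourth field.
f4='EXON(a),EXON(b),NON_SYNONYMOUS_CODING(c)'

# IFS=, applies only to this one read command; the shell-wide IFS is untouched.
IFS=, read -r -a items <<< "$f4"

echo "${#items[@]}"   # number of elements in the array
echo "${items[1]}"    # second element (array indices start at 0)
```

Setting IFS as a prefix to read, rather than as a separate assignment, keeps the change local to that command, which is why the outer loop's tab splitting keeps working.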
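The inner for loop (sometimes called a for-each loop) processes each element of the array items in turn. "${items[@]}" expands each element of the array items to a separate word, and the words are assigned to the variable item one after another. The printf builtin works much like the printf function in the C standard library.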
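To see why the double quotes around "${items[@]}" matter, here is a small sketch with made-up elements, one of which contains a space. The count variable is added purely for illustration.

```shell
#!/bin/bash
# Hypothetical elements; the second one contains a space on purpose.
items=('EXON(a)' 'two words')

count=0
for item in "${items[@]}"; do
    # %s placeholders are filled in order, as with printf in C.
    printf '%s\t%s\n' "X" "$item"
    count=$((count + 1))
done
echo "$count"   # one iteration per array element
```

With the quotes, the loop body runs once per element; without them, the element containing a space would be split into two words.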
Posted on 2020-07-09 10:19:48
Using awk. A note about the start of the line (X 147010263): I assume the records do not actually start with > as the sample data shows.
$ awk '
BEGIN {
    FS=OFS="\t"                        # tab-delimited input and output
}
{
    n=split($4,a,/,/)                  # split the 4th field on commas
    for(i=1;i<=n;i++)                  # for each component of the 4th field
        for(j=1;j<=NF;j++)             # and for every field of the record
            printf "%s%s",(j==4?a[i]:$j),(j==NF?ORS:OFS)  # print, using a[i] in place of field 4
}' file
Start of the output:
X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033699.1|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
X 147010263 SNP EXON(MODIFIER|||||FMR1||CODING|NR_033700.1|5|1) NA 11161.p1 NA A/A 77 A/A 87 A/C 97 A/C 0
...
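A shorter variant of the same idea, offered only as a sketch and not as the answerer's method: since the components are saved in the array a before anything is modified, each one can be assigned back to $4 and the whole record printed. The input line here is a hypothetical, shortened record, and out is just a variable name chosen for the example.

```shell
#!/bin/bash
# Hypothetical one-line input with a two-entry fourth field.
out=$(printf 'X\t1\tSNP\tEXON(a),EXON(b)\tNA\t0\n' |
awk 'BEGIN { FS = OFS = "\t" }
{
    n = split($4, a, ",")      # save the components before touching $4
    for (i = 1; i <= n; i++) {
        $4 = a[i]              # assigning a field rebuilds $0 using OFS
        print
    }
}')

printf '%s\n' "$out"
```

This relies on awk rebuilding the record with OFS when a field is assigned, so OFS has to be set to a tab as well, as in the BEGIN block above.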
https://stackoverflow.com/questions/62811195