我有一个很大的文本文件("|“分隔),就像下面这个小例子:
>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|RNF14|132
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKE
>ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|HAS-0|HAS|99
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLS
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL
第一行是以"<"
开头的ID行,第二行是属于上述ID行的字符序列。看第六列,有重复的名字,第七个长度是ID (字符序列)之后的行的长度。我想根据第7列选择每个ID行的一个重复,这意味着长度最长的ID。这个小示例的预期输出是:
>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL
因此,根据长度(即column 7.
),每个Python行有一次重复(查看column 6
)。我在Python语言中尝试了以下代码,但它不起作用。你知道怎么修吗?
from __future__ import print_function
import sys
def parse_fasta(data):
name, seq = None, []
for line in data:
line = line.rstrip()
if line.startswith('>'):
if name:
yield (name, ''.join(seq))
name, seq = line, []
else:
seq.append(line)
if name:
yield (name, ''.join(seq))
isoforms = dict()
for defline, sequence in parse_fasta(sys.stdin):
geneid = '.'.join(defline[1:].split('.')[:-1])
if geneid in isoforms:
otherdefline, othersequence = isoforms[geneid]
if len(sequence) > len(othersequence):
isoforms[geneid] = (defline, sequence)
else:
isoforms[geneid] = (defline, sequence)
for defline, sequence in isoforms.values():
print(defline, sequence, sep='\n')
发布于 2018-06-11 21:36:37
我建议您使用Biopython,而不是构建自己的解析器。注意,我还添加了一个健全性检查(在您的示例中,以474
结尾的FASTA标题行实际上只有388
长度的序列):
from Bio import SeqIO
def yield_records():
seen = set()
for record in SeqIO.parse('in.fa', 'fasta'):
header_seq_len = int(record.description.split('|')[-1])
seq_len = len(record)
if header_seq_len != seq_len:
print('Warning: the seq length {} != that stated in the header {}'
.format(seq_len, header_seq_len))
if header_seq_len not in seen:
yield record
seen.add(header_seq_len)
SeqIO.write(yield_records(), 'out.fa', 'fasta')
https://stackoverflow.com/questions/50798709
复制相似问题