Editing a file and creating a new file in Python

Stack Overflow user
Asked on 2018-06-11 21:19:56
Answers: 1, views: 66, followers: 0, votes: -2

I have a large text file (delimited by "|"), like this small example:

>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|RNF14|132
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKE
>ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|HAS-0|HAS|99
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLS
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL

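Each record has two lines: the first is an ID line starting with ">", and the second is the character sequence that belongs to that ID line. Column 6 contains repeated names, and column 7 is the length of the sequence on the line after the ID line. For each repeated name in column 6 I want to keep only one record, chosen by column 7: the one with the greatest length. The expected output for this small example is: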

>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL

So each name (column 6) appears only once, selected by length (column 7). I tried the following Python code, but it does not work. Do you know how to fix it?

from __future__ import print_function
import sys

def parse_fasta(data):
    name, seq = None, []
    for line in data:
        line = line.rstrip()
        if line.startswith('>'):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))

isoforms = dict()
for defline, sequence in parse_fasta(sys.stdin):
    geneid = '.'.join(defline[1:].split('.')[:-1])
    if geneid in isoforms:
        otherdefline, othersequence = isoforms[geneid]
        if len(sequence) > len(othersequence):
            isoforms[geneid] = (defline, sequence)
    else:
        isoforms[geneid] = (defline, sequence)

for defline, sequence in isoforms.values():
    print(defline, sequence, sep='\n')

1 Answer

Stack Overflow user

Answered on 2018-06-11 21:36:37

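I suggest using Biopython rather than building your own parser. Note that I have also added a sanity check (in your example, the FASTA header line ending in 474 actually has a sequence of only 388 characters):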

from Bio import SeqIO

def yield_records():
    seen = set()
    for record in SeqIO.parse('in.fa', 'fasta'):
        header_seq_len = int(record.description.split('|')[-1])
        seq_len = len(record)
        if header_seq_len != seq_len:
            print('Warning: the seq length {} != that stated in the header {}'
                   .format(seq_len, header_seq_len))
        if header_seq_len not in seen:
            yield record
            seen.add(header_seq_len)

SeqIO.write(yield_records(), 'out.fa', 'fasta')
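
As written, the snippet above tracks the stated lengths in `seen` rather than the names in column 6, so records with different lengths are all written out. For the goal stated in the question (one record per column-6 name, the one with the largest column-7 value), a minimal standard-library sketch could look like the following. The function name `longest_per_gene` and the demo data are hypothetical, and it assumes every header has at least seven "|"-separated fields with an integer in the seventh.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

def longest_per_gene(records):
    """For each name in column 6, keep the record with the largest column 7.

    Hypothetical helper: assumes headers look like '>a|b|c|d|e|NAME|LENGTH'.
    """
    best = {}
    for header, seq in records:
        fields = header[1:].split('|')
        gene, length = fields[5], int(fields[6])
        # Replace the stored record only when this one is strictly longer.
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, header, seq)
    return [(header, seq) for _, header, seq in best.values()]

# Made-up demo input, much shorter than the real data.
demo = io.StringIO(
    ">t1|g1|x|y|A-001|A|5\nMSSED\n"
    ">t2|g1|x|y|A-002|A|3\nMSS\n"
    ">t3|g1|x|y|B-001|B|2\nMS\n"
    ">t4|g1|x|y|B-002|B|4\nMSSE\n"
)
for header, seq in longest_per_gene(parse_fasta(demo)):
    print(header, seq, sep='\n')
```

This compares the length stated in the header, as the question asks. Since the stated length can disagree with the actual sequence (as the sanity check above points out), comparing `len(seq)` instead may be the safer choice for real data.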
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50798709
