问如何在不使用Biopython的情况下找到FASTA数据集中的所有序列长度
EN

Stack Overflow用户

提问于 2021-10-01 20:19:51

回答 3查看 434关注 0票数 1

假设我们有一个这样的FASTA文件：

>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
>header2
ASFECAERVA
>header3
ACTGQSDFWGRWTFSH

这是我想要的输出：

header1 30
header2 10
header3 16

请不使用Biopython回答此问题。我认为这里可以使用re.match('^>')来区分标题行和其他序列行(需要先导入re )，但我需要一些帮助才能让rest部分获得输出。只使用python，不使用Biopython。谢谢!

下面是我当前的代码：

import re
path='/home/try.txt'
count=0
seq=''
with open(path) as d_g:
    for line in d_g:
        if re.match('^>',line):
            count=count+1
            print('head',count)
        else:
            seq=seq+line

python

bioinformatics

fasta

回答 3

Stack Overflow用户

发布于 2021-10-02 01:47:28

你真的不需要正则表达式来做这件事。

header = None
length = 0
with open('file.fasta') as fasta:
    for line in fasta:
        # Trim newline
        line = line.rstrip()
        if line.startswith('>'):
            # If we captured one before, print it now
            if header is not None:
                print(header, length)
                length = 0
            header = line[1:]
        else:
            length += len(line)
# Don't forget the last one
if length:
    print(header, length)

本文使用了非常基本的Python习惯用法，这些习惯用法在任何Python文本处理入门中都很容易掌握。简而言之，我们记住到目前为止看到的内容，并在开始记住新记录之前打印我们记住的内容(当我们看到标题行时就会发生这种情况)。一个常见的错误是，当我们到达文件的末尾时，忘记打印最后一条记录，当然，它没有新的标题行。

如果你能保证所有的序列都在一行上，这可以从根本上简化，但我想给出一个通用的解决方案。

票数 3

Stack Overflow用户

发布于 2021-10-02 01:48:55

collections.Counter()和一个简单的循环来拯救状态。

我也增加了你的测试数据，使其具有多行序列，这样我们就可以告诉你事情是正确的。

import io
import collections

# Using a stringio to emulate a file
data = io.StringIO("""
>header1
ASDTWQEREWQDDSFADFASDASDQWEFQW
DFASDASDQWEFQWASDTWQEREWQDDSFA
>header2
ASFECAERVA
>header3
>header4
ACTGQSDFWGRWTFSH
""".strip())


counter = collections.Counter()

header = None
for line in data:
    line = line.strip()  # deal with newlines
    if line.startswith(">"):
        header = line[1:]
        continue
    counter[header] += len(line)

print(counter)

这将打印出来

Counter({'header1': 60, 'header4': 16, 'header2': 10})

票数 1

Stack Overflow用户

发布于 2021-10-04 13:28:57

不要重复发明轮子。有许多专门的生物信息学工具可以处理fasta文件。

例如，使用EMBOSS包中的infoseq实用程序：

infoseq -auto -only -name -length -noheading file.fasta

打印：

header1        30     
header2        10     
header3        16

例如，使用conda安装EMBOSS

conda create --name emboss emboss

从手册页：

infoseq -help
   -only               boolean    [N] This is a way of shortening the command
                                  line if you only want a few things to be
                                  displayed. Instead of specifying:
                                  '-nohead -noname -noacc -notype -nopgc
                                  -nodesc'
                                  to get only the length output, you can
                                  specify
                                  '-only -length'
   -[no]heading        boolean    [Y] Display column headings
   -name               boolean    [@(!$(only))] Display 'name' column
   -length             boolean    [@(!$(only))] Display 'length' column