文章/答案/技术大牛

发布

社区首页 >问答首页 >利用Bash从fasta序列中提取每个基因的位置2-7

问利用Bash从fasta序列中提取每个基因的位置2-7
EN

Stack Overflow用户

提问于 2020-10-19 12:05:55

回答 3查看 505关注 0票数 2

我有一个文件，里面有一个基因it的子集，还有一个fasta文件，里面有所有的基因it和它们的序列。对于子集文件中的每个基因，我想从每个fasta序列开始得到2-7个位置。理想情况下，输出文件应该是'pos 2-7‘'\t’geneID‘。

示例子集：

mmu-let-7g-5p MIMAT0000121  
mmu-let-7i-5p MIMAT0000122

Fasta档案：

>mmu-let-7g-5p MIMAT0000121 
UGAGGUAGUAGUUUGUACAGUU
>mmu-let-7i-5p MIMAT0000122 
UGAGGUAGUAGUUUGUGCUGUU
>mmu-let-7f-5p MIMAT0000525 
UGAGGUAGUAGAUUGUAUAGUU

想要的产出：

GAGGUA   mmu-let-7g-5p MIMAT0000121
GAGGUA   mmu-let-7i-5p MIMAT0000122

第一部分(提取fasta序列的基因子集)，我已经做了使用grep -w -A 1 -f。不知道如何获得pos-2-7，并使输出看起来像现在使用Bash。

awk

command-line

fasta

回答 3

Stack Overflow用户

回答已采纳

发布于 2020-10-19 12:13:33

请您尝试以下，编写和测试只在GNU awk中显示的样本。

awk '
FNR==NR{
  a[$1]=$2
  next
}
/^>/{
  ind=substr($1,2)
}
/^>/ && (ind in a){
  found=1
  val=ind OFS a[ind]
  next
}
found{
  print substr($0,2,6) OFS val
  val=found=""
}
' gene fastafile

解释：添加了上面的详细说明。

awk '                               ##Starting awk program from here.
FNR==NR{                            ##Checking condition FNR==NR which will be TRUE when gene Input_file is being read.
  a[$1]=$2                          ##Creating array a with index of $1 and value of $2 here.
  next                              ##next will skip all further statements from here.
}
/^>/{                               ##Checking condition if line starts from > then do following.
  ind=substr($1,2)                  ##Creating ind which has substring from 2nd charcters to all values of first field.
}
/^>/ && (ind in a){                 ##Checking if line starts with > and ind is present in array a then do following.
  found=1                           ##Setting found to 1 here.
  val=ind OFS a[ind]                ##Creating val which has ind OFS and value of a with index of ind.
  next                              ##next will skip all further statements from here.
}
found{                              ##Checking condition if found is NOT NULL then do following.
  print substr($0,2,6) OFS val      ##Printing sub string from 2nd to 7th character OFS and val here.
  val=found=""                      ##Nullifying val and found here.
}
' gene fastafile                    ##Mentioning Input_file names here.

票数 4

Stack Overflow用户

发布于 2020-10-19 12:52:52

使用GNU awk进行测试，但我认为它将适用于任何awk

$ awk 'NR==FNR{a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='\n' OFS='\t' fasta.txt
GAGGUA  mmu-let-7g-5p MIMAT0000121
GAGGUA  mmu-let-7i-5p MIMAT0000122

将输入记录分隔符设置为awk

RS='>' FS='\n' OFS='\t'，输入字段分隔符设置为换行符，输出字段分隔符仅设置为第二个文件的选项卡(由于这些变量在第一个filename)

$1 in a{print substr($2, 2, 7), $1}之后分配，如果第一个字段作为数组a中的键存在，则打印所需的详细信息

)。

如果行尾有尾随空格字符，请使用：

$ awk 'NR==FNR{sub(/[[:space:]]+$/, ""); a[$0]; next}
       $1 in a{print substr($2, 2, 6), $1}
      ' gene.txt RS='>' FS='[[:space:]]*\n' OFS='\t' fasta.txt

票数 5

Stack Overflow用户

发布于 2020-10-19 12:43:15

另一只鹰：

$ awk '
{
    gsub(/ +$/,"")                 # clean trailing space from sample data 
}
NR==FNR {                          # process subset file as it is smaller
    a[$0]                          # hash keys
    next                        
}                                  # process fasta file
/^>/ && ((p=substr($0,2)) in a) {  # if string found in hash
    if(getline>0)                  # read next record
        print substr($0,2,6),p     # and print
}' subset fasta

输出：

GAGGUA mmu-let-7g-5p MIMAT0000121
GAGGUA mmu-let-7i-5p MIMAT0000122

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/64427001

复制

相似问题

问利用Bash从fasta序列中提取每个基因的位置2-7
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问利用Bash从fasta序列中提取每个基因的位置2-7EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问利用Bash从fasta序列中提取每个基因的位置2-7
EN