SRA (Sequence Read Archive): Run by NCBI (National Center for Biotechnology Information), it is a database that stores raw high-throughput sequencing (HTS) data, alignment information, and metadata. Almost all HTS data in published papers must be uploaded here, where it is stored in the compressed .sra file format.
ENA (European Nucleotide Archive): Run by EBI (European Bioinformatics Institute). Although it serves the same function as SRA, its richer annotations and friendlier website make it preferable. What's more, you can download fastq.gz files directly from it.
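First choice: Aspera Connect. If Aspera Connect cannot download the data, the prefetch function of sratoolkit is recommended. Avoid downloading with wget or curl where possible: they are slow and sometimes leave downloads incomplete.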
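- Aspera Connect is IBM's commercial high-speed file transfer software, but it can be used free of charge to download NCBI and EBI data. Speeds can reach 200-500 Mbps, and almost all sites exceed 10 Mbps.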
- If Aspera Connect cannot download the data, the prefetch function of sratoolkit is recommended.
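- Finally, fall back on the fastq-dump and sam-dump commands in sratoolkit. If fastq-dump's connection to the external servers is unstable, the wonderdump script from the Biostar Handbook is recommended.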
Warning: avoid using the wget or curl commands for downloading where possible.
For details, see this article.
First, go to the Aspera Connect site, choose the Linux version, and copy the link address (downloading it requires a proxy).
wget http://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
# extract the archive
tar zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
# install
bash aspera-connect-3.7.4.147727-linux-64.sh
# check the .aspera directory
cd # go to your home directory
ls -a # if you can see .aspera, the installation is OK
# add environment variable
echo 'export PATH=~/.aspera/connect/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
# back up the key to your home directory (it is needed later; without it an error is reported)
cp ~/.aspera/connect/etc/asperaweb_id_dsa.openssh ~/
# check help file
ascp --help
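To download an .sra file, the remote path goes through these directories in order: sra / sra-instant / reads / ByRun / sra / SRR / SRR### / SRR###$$$ / SRR###$$$.sra
For example, to download the file SRR949627.sra: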
ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR949/SRR949627/SRR949627.sra .
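The directory layout described above can be derived from the run accession itself. The following is a minimal sketch, assuming the layout holds exactly as listed; `sra_byrun_path` is a hypothetical helper name for illustration, not part of sratoolkit or Aspera.

```shell
# Build the NCBI "ByRun" path for a run accession, following the layout
# sra/sra-instant/reads/ByRun/sra/SRR/SRR###/SRR###$$$/SRR###$$$.sra
# (sra_byrun_path is a hypothetical helper, not an sratoolkit command)
sra_byrun_path() {
    local acc="$1" prefix group
    prefix=$(printf '%.3s' "$acc")   # first three characters, e.g. SRR
    group=$(printf '%.6s' "$acc")    # first six characters, e.g. SRR949
    printf '/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix" "$group" "$acc" "$acc"
}

sra_byrun_path SRR949627
# prints /sra/sra-instant/reads/ByRun/sra/SRR/SRR949/SRR949627/SRR949627.sra
```

The printed path can be appended to `anonftp@ftp-private.ncbi.nlm.nih.gov:` to form the remote argument of the ascp command.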
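ENA data is hosted at fasp.sra.ebi.ac.uk, and ENA's Aspera username is era-fasp. Note that ENA lets you download fastq.gz files directly, so there is no need to convert from .sra files. To find the address, search on ENA and copy the address of the fastq.gz file, or browse ENA's FTP site at ftp.sra.ebi.ac.uk. Note that the browsable site is ftp, not fasp.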
ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR949/SRR949627/SRR949627_1.fastq.gz .
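The ENA path in the command above can also be built from the accession. The sketch below covers the nine-character case shown here and, as an assumption taken from ENA's published file-layout rules rather than from this post, the extra zero-padded subdirectory used for longer accessions. `ena_fastq_dir` is a hypothetical helper name.

```shell
# Print the ENA fastq directory for a run accession.
# ena_fastq_dir is a hypothetical helper; the extra-subdirectory rule for
# accessions longer than nine characters is an assumption based on ENA's docs.
ena_fastq_dir() {
    local acc="$1" dir1 tail sub
    dir1=$(printf '%.6s' "$acc")          # first six characters, e.g. SRR949
    if [ "${#acc}" -le 9 ]; then
        printf '/vol1/fastq/%s/%s\n' "$dir1" "$acc"
        return 0
    fi
    tail=${acc#?????????}                 # whatever follows the first nine characters
    case ${#tail} in
        1) sub="00$tail" ;;               # e.g. 6  -> 006
        2) sub="0$tail" ;;                # e.g. 16 -> 016
        *) sub="$tail" ;;
    esac
    printf '/vol1/fastq/%s/%s/%s\n' "$dir1" "$sub" "$acc"
}

# Remote argument for the first mate of SRR949627, as used with ascp above
echo "era-fasp@fasp.sra.ebi.ac.uk:$(ena_fastq_dir SRR949627)/SRR949627_1.fastq.gz"
```

Whether a run has `_1`/`_2` files or a single fastq.gz depends on the library layout, so it is worth checking the file list on ENA before scripting many downloads.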
For example, to download the following data: https://www.ncbi.nlm.nih.gov/sra
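# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')
BiocManager::install('SRAdb')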
library(SRAdb)
srafile = getSRAdbFile()
con = dbConnect(SQLite(), srafile) # SQLite() comes from RSQLite, which SRAdb loads
library(GEOquery)
gse <- getGEO('GSE48138') # retrieves the GEO series for your GSE accession as a list
## see what is in there:
show(gse)
# There are 2 sets of samples for that ID
## what you want is a table with the SRR runs to download and some sample information:
## let's see what the first set contains:
df <- as.data.frame(gse[[1]])
head(df)
For example:
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP055%2FSRP055992/SRR1871481/SRR1871481.sra
For example, to download a single run:
prefetch -v SRR925811
Or download a batch of runs:
for i in `seq 48 62`;
do
prefetch SRR35899${i}
done
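You can also download several runs at once.
First find the page you want to download from, for example https://www.ncbi.nlm.nih.gov/sra. Then, in the upper right corner, click Send to > File, choose Accession List as the format, and save it to a file (the default name is SraAccList.txt). Then run: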
prefetch $(<SraAccList.txt)
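Passing the whole list in one call is convenient, but a stray blank line or Windows line ending in the saved file can trip it up. A more defensive variant might look like the sketch below: it filters the list first and keeps going when one run fails. `list_accessions` and `fetch_all` are hypothetical helper names, and the accession pattern (SRR, ERR, or DRR followed by digits) is an assumption about what the list contains.

```shell
# Print the run accessions found in an accession list, one per line.
# Blank lines, trailing carriage returns, and anything that does not look
# like an SRR/ERR/DRR run accession are skipped.
list_accessions() {
    local file="$1" line cr
    cr=$(printf '\r')
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%"$cr"}                  # drop a trailing carriage return
        case $line in
            [SED]RR[0-9]*)
                case ${line#???} in
                    *[!0-9]*) ;;            # non-digit after the prefix: skip
                    *) printf '%s\n' "$line" ;;
                esac ;;
        esac
    done < "$file"
}

# Run a downloader (prefetch by default) on every accession in the file,
# reporting failures on stderr instead of stopping at the first one.
fetch_all() {
    local file="$1" tool="${2:-prefetch}" acc failed=0
    for acc in $(list_accessions "$file"); do
        "$tool" "$acc" || { echo "failed: $acc" >&2; failed=$((failed + 1)); }
    done
    [ "$failed" -eq 0 ]
}

# Usage: fetch_all SraAccList.txt
```

Passing `echo` as the second argument (`fetch_all SraAccList.txt echo`) gives a dry run that only prints the accessions that would be downloaded.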
https://github.com/pepkit/geofetch