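COVID-19 has swept the globe, and we still do not know where the virus originated, nor do we have a specific cure. The most effective control measure remains isolating cases early in an outbreak and cutting off transmission. Take Beijing a little while ago: after a local outbreak, new cases were quickly brought down to zero, whereas the United States is adding tens of thousands of new cases every day… None of this would be possible without rapid, parallel testing technology.
At present, rapid pathogen tests fall into two broad classes according to what they detect: nucleic acids and antibodies. Large-scale rapid testing in public places tends to use the faster antibody-based methods (for example, colloidal gold assays). However, the presence of antibodies cannot fully tell whether a person is currently infected or has already acquired immunity, so a nucleic acid test is still needed for confirmation, using either target-specific PCR amplification or unbiased mNGS. Methods that are common in research, such as third-generation full-length sequencing or ELISA, are not suited to routine testing because of cost and turnaround time. It is also worth mentioning that the Cas13a-based SHERLOCKv2 technology has shown great promise in earlier virus detection work, and in February this year Feng Zhang's team released a protocol for detecting the novel coronavirus; I won't go into it here.
PCR (including multiplex PCR) hinges on designing primers for each detection target, and frankly that is a real technical barrier. mNGS sidesteps this hurdle, but the trade-off between sequencing depth (cost) and identification sensitivity and specificity is a scientific question worth exploring in its own right. Over the Dragon Boat Festival I sat in on part of the NGS developers' conference, where Dr. Shifu Chen presented the fastv pipeline. It sounded good, so let's take a quick look~
In one sentence: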
A Computational Toolset for Rapid Identification of SARS-CoV-2, other Viruses, and Microorganisms from Sequencing Data.
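That is actually the title of the preprint. It conveys two things: first, fastv is an analysis tool; second, beyond rapid detection of SARS-CoV-2, it also identifies microorganisms from other sequencing data, i.e. it is a data analysis pipeline for mNGS.
The workflow diagram is complicated, but getting started takes only two steps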
# this binary was compiled on CentOS, and tested on CentOS/Ubuntu
wget http://opengene.org/fastv/fastv
chmod a+x ./fastv
# make sure that SARS-CoV-2.kmer.fa and SARS-CoV-2.genomes.fa are in the ./data folder
./fastv -i testdata.fq.gz
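You can then browse the results directly in the report, including the identification call, read counts, and coverage.
For example, the SARS-CoV-2 identification result can be viewed in the shell output of
./fastv -i filename.fastq.gz -k SARS-CoV-2.kmer.fa -g SARS-CoV-2.genomes.fa
and in the HTML report.
Does the report look familiar? It looks a lot like fastp. Quality control is usually the first step in processing sequencing data, and fastv was developed by the author of fastp, so fastv's QC reuses fastp's approach. The input can be short reads (Illumina, BGI, etc.) or long reads (ONT, PacBio, etc.).
Common options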
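The option list describes long reads as being cut into segments (--long_read_threshold, --read_segment_len). Here is a minimal sketch of that idea, assuming simple non-overlapping chunking; the function name split_long_read is made up for illustration, and fastv's actual splitting strategy may differ.

```python
def split_long_read(seq, long_read_threshold=200, read_segment_len=100):
    """Return [seq] for a short read, else consecutive segments.

    Defaults mirror the values in the fastv help text; the chunking
    scheme itself is an assumption, not fastv's confirmed behavior.
    """
    if len(seq) < long_read_threshold:
        # Shorter than the threshold: treat as a short read, keep whole.
        return [seq]
    # Long read: cut into pieces of at most read_segment_len bases.
    return [seq[i:i + read_segment_len]
            for i in range(0, len(seq), read_segment_len)]


short_read = "ACGT" * 25   # 100 bases, stays in one piece
long_read = "ACGT" * 60    # 240 bases, gets segmented

print(len(split_long_read(short_read)))
print([len(s) for s in split_long_read(long_read)])
```

Once a long read is cut this way, each segment can go through the same k-mer matching as an ordinary short read.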
# Key options:
-i, --in1 read1 input file name (string [=])
-I, --in2 read2 input file name (string [=])
-o, --out1 file name to store read1 with on-target sequences (string [=])
-O, --out2 file name to store read2 with on-target sequences (string [=])
-c, --kmer_collection the unique k-mer collection file in fasta format, see an example: http://opengene.org/kmer_collection.fasta (string [=])
-k, --kmer the unique k-mer file of the detection target in fasta format. data/SARS-CoV-2.kmer.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])
-g, --genomes the genomes file of the detection target in fasta format. data/SARS-CoV-2.genomes.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])
-p, --positive_threshold the data is considered as POSITIVE, when its mean coverage of unique kmer >= positive_threshold (0.001 ~ 100). 0.1 by default. (float [=0.1])
-d, --depth_threshold For coverage calculation. A region is considered covered when its mean depth >= depth_threshold (0.001 ~ 1000). 1.0 by default. (float [=1])
-E, --ed_threshold If the edit distance of a sequence and a genome region is <=ed_threshold, then consider it a match (0 ~ 50). 8 by default. (int [=8])
--long_read_threshold A read will be considered as long read if its length >= long_read_threshold (100 ~ 10000). 200 by default. (int [=200])
--read_segment_len A long read will be splitted to read segments, with each <= read_segment_len (50 ~ 5000, should be < long_read_threshold). 100 by default. (int [=100])
--bin_size For coverage calculation. The genome is splitted to many bins, with each bin has a length of bin_size (1 ~ 100000), default 0 means adaptive. (int [=0])
--kc_coverage_threshold For each genome in the k-mer collection FASTA, report it when its coverage > kc_coverage_threshold. Default is 0.01. (double [=0.01])
--kc_high_confidence_coverage_threshold For each genome in the k-mer collection FASTA, report it as high confidence when its coverage > kc_high_confidence_coverage_threshold. Default is 0.9. (double [=0.9])
--kc_high_confidence_median_hit_threshold For each genome in the k-mer collection FASTA, report it as high confidence when its median hits > kc_high_confidence_median_hit_threshold. Default is 5. (int [=5])
-j, --json the json format report file name (string [=fastv.json])
-h, --html the html format report file name (string [=fastv.html])
-R, --report_title should be quoted with ' or ", default is "fastv report" (string [=fastv report])
-w, --thread worker thread number, default is 4 (int [=4])
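To make --positive_threshold more concrete, here is a toy sketch of one way to read "mean coverage of unique kmer": count how often each unique k-mer is hit by the reads (on either strand), average over all unique k-mers, and compare with the threshold. The function names and the tiny 5-mers are made up for illustration, and this is my reading of the help text rather than fastv's actual implementation.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    # Unknown bases are mapped to N.
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))


def mean_unique_kmer_coverage(reads, unique_kmers):
    """Total hits on unique k-mers divided by the number of unique k-mers.

    Assumes all k-mers in unique_kmers have the same length.
    """
    k = len(next(iter(unique_kmers)))
    hits = Counter()
    for read in reads:
        # Scan both strands, since a read may come from either one.
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                kmer = strand[i:i + k]
                if kmer in unique_kmers:
                    hits[kmer] += 1
    return sum(hits.values()) / len(unique_kmers)


def is_positive(reads, unique_kmers, positive_threshold=0.1):
    # 0.1 is the default quoted in the option list.
    return mean_unique_kmer_coverage(reads, unique_kmers) >= positive_threshold


toy_kmers = {"ACGGT", "GATTC", "CCTAG", "TTGAC"}   # hypothetical unique 5-mers
toy_reads = ["AAACGGTAA", "GGGAATCGG"]             # hypothetical reads

print(mean_unique_kmer_coverage(toy_reads, toy_kmers))
print(is_positive(toy_reads, toy_kmers))
```

In this toy, the first read hits ACGGT on the forward strand and the second hits GATTC on the reverse strand, so the mean coverage is 2 hits over 4 unique k-mers.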
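To detect other microorganisms, you need to specify the corresponding files (the k-mer and genome files, or a k-mer collection).
As the option list shows, fastv is built on a k-mer algorithm, so high-quality k-mers that can discriminate and confirm a species are the key to identification. For custom targets, you can pair it with UniqueKMER to build the k-mer library.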
# this binary was compiled on CentOS, and tested on CentOS/Ubuntu
wget http://opengene.org/uniquekmer/uniquekmer
chmod a+x ./uniquekmer
# simple example
./uniquekmer -f test.fasta
# 16-mer (i.e. ATCGATCGATCGATCG...)
./uniquekmer -f test.fasta -k 16
Common options
-f, --fasta FASTA input file name (string)
-o, --outdir Directory for output. Default is unique_kmers in the current directory. (string [=unique_kmers])
-k, --kmer The length k of k-mer (3~32), default 25 (int [=25])
-s, --spacing If a key with POS is recorded, then skip [POS+1...POS+spacing] to avoid too compact result (0~100). default 0 means no skipping. (int [=0])
-g, --genome_limit Process up to genome_limit genomes in the FASTA input file. Default 0 means no limit. This option is for DEBUG. (int [=0])
-r, --ref Reference genome FASTA file name. Specify this only when you want to filter out the unique k-mer that can be mapped to reference genome. (string [=])
-e, --edit_distance k-mer mapped to reference genome with edit distance <= edit_distance will be removed (0~16). 3 for default. (int [=3])
-?, --help print this message
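When building unique k-mers, the -r option filters out k-mers that can be mapped to the human reference genome (GRCh38), and -e (edit_distance, 3 by default) adjusts the matching threshold.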
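The idea of a "unique" k-mer, plus filtering against a reference, can be sketched in a few lines. This toy keeps the k-mers of one genome that appear in no other genome, then drops any that come too close to a reference sequence. It is not UniqueKMER's implementation: the function names are made up, and Hamming distance (substitutions only) stands in for the edit distance that the -e option describes.

```python
def kmers(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def unique_kmers(genomes, target, k, ref=None, max_mismatch=0):
    """k-mers found only in genomes[target], optionally filtered by ref.

    genomes: dict mapping genome name -> sequence.
    ref: optional reference sequence (stands in for e.g. the human genome).
    max_mismatch: simplified stand-in for the -e edit distance threshold.
    """
    others = set()
    for name, seq in genomes.items():
        if name != target:
            others |= kmers(seq, k)
    candidates = kmers(genomes[target], k) - others
    if ref is None:
        return candidates
    ref_windows = kmers(ref, k)
    # Keep a candidate only if every reference window is far enough away.
    return {c for c in candidates
            if all(hamming(c, w) > max_mismatch for w in ref_windows)}


toy_genomes = {"virusA": "ACGTACGGA", "virusB": "ACGTACCTT"}  # made-up sequences
toy_ref = "TTTACGGTTT"                                        # made-up reference

print(sorted(unique_kmers(toy_genomes, "virusA", 5)))
print(sorted(unique_kmers(toy_genomes, "virusA", 5, ref=toy_ref)))
```

The brute-force comparison against every reference window is only workable on toy data; a real genome would need an index or a proper aligner.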
Once the new k-mer collection file is built, pass it to fastv with the -c option.
./fastv -i filename.fastq.gz -c microbial.kc.fasta.gz
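If you have many samples, it can be handy to assemble the command line in a script. Below is a small sketch that builds a fastv call from the flags in the option list (-i, -I, -c, -j, -h, -w). The helper name build_fastv_command and the sample file names are hypothetical, and the command is only printed here rather than executed.

```python
import shlex


def build_fastv_command(read1, read2=None, kmer_collection=None,
                        json_report="fastv.json", html_report="fastv.html",
                        threads=4, binary="./fastv"):
    """Return the fastv command as an argument list (nothing is executed)."""
    cmd = [binary, "-i", read1]
    if read2:
        cmd += ["-I", read2]             # second file of a paired-end sample
    if kmer_collection:
        cmd += ["-c", kmer_collection]   # custom unique k-mer collection
    cmd += ["-j", json_report, "-h", html_report, "-w", str(threads)]
    return cmd


# Hypothetical paired-end sample with a custom k-mer collection.
cmd = build_fastv_command("sample_R1.fastq.gz", "sample_R2.fastq.gz",
                          kmer_collection="microbial.kc.fasta.gz")
print(shlex.join(cmd))
# To actually run it (requires the fastv binary to be present):
#   import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string means the command can be handed to subprocess.run without shell quoting problems.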
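Besides the HTML report, fastv also writes its results in JSON, so if you need to extract key information you can process that file directly.
In metagenomics, k-mer-based identification pipelines include SPINGO and Kraken2, and k-mer matching is a common sequence identification method in bioinformatics generally. In a validation on 27 SARS-CoV-2-positive and 25 negative samples, fastv achieved 100% specificity and 100% sensitivity, and it could distinguish SARS-CoV-2 from MERS and other coronaviruses; it also performed well in tests on Epstein-Barr virus (EBV), human papillomavirus (HPV), and hepatitis B virus (HBV).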
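A cautious first step with the JSON report is simply to see what it contains. The sketch below loads the file and lists its top-level keys with their types; it deliberately assumes nothing about fastv's field names, only that the top level is a JSON object. The helper name summarize_report is made up for illustration.

```python
import json
import os


def summarize_report(path):
    """Map each top-level key of a JSON report to the type name of its value."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    # Assumption: the report's top level is a JSON object (a dict in Python).
    return {key: type(value).__name__ for key, value in report.items()}


# fastv.json is the default report name given in the option list.
if os.path.exists("fastv.json"):
    for key, kind in summarize_report("fastv.json").items():
        print(key, kind)
```

Once you know which sections are present, you can pull out the fields you need from the loaded dictionary.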
Experimental results showed that fastv achieved 100% sensitivity and 100% specificity for detecting SARS-CoV-2 from sequencing data; and can distinguish SARS-CoV-2 from SARS, MERS, and other coronaviruses.
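All the data used in the tests above are available from https://github.com/OpenGene/fastv. The analysis is quite fast, so give it a try if you're interested.
That's it for now. See you next time!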
References
https://github.com/OpenGene/fastv
Shifu Chen, Changshou He, Yingqiang Li, Zhicheng Li, Charles E Melançon III. A Computational Toolset for Rapid Identification of SARS-CoV-2, other Viruses, and Microorganisms from Sequencing Data. bioRxiv 2020.05.12.092163.
FASTP | An ultra-fast, all-in-one FASTQ preprocessing tool
National Health Commission of the People's Republic of China. Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia (Trial Version 7) [EB/OL]. [2020-03-04].
Chiu C Y, Miller S A. Clinical metagenomics[J]. Nat Rev Genet, 2019, 20(6): 341-355.
Lieberman JA et al. Comparison of commercially available and laboratory developed assays for in vitro detection of SARS-CoV-2 in clinical laboratories. J Clin Microbiol 2020 Apr 29.
Multiple approaches for massively parallel sequencing of HCoV-19 (SARS-CoV-2) genomes directly from clinical samples. bioRxiv 2020.03.16.993584.