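No statistical methods were used to predetermine sample size. The experiments were not randomized, and the investigators were not blinded to allocation during experiments and outcome assessment.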
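We retrieved 13,133 sequencing runs classified as human gut metagenomes in the European Nucleotide Archive (ENA), comprising 75 different studies (Supplementary Table 1). Metadata for each individual sample (location, age, health state and antibiotic usage) were retrieved through the ENA API with mg-toolkit (https://pypi.org/project/mg-toolkit/) and by checking the publications associated with each project (when available). Samples were classified as being obtained from healthy individuals only if this was explicitly stated in their original study.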
Raw reads from each run were first assembled with SPAdes v.3.10.0 [20] with option --meta [21]. Thereafter, MetaBAT 2 [15] (v.2.12.1) was used to bin the assemblies using a minimum contig length threshold of 2,000 bp (option --minContig 2000) and default parameters. Depth of coverage required for the binning was inferred by mapping the raw reads back to their assemblies with BWA-MEM v.0.7.16 [45] and then calculating the corresponding read depths of each individual contig with samtools v.1.5 [46] ('samtools view -Sbu' followed by 'samtools sort') together with the jgi_summarize_bam_contig_depths function from MetaBAT 2. The quality score (QS) of each metagenome-assembled genome (MAG) was estimated with the CheckM lineage_wf workflow and calculated as: level of completeness − 5 × contamination. Ribosomal RNAs (rRNAs) were detected with the cmsearch function from INFERNAL v.1.1.2 [47] (options -Z 1000 --hmmonly --cut_ga) using the Rfam [48] covariance models of the bacterial 5S, 16S and 23S rRNAs. Total alignment length was inferred by the sum of all non-overlapping hits. Each gene was considered present if more than 80% of the expected sequence length was contained in the MAG. Transfer RNAs (tRNAs) were identified with tRNAscan-SE v.2.0 [49] using the bacterial tRNA model (option -B) and default parameters. Classification into high- and medium-quality MAGs was based on the criteria defined by the minimum information about a metagenome-assembled genome (MIMAG) standards [23] (high: >90% completeness and <5% contamination, presence of 5S, 16S and 23S rRNA genes, and at least 18 tRNAs; medium: ≥50% completeness and <10% contamination). Given that only 240 of the MAGs with >90% completeness and <5% contamination passed the MIMAG thresholds regarding the presence of rRNA and tRNA genes due to known issues relating to the assembly of rRNA regions [16,50], we refer to our highest quality MAGs as 'near complete' [16] instead.
VirFinder v.1.1 [51] was used to predict the presence of viral contigs within the 13,133 human gut assemblies generated with SPAdes. This tool uses a k-mer-based, machine-learning approach to detect distinguishing signatures between virus and host (prokaryotic) sequences. Expected P values for the presence of viral sequences were calculated for each contig with ≥5 kb length and subsequently corrected for multiple testing using the Benjamini–Hochberg method with an FDR threshold of 10%.
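The quality score and the MIMAG-based tiers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MagStats` container, the function names and the "low" label for MAGs that meet no tier are hypothetical and not part of the original pipeline, and the tRNA criterion is applied as a simple count of at least 18, as the text states it.

```python
from dataclasses import dataclass


@dataclass
class MagStats:
    """Hypothetical container for per-MAG statistics (names are assumptions)."""
    completeness: float   # percent completeness
    contamination: float  # percent contamination
    has_5s: bool          # 5S rRNA gene detected
    has_16s: bool         # 16S rRNA gene detected
    has_23s: bool         # 23S rRNA gene detected
    trna_count: int       # number of tRNAs identified


def quality_score(completeness: float, contamination: float) -> float:
    """QS = completeness - 5 x contamination, as defined in the text."""
    return completeness - 5.0 * contamination


def mimag_tier(mag: MagStats) -> str:
    """Assign a tier using the thresholds quoted in the text."""
    near_complete = mag.completeness > 90 and mag.contamination < 5
    has_rrnas = mag.has_5s and mag.has_16s and mag.has_23s
    if near_complete and has_rrnas and mag.trna_count >= 18:
        return "high"
    if near_complete:
        # Passes completeness/contamination but not the rRNA/tRNA criteria.
        return "near complete"
    if mag.completeness >= 50 and mag.contamination < 10:
        return "medium"
    return "low"  # hypothetical label for everything else
```

For example, a MAG with 95% completeness and 2% contamination has a QS of 85, and it is labelled "near complete" rather than "high" if any of the three rRNA genes is missing.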
Four reference databases were used to classify the set of MAGs recovered from the human gut assemblies: HR, RefSeq, GenBank and a collection of MAGs from public datasets. HR comprised a total of 2,468 high-quality genomes (>90% completeness, <5% contamination) retrieved from both the HMP catalogue (https://www.hmpdacc.org/catalog/) and the HGG [8]. From the RefSeq database, we used all the complete bacterial genomes available (n = 8,778) as of January 2018. In the case of GenBank, a total of 153,359 bacterial and 4,053 eukaryotic genomes (3,456 fungal and 597 protozoan genomes) deposited as of August 2018 were considered. Lastly, we surveyed 18,227 MAGs from the largest datasets publicly available as of August 2018 [13,16,17,18,19], including those deposited in the Integrated Microbial Genomes and Microbiomes (IMG/M) database [52]. For each database, the function 'mash sketch' from Mash v.2.0 [53] was used to convert the reference genomes into a MinHash sketch with default k-mer and sketch sizes. Then, the Mash distance between each MAG and the set of references was calculated with 'mash dist' to find the best match (that is, the reference genome with the lowest Mash distance). Subsequently, each MAG and its closest relative were aligned with dnadiff v.1.3 from MUMmer 3.23 [54] to compare each pair of genomes with regard to the fraction of the MAG aligned (aligned query, AQ) and ANI.
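The matching step above reduces to two small decisions: pick the reference with the lowest Mash distance, then judge the dnadiff result for that pair. The sketch below assumes the distances have already been parsed into a dictionary, and uses the 60% AQ and 95% ANI cut-offs that the text applies when defining unclassified MAGs. The function names are hypothetical.

```python
def best_match(distances: dict) -> tuple:
    """Return (reference, distance) for the lowest Mash distance.

    `distances` maps reference genome name -> Mash distance to one MAG
    (assumed to be parsed beforehand from 'mash dist' output).
    """
    return min(distances.items(), key=lambda item: item[1])


def is_classified(aligned_query_pct: float, ani_pct: float,
                  min_aq: float = 60.0, min_ani: float = 95.0) -> bool:
    """A MAG counts as matching its closest reference when AQ >= 60% and
    ANI >= 95%; otherwise it is treated as unclassified."""
    return aligned_query_pct >= min_aq and ani_pct >= min_ani
```

A MAG whose closest reference aligns over 75% of its length at 97.2% ANI would be treated as classified, whereas one with 55% AQ would not, whatever its ANI.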
To dereplicate the collection of unclassified bacterial MAGs (AQ <60% or ANI <95% against the target references), high-level similarity clusters were first generated with Mash [53]. In brief, a MinHash sketch was created for these genomes to perform an all-against-all comparison. Then, a hierarchical clustering was built from the Mash distance relationships and individual clusters were defined at a cut-off of 0.2. Each cluster was subsequently dereplicated with dRep v.2.2.2 [55] to extract the MAGs displaying the best quality and representing individual metagenomic species (MGS). dRep was run with options -pa 0.9 (primary cluster at 90%), -sa 0.95 (secondary cluster at 95%), -cm larger (coverage method: larger), -con 5 (contamination threshold of 5%). For the near-complete MAGs, the -nc parameter was set to 0.60 (coverage threshold of 60%), whereas for the medium-quality MAGs with a QS >50 this was changed to 0.30 (coverage threshold of 30%). The 2,468 HR genomes were also dereplicated into 956 representative species with dRep, using the criteria defined above for the near-complete MAGs. These included 553 species collected specifically from the human gut, referred to as HGR.
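To make the clustering step concrete, the sketch below groups genomes at a distance cut-off of 0.2. The text does not state which linkage method was used, so this example assumes single linkage, for which cutting the dendrogram at 0.2 is equivalent to finding connected components among pairs with distance ≤ 0.2. The function name and input layout are hypothetical.

```python
def cluster_at_cutoff(names, dist, cutoff=0.2):
    """Single-linkage clusters at a distance cut-off (linkage is an assumption).

    `names` is a list of genome identifiers; `dist` maps (a, b) pairs to
    their Mash distance. Returns a sorted list of sorted clusters.
    """
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair of genomes that is at least as close as the cut-off.
    for (a, b), d in dist.items():
        if d <= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())
```

With other linkage methods (for example average linkage) the resulting clusters can differ, so this should be read as an outline of the idea rather than a reproduction of the published clusters.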
Genes were predicted using Prodigal v.2.6.3 [56] (default single mode) and 40 universal core marker genes from each genome were extracted using specI v.1.0 [32]. Phylogenetic trees were built by concatenating and aligning the marker genes with MUSCLE v.3.8.31. Marker genes absent only from specific genomes were kept in the alignment as missing data. Maximum-likelihood trees were constructed using RAxML v.8.1.15 [57] with option -m PROTGAMMAAUTO. All phylogenetic trees were visualized in iTOL [58]. Phylogenetic diversity was quantified by the sum of branch lengths using the phytools R package [59].
Taxonomic classification of each MGS was performed with both CheckM and UniProtKB [29]. First, the function tree_qa from CheckM was used to infer the approximate phylogenetic placement of the MGS genome within the CheckM internal reference tree (which comprised 2,052 finished and 3,604 draft genomes). Those classified at least at the class rank were then compared with the taxonomic assignment deduced from protein alignments against UniProtKB (release 2018_04) using the blastp function of DIAMOND v.0.9.17.118 [60]. A positive hit at the species level was inferred if ≥60% of the proteins had ≥80% of the sequence aligned with an amino acid identity of ≥96%, based on previously reported thresholds [26,33]. Genomes within UniProtKB were presumed to represent cultured species if labelled with a full species name lacking any of the following terms: uncultured, sp. or bacterium. For those MGS without an assigned species (UMGS), a genus-level boundary was set with the following criteria, as previously defined [61]: at least 50% of the proteins with an e value less than 1 × 10⁻⁵, a sequence identity of more than 40% and a query coverage above 50%. In case the taxon predicted with UniProt was missing from the CheckM reference database, the full lineage was manually inspected to determine the most likely annotation. Owing to possible mislabelling of the UniProt entries, the CheckM taxonomic lineage was kept if there were incongruences between both classifications. Lastly, the positioning of the UMGS genomes within the HGR phylogenetic tree was used to resolve further inconsistencies or misclassifications.
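The species-level rule above can be written as a short check. The sketch assumes that, for one candidate species, each protein's best alignment has already been summarised as a pair of percentages (fraction of the sequence aligned, amino acid identity). The function name and input layout are hypothetical.

```python
def is_species_level_hit(hits, total_proteins,
                         min_fraction=0.60, min_aligned=80.0,
                         min_identity=96.0):
    """Apply the >=60% / >=80% / >=96% rule quoted in the text.

    `hits` is an iterable of (aligned_pct, identity_pct) tuples, one per
    protein with an alignment to the candidate species; `total_proteins`
    is the number of proteins predicted in the genome.
    """
    if total_proteins == 0:
        return False
    passing = sum(1 for aligned, identity in hits
                  if aligned >= min_aligned and identity >= min_identity)
    return passing / total_proteins >= min_fraction
```

The genus-level boundary described in the same paragraph could be checked in the same way by swapping in its thresholds (50% of proteins, identity above 40%, query coverage above 50%) and adding an e-value filter.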
A random subset of 1,000 metagenomes (Supplementary Table 1) was tested with two additional approaches to assess the reproducibility of the MAGs generated here. With one of the methods, metagenomes were assembled with MEGAHIT v.1.1.3 [24] and subsequently binned with MetaBAT 2, MetaBAT 1 and MaxBin v.2.2.4 [62]. A refinement step was then performed using the bin_refinement module from MetaWRAP v.1.0 [25] to combine and improve the results generated by the three binners. The second method involved a modified co-assembly approach, in which individual assemblies from the same study were first merged and dereplicated with CD-HIT v.4.7 [63] (cd-hit-est with option -c 0.99 defining a sequence identity threshold of 99%). Metagenomic datasets were then mapped to their merged, non-redundant assembly with BWA-MEM to obtain co-abundance information for binning with MetaBAT 2 (with option --minContig 2000). The resulting MAGs with a QS >50 obtained with each method were compared with the MAGs recovered with our main pipeline (single assemblies with SPAdes and binning with MetaBAT 2) using the same 1,000 datasets.
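To further evaluate the level of potential contamination of the reported MGS, we analysed the quality of the Mash clusters containing each MGS using the Matthews correlation coefficient (MCC). First, the average amino acid identity (AAI) of specific marker genes was analysed within and between Mash clusters using CompareM v.0.0.23 (https://github.com/dparks1134/comparem). To estimate the MCC, true positives, false negatives, false positives and true negatives were determined according to three different AAI thresholds: 90%, 95% and 97%. For each pairwise comparison, we counted a true positive when both MAGs belonged to the same cluster and had an AAI equal to or above the threshold. A false negative was counted if they belonged to the same cluster but had an AAI below the threshold, and a false positive when the genomes were contained in different clusters but their AAI was equal to or above the threshold. True negatives corresponded to genomes from different clusters with an AAI below the threshold. Thereafter, the MCC was calculated with the mcc function from the mltools [64] R package. Possible values range from −1 to 1, with 1 denoting perfect agreement between the Mash clustering and the marker gene AAI.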
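The pairwise bookkeeping behind the MCC can be illustrated as follows. The original analysis used the mcc function of the mltools R package. This Python sketch only mirrors the definitions given in the text, with hypothetical function names and an input layout (cluster labels per MAG, AAI per sorted pair) chosen for the example.

```python
from itertools import combinations
from math import sqrt


def confusion_counts(cluster_of, aai, threshold):
    """Count TP, FN, FP, TN over all MAG pairs for one AAI threshold.

    `cluster_of` maps MAG name -> Mash cluster label; `aai` maps a sorted
    (a, b) name pair -> average amino acid identity in percent.
    """
    tp = fn = fp = tn = 0
    for a, b in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[a] == cluster_of[b]
        similar = aai[(a, b)] >= threshold
        if same_cluster and similar:
            tp += 1
        elif same_cluster:
            fn += 1
        elif similar:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; returns 0.0 when the denominator
    is zero (a common convention, assumed here)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Running `confusion_counts` at each of the three thresholds (90, 95 and 97) and passing the counts to `mcc` gives one coefficient per threshold.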
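Functional prediction analyses were performed on the 1,952 UMGS and the dereplicated set of 553 HGR genomes. Predicted genes were first functionally characterized with InterProScan v.5.27-66.0 [36] with options -goterms and -pa. The presence of microbial biosynthetic gene clusters (BGCs) was inferred with antiSMASH 4 [35], using the option --knownclusterblast to determine the number of BGCs that matched those in the MIBiG repository. GO [39,40] annotations were derived for each gene based on InterPro (IPR) entries, and converted into Genome Properties (GPs) [37,38] using the assign_genome_properties.pl script from http://github.com/ebi-pf-team/genome-properties. GhostKOALA [42] was used to generate KO annotations of the protein-coding sequences. Differential abundance analyses of GO slim and KO term frequencies between UMGS and HGR genomes were performed with the compositional data analysis tool ALDEx2 [65]. Because we were evaluating genomes with different lengths and degrees of completeness, this method was used to account for differences in total gene counts. The aldex.clr function was used with 128 Monte Carlo instances sampled from a Dirichlet distribution to generate a probability distribution of each GO slim/KO term consistent with the observed data. These were subsequently converted to a distribution of log-ratios to account for the compositional nature of the data. The aldex.effect function was used to calculate the expected value of the difference between each group distribution (median log2 difference), the expected value of the pooled group variance (median log2 dispersion) and a standardized effect size of the difference in abundance for each GO/KO category. The effect size measure used is conceptually similar to Cohen's d, but is calculated from the distributions themselves rather than from summary statistics of those distributions, resulting in a metric that is relatively robust and efficient [66]. Lastly, aldex.ttest was used to perform non-parametric Wilcoxon rank-sum tests of the GO/KO frequencies between the two test groups (UMGS and HGR). GPs classified as 'yes', 'no' and 'partial' were converted to 2, 0 and 1, respectively, and those more prevalent among the UMGS genomes were determined with a two-tailed χ2 test. Expected P values from all statistical tests were corrected for multiple testing with the Benjamini–Hochberg method. A PCA was performed on the GP distribution of HGR and UMGS genomes with the FactoMineR [67] package. Separation according to phylum and genome type was assessed with the ANOSIM test based on Gower distances between GP profiles.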
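The Benjamini–Hochberg correction mentioned above is simple enough to show directly. The following is a generic, standard-library sketch of the procedure, not the code used in the study (which relied on R packages). The function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each P value is scaled by m / rank (rank 1 = smallest), and a running
    minimum from the largest rank downwards keeps the adjusted values
    monotonic.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

A term would then be called significant when its adjusted value falls below the chosen false discovery rate, for instance 0.1 for an FDR threshold of 10%.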
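Read classification of the 13,133 human gut metagenome datasets against the HR, RefSeq and UMGS genome collections was performed with sourmash v.2.0.0a4 [68]. Signature files were generated for both reference (FASTA) and query (FASTQ) files with 'sourmash compute --scaled 1000 -k 31 --track-abundance'. For each set of references, a lowest common ancestor database was created ('sourmash lca index --scaled 1000 -k 31'), with each genome representing a unique species lineage. Raw reads were then compared against each database with 'sourmash lca gather'. Species prevalence and abundance were determined with BWA-MEM, by assessing the level of genome coverage, mean read depth and evenness of depth to infer species presence. First, we calculated a depth penalty score and a variation penalty score, corresponding to the missing coverage (100% − genome coverage) multiplied by the log(mean depth) or by the coefficient of variation of depth (defined as the standard deviation of the read depth divided by the mean), respectively. These metrics allowed us to weigh coverage and depth simultaneously, as genomes with a high mean depth (or high depth variation) but poor coverage are less likely to be present in the sample than those with a similar coverage but lower read depth. The thresholds for determining genome presence were set at a minimum coverage of 60% and at a maximum of the 99th percentile of the depth and variation penalty scores (Extended Data Fig. 7). The relative abundance of each species was determined by the proportion of uniquely mapped and correctly paired reads (filtered with 'samtools view -q 1 -f 2') out of the total read count. Accumulation curves based on the number of UMGS detected per geographic region were bootstrapped ten times at each sampling interval. Asymptotic regression was performed with the SSasymp and nls functions from the R stats package [69].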
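The two penalty scores can be worked through on a toy depth profile. The sketch below makes several assumptions that the text does not specify: depths are given per base, a base counts as covered when its depth is above zero, the logarithm is natural, and the standard deviation is the population form. The function names and the way thresholds are passed in are hypothetical.

```python
from math import log
from statistics import mean, pstdev


def penalty_scores(depths):
    """Return (coverage_pct, depth_penalty, variation_penalty).

    depth_penalty     = missing coverage x log(mean depth)
    variation_penalty = missing coverage x coefficient of variation
    where missing coverage = 100 - coverage_pct.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d > 0)
    coverage_pct = 100.0 * covered / len(depths)
    mean_depth = mean(depths)
    if mean_depth == 0:
        # No reads mapped: the scores are undefined, zeros are placeholders.
        return 0.0, 0.0, 0.0
    missing = 100.0 - coverage_pct
    cv = pstdev(depths) / mean_depth  # coefficient of variation of depth
    return coverage_pct, missing * log(mean_depth), missing * cv


def is_present(coverage_pct, depth_penalty, variation_penalty,
               max_depth_penalty, max_variation_penalty, min_coverage=60.0):
    """Presence call: coverage of at least 60% and both penalties at or
    below caller-supplied maxima (the text sets these at the 99th
    percentile of the scores)."""
    return (coverage_pct >= min_coverage
            and depth_penalty <= max_depth_penalty
            and variation_penalty <= max_variation_penalty)
```

For a genome with eight of ten positions covered at depth 4, coverage is 80%, the coefficient of variation is 0.5, and the variation penalty is therefore 20 × 0.5 = 10.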
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
Custom scripts used to generate the data are available at https://github.com/finn-lab/mgs-gut.