Barcoding of individual DNA molecules



What is a barcode?

We begin with reviewing different embodiments of this concept that apply to the short-read platforms, which currently comprise most of the NGS market. In the subsequent section, we discuss consensus sequencing approaches that apply to the commercially available long-read platforms that rely on direct sequencing of single DNA molecules.

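Over the past years, NGS has matured and its throughput has kept rising, but its accuracy has barely improved, and on some high-throughput platforms sequencing accuracy has even declined. Given that certain errors at the biochemical level are unavoidable, around 2009 an innovative approach emerged that improves sequencing accuracy by identifying and filtering out erroneous information rather than by trying to prevent errors altogether.

This approach came to be called single-molecule consensus sequencing, tag-based error correction or molecular barcoding. It quickly became the new standard for high-accuracy NGS applications.

The key tag in this approach is called the barcode.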

Attaching a barcode to individual DNA molecules

image-20200218000037650


UMI-tailed adapters can be ligated to a library to uniquely mark each single strand.

Despite both strands in a complex being tagged, no means are provided to relate the consensus of one strand to that of its mate for comparison, and early PCR errors (triangles) may go unrecognized.
  • During conventional short-read platform NGS, a DNA library is typically PCR amplified before sequencing.

  • It is often impossible to definitively know whether two identical sequence reads arose from copies of the same starting molecule or from two independent molecules.

  • However, if a unique tag (that is, a molecular barcode) is applied to each molecule before amplification, this label will be propagated to all derivative copies and independent sequence reads can thus be recognized as having arisen from a common founder.

  • It is worth noting that the concept of a molecular barcode (also known as a unique molecular identifier (UMI), a single-molecule identifier (SMI) or simply a tag) is different from that of an index sequence.

  • Molecular barcodes serve to uniquely label individual molecules within a sample, whereas index sequences are identical DNA labels that are affixed to all molecules in a given sample for the purpose of sample multiplexing.

  • Molecular barcodes can be used to improve the accuracy of counting DNA or RNA molecules in mixtures by eliminating biases from variable amplification.

  • More importantly, because all identically tagged reads will have derived from a common founder (provided that barcodes are designed carefully), any variation between their actual sequences must necessarily reflect technical errors.

  • Tag-based error correction relies on this principle:

    • independent reads sharing a common tag are recognized and grouped as amplicon copies of the same starting molecule;
    • any sites of sequence differences among the reads are discounted as errors when forming a consensus sequence (FIG. 2). [Note: the consensus sequence is the one derived from the reads that share a common barcode.]
  • A fundamental element of the approach is the need to intentionally produce and sequence redundant molecular copies, which requires relatively higher raw sequencing depth than conventional NGS and, thus, additional costs.

  • Molecular barcodes come in two forms: exogenous and endogenous.

  • Exogenous barcodes entail random or semi-random artificial sequences that are incorporated into either sequencing adapters or PCR primers.

  • Endogenous barcodes describe the randomly or semi-randomly generated fragmentation points at the ends of DNA molecules in ligation-based library preparation methods. [Note: I did not fully understand this point.]

  • The two approaches can be used either alone or in combination.

  • With either approach, it is important that a sufficient variety of possible tag sequences exist such that the probability of two independent molecules being tagged the same way is low.

  • With sequencing at low molecular depth, the chance of two independent DNA fragments having the same shear points by chance is small, and these endogenous sequences alone suffice as tags.

  • At the other extreme is deep sequencing following an amplicon-based library preparation.

  • In this case, molecular ends are defined by invariant primer sites, not random fragmentation, so all tag information must come from exogenous tags.

  • A similar problem arises with targeted enzymatic fragmentation.

  • If barcode diversity is inadequate, tag clashes can occur, whereby independent molecules are identically labelled.

  • In this scenario, true low-frequency variants can be erroneously discarded as errors.

  • If barcodes are too complex [Note: that is, too long], they may develop errors themselves and artificially create false families that incorrectly appear as arising from distinct molecules. [Note: a family is the set of reads sharing the same barcode; an error in the barcode sequence splits one true family into several.]

  • Both problems can be mitigated with careful design and strategies for tolerating errors in barcodes.

  • Over the past 5 years, molecular consensus sequencing has proved itself as the most impactful means for reducing NGS errors.

  • Different implementations variably reduce sequencing error rates from ~$10^{-2}$ to between $10^{-4}$ and $10^{-7}$ or lower.

  • The variety of approaches developed to date can be grouped into three basic categories:

    • single-strand consensus sequencing;

    • two-strand consensus sequencing;

    • and duplex consensus sequencing (FIG. 2). [Note: what is duplex consensus sequencing?]
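The tag-based error correction principle from the list above (group reads by tag, then discount disagreeing sites when forming a consensus) can be illustrated with a small Python sketch. This is only a toy model, not the method of any particular tool: the function names, the minimum family size of 3, the 70% agreement threshold, and the example reads are all assumptions made for illustration, and the reads in a family are assumed to be already aligned and of equal length.

```python
from collections import Counter, defaultdict


def group_by_tag(reads):
    """Group (tag, sequence) pairs into families keyed by their tag."""
    families = defaultdict(list)
    for tag, seq in reads:
        families[tag].append(seq)
    return dict(families)


def consensus(seqs, min_size=3, min_agreement=0.7):
    """Column-wise majority vote over one family.

    Returns None when the family is too small to trust (assumed cutoff),
    and writes 'N' at positions where no base reaches min_agreement.
    """
    if len(seqs) < min_size:
        return None
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("reads in a family must have equal length")
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(seqs) >= min_agreement else "N")
    return "".join(out)


# Hypothetical reads: (barcode, read sequence)
reads = [
    ("AACGT", "ACGTACGT"),
    ("AACGT", "ACGTACGT"),
    ("AACGT", "ACGAACGT"),  # one copy carries an error at index 3
    ("AACGT", "ACGTACGT"),
    ("GGTCA", "ACGTACCT"),
    ("GGTCA", "ACGTACCT"),
]

families = group_by_tag(reads)
sscs = {tag: consensus(seqs) for tag, seqs in families.items()}
print(sscs)
```

In this example the single discordant base in the `AACGT` family is outvoted, while the `GGTCA` family is dropped because two copies are too few to form a consensus. This also shows why the approach needs redundant copies and therefore extra raw sequencing depth.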

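Duplex Sequencing builds on the principles of NGS. Tags are attached independently to both ends of each fragment, so that after PCR amplification each of the two complementary single strands forms a read family that can be recognized by its unique tags. Errors are then removed first by correction within each strand and then by cross-checking the two strands against each other, which lowers the error rate.

If both complementary strands are intact, a true mutation should be present on both strands, whereas a random error introduced by PCR or sequencing appears on only one strand. A change that appears on only one strand may also reflect damage to the DNA duplex, and such sites can later be used to analyze where DNA damage occurred.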
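The cross-strand check described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published Duplex Sequencing pipeline: the function names are hypothetical, and the two single-strand consensus sequences are assumed to be already oriented to the same reference strand, equal in length, and paired to each other.

```python
def duplex_consensus(top, bottom):
    """Keep a base only where both strand consensuses agree; otherwise write 'N'."""
    if len(top) != len(bottom):
        raise ValueError("strand consensuses must have equal length")
    return "".join(
        a if a == b and a != "N" else "N" for a, b in zip(top, bottom)
    )


def strand_specific_sites(top, bottom):
    """Indices where the two strands disagree on a called base.

    Such sites are candidate PCR or sequencing errors, or possibly
    single-strand DNA damage.
    """
    return [
        i
        for i, (a, b) in enumerate(zip(top, bottom))
        if a != b and "N" not in (a, b)
    ]


# Hypothetical single-strand consensus sequences from one original duplex
top_sscs = "ACGTTCGT"     # differs from its mate at index 4
bottom_sscs = "ACGTACGT"

dcs = duplex_consensus(top_sscs, bottom_sscs)
sites = strand_specific_sites(top_sscs, bottom_sscs)
print(dcs, sites)
```

A change supported by only one strand is masked in the duplex consensus, and its position is reported separately so that it could be examined as a possible damage site.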

image-20200217235255228

Schematic of the Duplex Sequencing principle

Reference

Salk J J, Schmitt M W, Loeb L A. Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations[J]. Nature Reviews Genetics, 2018, 19(5): 269.

http://www.360doc.com/content/18/1215/12/52645714_801952631.shtml