表观基因组学研究方法——ATAC-seq

date: 2023-11-16 update slug: ATAC-seq key: ATAC-seq,Bioinformatics,Epigenetics ref: https://informatics.fas.harvard.edu/atac-seq-guidelines cover:

by 赵华男

ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing)

  • peak calling
  • peak independent / dependent quality metrics
  • differential accessibility analysis

  • ATAC-seq is a method for determining chromatin accessibility across the genome. It utilizes a hyperactive Tn5 transposase to insert sequencing adapters into open chromatin regions. High-throughput sequencing then yields reads that indicate these regions of increased accessibility. ATAC-seq是一种确定整个基因组染色质可及性的方法。它利用高活性的Tn5转座酶将测序接头插入开放的染色质区域。然后,高通量测序产生指示这些区域可访问性增加的读取。

Overview of ATAC-seq (Buenrostro et al., 2015)

⭐️bulk ATAC-seq

General QC

workflow

  • Data processing
    • Remove low quality alignments 删除低质量的对齐
    • Remove duplicate reads 删除重复读取
    • Remove blacklist regions 删除黑名单区域
  • Peak Independent Quality Control 与peak无关的质量控制
    • Collect metrics(mapping、filtering) 收集指标(mapping、filtering)
    • Replicate concordance 重复间的一致性
      • ChIP-seq specific metrics ChIP-seq特殊指标
      • ATAC-seq specific metrics ATAC-seq特殊指标

The aim of this part is to perform general signal structure agnostic (i.e. peak - independent) quality control (lower-right part of the concept map). 这一部分的目的是执行与信号结构无关(即与peak无关)的常规质量控制(概念图的右下部分)。

These include: assessment of read coverage along the genome 评估基因组的读取覆盖率 replicate congruency 重复间的一致性

plotFingerprint

plotFingerprint \
    --bamfiles \
        ENCFF363HBZ.chr14.proc.bam \
        ENCFF398QLV.chr14.proc.bam \
        ENCFF045OAB.chr14.proc.bam \
        ENCFF828ZPN.chr14.proc.bam \
    --binSize=1000 \
    --plotFile NKcellsATAC_chr14.fingerprint.pdf \
    --labels \
        ENCFF363HBZ \
        ENCFF398QLV \
        ENCFF045OAB \
        ENCFF828ZPN \
    -p 8 \
    &> fingerprint.log

Have a look at NKcellsATAC_chr14.fingerprint.pdf, read deepTools What the plots tell you: 

An ideal input with perfect uniform distribution of reads along the genome (i.e. without enrichments in open chromatin etc.) and infinite sequencing coverage should generate a straight diagonal line. 在理想情况下,如果input的reads沿着基因组均匀分布(即,没有在open chromatin富集的input),则该样本会在途中产生一条直线对角线. A very specific and strong ChIP enrichment will be indicated by a prominent and steep rise of the cumulative sum towards the highest rank. 一个非常specific和strong的 ChIP 富集将通过累积总和向最高等级的显着和急剧上升来表示。(就像图中表述的一样陡峭的上升) This means that a big chunk of reads from the ChIP sample is located in few bins which corresponds to high, narrow enrichments typically seen for transcription factors. 这意味着来自 ChIP 样品的大量reads分布于少量的bins,这对应于转录因子通常看到的peak很高且window很窄的富集模式。

and answer * does it indicate a good sample quality, i.e. signal present in narrow regions?

Fingerprint for ATAC-seq signal in NK cells.

Replicate Clustering

To assess overall similarity between libraries from different samples one can compute sample clustering heatmaps using multiBamSummary and plotCorrelation in bins mode from deepTools. 要评估来自不同样本的库之间的整体相似性,可以使用 deepTools 的 multiBamSummary 和 plotCorrelation 在箱模式下计算样本聚类热图。

In this method the genome is divided into bins of specified size (--binSize parameter) and reads mapped to each bin are counted. The resulting signal profiles are used to cluster libraries to identify groups of similar signal profile. 在这种方法中,基因组被划分为指定大小的箱(--binSize参数),并计数映射到每个箱的读取。生成的信号配置文件用于对库进行聚类,以识别相似信号配置文件的组。

We chose to compute pairwise Spearman correlation coefficients for this step, as they are based on ranks of each bin rather than signal values. 我们选择在此步骤中计算成对的 Spearman 相关系数,因为它们基于每个箱的秩而不是信号值。

In this part we use bam files filtered previously, to save time. 在这一部分中,我们使用之前过滤的bam文件,以节省时间。

multiBamSummary bins \
    --bamfiles \
        ENCFF363HBZ.chr14.proc.bam \
        ENCFF398QLV.chr14.proc.bam \
        ENCFF045OAB.chr14.proc.bam \
        ENCFF828ZPN.chr14.proc.bam \
    --labels \
        ENCFF363HBZ \
        ENCFF398QLV \
        ENCFF045OAB \
        ENCFF828ZPN \
    --outFileName multiBamArray_NKcellsATAC_chr14.npz \
    --binSize 5000 \
    -p 8 \
    &> multiBamSummary.log

plotCorrelation \
    --corData multiBamArray_NKcellsATAC_chr14.npz \
    --plotFile NKcellsATAC_chr14_correlation_bin.pdf \
    --outFileCorMatrix NKcellsATAC_chr14_correlation_bin.txt \
    --whatToPlot heatmap \
    --corMethod spearman

You can copy the resulting file to your local system to view it.

What do you think? which samples are similar? are the clustering results as you would have expected them to be?

Correlation of binned ATAC-seq signal in NK cells.

Correlation of binned ATAC-seq signal in NK cells.

In addition to these general procedures, several specialised assay - specific quality metrics exist, which probe signal characteristics related to each method. These are key QC metrics to evaluate the experiment and should always be colleced during the QC step. The method specific tutorials are: ATACseq and ChIPseq.除了这些一般程序之外,还存在几种专门的检测 - 特定的质量指标,这些指标探测与每种方法相关的信号特征。这些是评估实验的关键QC指标,应始终在QC步骤中收集。特定于方法的教程是:ATACseq 和 ChIPseq。

We can now follow with ATACseq specifc QC methods.我们现在可以遵循ATACseq特定的QC方法。

Specifc QC methods for ATACseq

This tutorial is a continuation of General QC

The aim of this part of the data analysis workflow is to collect ATAC-seq specific quality metrics: fragment length distribution; presence of signal in nuclesome-free regions (NFR) and mononucleosome fractions; * enrichment of signal in transcription start site (TSS) regions.

Fragment Length Distribution

In ATAC-seq experiments, tagmentation of Tn5 transposases produces signature size pattern of fragments derived from nucleosome-free regions (NFR), mononucleosome, dinucleosome, trinucleosome and longer oligonucleosome from open chromatin regions (Figure below, adapted from Li et al ). 在ATAC-seq实验中,Tn5转座酶的标记产生来自无核小体区域(NFR, nucleosome-free regions),单核小体,二核小体,三核小体和来自开放染色质区域的较长寡核小体的片段的特征大小模式(下图,改编自Li等人)。

Please note the pre-filtered BAM files need to be used to get an unbiased distribution of insert fragment size in the ATAC-seq library. 请注意,需要使用预过滤的 BAM 文件来获取 ATAC-seq 库中插入片段大小的无偏分布。

Lietal_atacTn5

To compute fragment length distribution for processed bam file in our ATAC-seq data set (assuming we are in drectory analysis): 要计算 ATAC-seq 数据集中已处理 bam 文件的片段长度分布(假设我们处于analysis路径中):

mkdir QC
cd QC

ln -s ../processedData/ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam  .
ln -s ../processedData/ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam.bai  .

module load picard/2.23.4

java -Xmx63G -jar $PICARD_HOME/picard.jar CollectInsertSizeMetrics \
 -I ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam \
 -O ENCFF045OAB.chr14.proc.fraglen.stats \
 -H ENCFF045OAB.chr14.proc.fraglen.pdf -M 0.5

You can copy the resulting file to your local system to view it.

Have a look at ENCFF045OAB.chr14.proc.fraglen.pdf, and answer does it indicate a good sample quality? is the chromatin structure preserved? what do the periodic peaks correspond to?

Fragment length histogram of ATAC-seq signal in NK cells.

ENCFF045OAB.chr14.proc.fraglen.pdf

View the resulting histogram of insert sizes SRR891268_insert_size_histogram.pdf. Generating this important QC plot is only possible for PE libraries. Could you guess what the peaks at approximately 50bp, 200bp, 400bp and 600bp correspond to?

To give some context compare to plots on Figure 2.

Naked DNA Failed ATAC-seq Noisy ATAC-seq Successful ATAC-seq
Naked DNA Failed ATAC-seq Noisy ATAC-seq Successful ATAC-seq

Figure 2. Examples of insert size distribution for ATAC-seq experiments.

Data Preparation for Probing Signal at TSS

We will be working in R in this section. First, we load the required version together with libraries:

library(ATACseqQC)
library(BSgenome.Hsapiens.UCSC.hg38)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(ChIPpeakAnno)
library(Rsamtools)

# We can now give the path to the processed bam file:

bamFile="ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam"
bamFileLabels <- "ENCFF045OAB"

# We collect library statistics:
bam_qc=bamQC(bamFile, outPath = NULL)

# We can now inspect the statistics:
bam_qc[1:10]

# The output:
bam_qc[1:10]

$totalQNAMEs    12478862
$duplicateRate  0
$mitochondriaRate   6.82037272614791e-07
$properPairRate 0.997540814313988
$unmappedRate   0
$hasUnmappedMateRate    0.00129988280192466
$notPassingQualityControlsRate  0
$nonRedundantFraction   1.00037904097345
$PCRbottleneckCoefficient_1 0.999988545094877
$PCRbottleneckCoefficient_2 120034.538461538

Most of these values are meaningless at this point, as we have already processed the bam file. 此时,这些值中的大多数都没有意义,因为我们已经处理了 bam 文件。

Shiftig and Splitting Aligned Reads

Tagmentation by Tn5 transposase produces 5’ overhang of 9 base long, the coordinates of reads mapping to the positive and negative strands need to be shifted by + 4 and - 5, respectively, to account for the 9-bp duplication created by DNA repair of the nick by Tn5 transposase and achieve base-pair resolution of TF footprint and motif-related analyses.Tn5转座酶的标记产生9个碱基长的5'突出,映射到正链和负链的读段坐标需要分别移动+ 4和-5,以解释Tn5转座酶对切口的DNA修复产生的9-bp重复,并实现TF足迹和基序相关分析的碱基对分辨率。

We perform it at this point to plot signal at TSS, and we save the resulting object for later use.此时,我们执行它以在 TSS 处绘制信号,并保存结果对象供以后使用。

We create a directory where the processed bam files will be saved:我们创建一个目录,其中将保存已处理的 bam 文件:

## files will be output into outPath
outPath <- "splitBam"
dir.create(outPath)

# First, we collect information on which SAM/BAM tags are present in our bam file: 首先,我们收集有关 bam 文件中存在哪些 SAM/BAM 标签的信息:
possibleTag <- combn(LETTERS, 2)
possibleTag <- c(paste0(possibleTag[1, ], possibleTag[2, ]),
                 paste0(possibleTag[2, ], possibleTag[1, ]))

bamTop100 <- scanBam(BamFile(bamFile, yieldSize = 100),
                     param = ScanBamParam(tag = possibleTag))[[1]]$tag
tags <- names(bamTop100)[lengths(bamTop100)>0]

# We shift the coordinates only for alignments on chr14, which is where most of our data is:# 我们只在 chr14 上移动坐标,这是我们大部分数据所在的位置:
seqlev <- "chr14"
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")

# We create an object with genomic alignments:
gal <- readBamFile(bamFile, tag=tags, which=which,asMates=TRUE, bigFile=TRUE)

# This object is empty, because we used bigFile=TRUE - this is expected, so do not be alarmed.
#The function shiftGAlignmentsList in the ATACseqQC package is used for shifting the alignments:
# 这个对象是空的,因为我们使用了 bigFile=TRUE - 这是意料之中的,所以不要惊慌。 
#The ATACseqQC 包中的函数 shiftGAlignmentsList 用于移动对齐方式:
shiftedBamFile <- file.path(outPath, "shifted.bam")
gal1 <- shiftGAlignmentsList(gal, outbam=shiftedBamFile)

### save the GRanges object for future use
saveRDS(gal1, file = "gal1.rds", ascii = FALSE, version = NULL,compress = TRUE, refhook = NULL)

Next, we split the shifted alignments into different fractions by length (nucleosome free, mononucleosome, dinucleosome, and trinucleosome). 接下来,我们按长度将转移的比对分解成不同的部分(游离核小体、单核小体、二核小体和三核小体)。

Shifted reads that do not fit into any of the above bins can be discarded.不适合上述任何箱的移位读取可以丢弃。

Splitting reads is a time-consuming step because we are using random forest to classify the fragments based on fragment length and GC content.拆分读取是一个耗时的步骤,因为我们使用随机森林根据片段长度和GC含量对片段进行分类。

By default, we assign the top 10% of short reads (reads below 100_bp) as nucleosome-free regions and the top 10% of intermediate length reads as (reads between 180 and 247 bp) mononucleosome. This serves as the training set to classify the rest of the fragments using random forest.默认情况下,我们将前 10% 的短读段(低于 100_bp 的读段)指定为无核小体区域,将前 10% 的中间长度读段指定为(读数在 180 到 247 bp 之间)单核小体。这用作使用随机森林对其余片段进行分类的训练集。

We need genomic locations of TSS:我们需要 TSS 的基因组位置:

txs <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
txs <- txs[seqnames(txs) %in% "chr14"]
genome <- Hsapiens

# We split the alignments (this process takes a few minutes):
objs <- splitGAlignmentsByCut(gal1, txs=txs, genome=genome, outPath = outPath)
# When done, we save the object for later use:
saveRDS(objs, file = "objs.rds", ascii = FALSE, version = NULL,compress = TRUE, refhook = NULL)

Signal in NFR and Mononucleosome Fractions

# Files we are going to use and TSS coordinates:

bamFiles <- file.path(outPath,
                     c("NucleosomeFree.bam",
                     "mononucleosome.bam",
                     "dinucleosome.bam",
                     "trinucleosome.bam"))

TSS <- promoters(txs, upstream=0, downstream=1)
TSS <- unique(TSS)

# Calculate and log2 transform the signal around TSS:

librarySize <- estLibSize(bamFiles)

NTILE <- 101
dws <- ups <- 1010
sigs <- enrichedFragments(gal=objs[c("NucleosomeFree",
                                     "mononucleosome",
                                     "dinucleosome",
                                     "trinucleosome")],
                          TSS=TSS,
                          librarySize=librarySize,
                          seqlev=seqlev,
                          TSS.filter=0.5,
                          n.tile = NTILE,
                          upstream = ups,
                          downstream = dws)

sigs.log2 <- lapply(sigs, function(.ele) log2(.ele+1))

# We can now save the heatmap:

pdf("Heatmap_splitbam.pdf")
featureAlignedHeatmap(sigs.log2, reCenterPeaks(TSS, width=ups+dws),
                      zeroAt=.5, n.tile=NTILE)

dev.off()

What are the differences in the signal profile in these two fractions? Why do we observe them? 这两个分数的信号分布有什么不同?我们为什么要观察它们?

Heatmap_splitbam.pdf

Signal at TSS

# We can now calculate signal distribution at TSS:
out <- featureAlignedDistribution(sigs,
                                  reCenterPeaks(TSS, width=ups+dws),
                                  zeroAt=.5, n.tile=NTILE, type="l",
                                  ylab="Averaged coverage")

## rescale the nucleosome-free and nucleosome signals to 0~1 for plotting 将无核小体和核小体信号rescale为 0~1 进行绘制
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
out <- apply(out, 2, range01)

# And plot it:
pdf("TSSprofile_splitbam.pdf")
        matplot(out, type="l", xaxt="n",
        xlab="Position (bp)",
        ylab="Fraction of signal")
        axis(1, at=seq(0, 100, by=10)+1,
     labels=c("-1K", seq(-800, 800, by=200), "1K"), las=2)
        abline(v=seq(0, 100, by=10)+1, lty=2, col="gray")
dev.off()

What are the differences in the signal profile in these two fractions? Why do we observe them? 这两个分数的信号分布有什么不同?我们为什么要观察它们?

TSSprofile_splitbam.pdf

Signal Visualisation Using IGV

In this part we will look more closely at our data, which is a good practice, as data summaries can be at times misleading. In principle we could look at the data on Uppmax using installed tools but it is much easier to work with genome browser locally. If you have not done this before the course, install Interactive Genome Browser IGV. 在这一部分中,我们将更仔细地研究我们的数据,这是一个很好的做法,因为数据摘要有时会产生误导。原则上,我们可以使用已安装的工具查看Uppmax上的数据,但是在本地使用基因组浏览器要容易得多。如果您在课程之前没有这样做,请安装交互式基因组浏览器IGV。

We would like to visualise processed alignments (bam and corresponding bai) at several loci with strong signal. We can also view the bam files split into nucleosome-free, mono- di- and tri- nucleosome fractions. Data has been mapped to hg38.我们希望在具有强信号的几个位点上可视化处理过的对齐(bam 和相应的 bai)。我们还可以看到将bam文件分成无核小体,单二核小体和三核小体部分。数据已映射到 hg38。

We will need the following files: atacseq/analysis/processedData/ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam and bai atacseq/analysis/QC/splitBam/NucleosomeFree.bam and bai atacseq/analysis/QC/splitBam/mononucleosome.bam and bai atacseq/analysis/QC/splitBam/dinucleosome.bam and bai * atacseq/analysis/QC/splitBam/trinucleosome.bam and bai

You will have to zoom in to view the alignments and coverage tracks.

Change the default viewing settings in IGV by shift-clicking onto a track name (left panel):

bam tracks: alignment view to Squished colour alignments to insert size and pair orientation

coverage tracks: * pay attention to track scale; it is set to Auto; the tracks won’t show at the same scale, you can harmonise the scale if you want to see the differences in signal height

You can move long the chromosome 14 and you will spot locations with high signal density.

Examples: chr14:61,513,568-61,543,546 chr14:90,365,635-90,395,613 chr14:92,508,638-92,538,616 chr14:93,095,621-93,125,599

An example is shown on the Figure below (we can skip discussing the peak intervals for now).

igv_qc_split1.png

Bonus question 额外问题 * Why does the NFR track show unusually high fraction of discordant alignments (labeled green)? 为什么 NFR 轨道显示异常高比例的不一致对齐(标记为绿色)?

After the QC performed in this tutorial and in general QC, we can now move to ATAC-seq data analysis 在本教程中执行 QC 和一般 QC 之后,我们现在可以转向 ATAC-seq 数据分析

Peak Calling

We will use the same data as before: ATAC-seq libraries prepared to analyse chromatin accessibility status in natural killer (NK) cells without and with stimulation (in duplicates) from the ENCODE project. 我们将使用与以前相同的数据:ATAC-seq文库准备分析自然杀伤(NK)细胞中的染色质可及性状态,而无需ENCODE项目的刺激(一式两份)。

Natural killer (NK) cells are innate immune cells that show strong cytolytic function against physiologically stressed cells such as tumor cells and virus-infected cells. NK cells express several activating and inhibitory receptors that recognize the altered expression of proteins on target cells and control the cytolytic function. To read more about NK cells please refer to Paul and Lal . The interleukin cocktail used to stimulate NK cells induces proliferation and activation (Lauwerys et al ). 自然杀伤(NK)细胞是先天免疫细胞,对生理应激细胞(如肿瘤细胞和病毒感染细胞)表现出很强的细胞溶解功能。NK细胞表达几种活化和抑制受体,这些受体识别靶细胞上蛋白质表达的改变并控制细胞溶解功能。要了解有关NK细胞的更多信息,请参阅Paul和Lal。用于刺激NK细胞的白介素混合物诱导增殖和活化(Lauwerys等人)。

ENCODE sample accession numbers are listed in Table 1.

Table 1. ENCODE accession numbers for data set used in this tutorial. | No | Accession | Cell type | Description | | ---- | ----------- | --------------------------------------------- | ------------------------------------------------------------ | | 1 | ENCFF398QLV | Homo sapiens natural killer cell female adult | untreated | | 2 | ENCFF363HBZ | Homo sapiens natural killer cell female adult | untreated | | 3 | ENCFF045OAB | Homo sapiens natural killer cell female adult | Interleukin-18, Interleukin-12-alpha, Interleukin-12-beta, Interleukin-15 | | 4 | ENCFF828ZPN | Homo sapiens natural killer cell female adult | Interleukin-18, Interleukin-12-alpha, Interleukin-12-beta, Interleukin-15 |

We have processed the data, starting from reads aligned to hg38 reference assembly using bowtie2. The alignments were obtained from ENCODE in bam format and further processed:

  • alignments were subset to include chromosome 14 and 1% of reads mapped to chromosomes 1 to 6 and chrM.

This allows you to see a realistic coverage of one selected chromosome and perform data analysis while allowing shorter computing times. Non-subset ATAC-seq data contains 100 - 200 M PE reads, too many to conveniently process during a workshop.

In this workshop, we have filtered and quality-controlled the data (parts Data preprocessing, General QC, and ATACseq specifc QC). To find regions corresponding to potential open chromatin, we want to identify ATAC-seq “peaks” where reads have piled up to a greater extent than the background read coverage.

The tools which are currently used are Genrich and MACS2.

  • Genrich has a mode dedicated to ATAC-Seq; however, Genrich is still not published; 目前使用的工具是Genrich和MACS2。 Genrich有一个专用于ATAC-Seq的模式;但是,Generich仍未出版;
  • MACS2 is widely used so lots of help is available online; however it is designed for ChIP-seq rather than ATAC-seq (MACS3 has more ATAC-seq oriented features, but still lacks a stable release); MACS2 被广泛使用,因此可以在线获得大量帮助;然而,它是为 ChIP-seq 而不是 ATAC-seq 设计的(MACS3 具有更多面向 ATAC-seq 的功能,但仍然缺乏稳定的版本);

The differences between these two peak callers in the context of ATAC-seq data are discussed here. 讨论了在 ATAC-seq 数据上下文中这两个峰值调用者之间的差异值得一看,还是用Genrich吧?

Shifting Alignments

We have already discussed (and performed) this step in the ATACseq specifc QC tutorial. Briefly, the alignments are shifted to account for the duplication created as a result of DNA repair after Tn5-introduced DNA nicks.我们已经在 ATACseq 特定的 QC 教程中讨论(并执行)了此步骤。简而言之,比对被转移以解释Tn5引入DNA切口后DNA修复产生的重复。

When Tn5 cuts an accessible chromatin locus it inserts adapters separated by 9bp, see Figure 2. This means that to have the read start site reflect the centre of where Tn5 bound, the reads on the positive strand should be shifted 4 bp to the right and reads on the negative strand should be shifted 5 bp to the left as in Buenrostro et al. 2013. 当Tn5切割可访问的染色质位点时,它会插入以9bp分隔的adapters,参见图2。这意味着要使reads开始位点反映Tn5结合的中心,正链上的reads应向右移动4 bp,而负链上的reads应向左移动5 bp,如Buenrostro等人2013年。重要信息,需要注意!

Figure 2. Nextera Library Construction.

To shift or not to shift? It, as always, depends on the downstream application of the alignments. 移位还是不移位? 与往常一样,这取决于对齐的下游应用。

If we use the ATAC-seq peaks for differential accessibility, and detect the peaks in the “broad” mode, then shifting does not play any role: the peaks are hundreds of bps long, reads are summarised to these peaks allowing a partial overlap, so 9 basepairs of difference has a neglibile effect. 如果我们使用ATAC-seq峰进行differential accessibility分析,并在“宽”模式下检测峰,则移位不会起任何作用:峰长数百bps,读数汇总到这些峰,允许部分重叠,因此9个差异碱基对具有忽略不计的影响。

However, when we plan to use the data for any nucleosome-centric analysis (positioning at TSS or TF footprinting), shifting the reads allows to center the signal in peaks around the nucleosomes and not directly on the nucleosome. 然而,当我们计划将数据用于任何以核小体为中心的分析**(定位在TSS或TF足迹)时,移动读数可以将信号集中在核小体周围的峰上,而不是直接在核小体上。

If we only assess the coverage of the start sites of the reads, the data would be too sparse and it would be impossible to call peaks. Thus, we will extend the start sites of the reads by 100bp (50 bp in each direction) to assess coverage. This is performed automatically by Genrich, and using command line options extsize and shift in MACS2. 如果我们只评估reads起始位点的覆盖范围,数据将过于稀疏,并且无法调用峰值。因此,我们将读取的起始位点延长 100bp(每个方向 50 bp)以评估覆盖范围。这是由Genrich自动执行的,并在MACS2中使用命令行选项“extsize”和“shift”。 这里因为我们的读长是150bp,或许需要将100bp换为150bp

Genrich

We start this tutorial in directory atacseq/analysis/:

mkdir peaks
cd peaks

We need to link necessary files first.

mkdir genrich
cd genrich

# we link the pre-processed bam file
ln -s ../../processedData/ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam ENCFF045OAB.chr14.proc.bam
ln -s ../../processedData/ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam.bai ENCFF045OAB.chr14.proc.bam.bai

Genrich requires bam files to be name-sorted rather than the default coordinate-sorted. Also, we remove all reference sequences other than chr14 from the header, as this is where our data is subset to. Genrich uses the reference sequence length from bam header in its statistical model, so retaining the original bam header would impair peak calling statistics. Grich要求对bam文件进行名称排序,而不是默认的坐标排序。此外,我们从标头中删除除 chr14 以外的所有参考序列,因为这是我们数据子集的位置。Grich在其统计模型中使用来自bam标头的引用序列长度,因此保留原始bam标头会损害峰值调用统计信息。

# in case not already loaded
module load bioinfo-tools
module load samtools/1.8


#subset bam and change header
samtools view -h ENCFF045OAB.chr14.proc.bam chr14 | grep -P "@HD|@PG|chr14" | samtools view -Shbo ENCFF045OAB.chr14.proc_rh.bam


# sort the bam file by read name (required by Genrich)
samtools sort -n -o ENCFF045OAB.chr14.proc_rh.nsort.bam -T sort.tmp  ENCFF045OAB.chr14.proc_rh.bam

Genrich can apply the read shifts when ATAC-seq mode -j is selected. We detect peaks by: Grich可以在选择ATAC-seq模式“-j”时应用读取移位。我们通过以下方式检测峰值:

/sw/courses/epigenomics/ATACseq_bulk/software/Genrich/Genrich -j -t ENCFF045OAB.chr14.proc_rh.nsort.bam  -o ENCFF045OAB.chr14.genrich.narrowPeak

The output file produced by Genrich is in ENCODE narrowPeak format, listing the genomic coordinates of each peak called and various statistics. 列出每个峰的基因组坐标和各种统计数据。

chr start end name score strand signalValue pValue qValue peak

signalValue - Measurement of overall (usually, average) enrichment for the region.
pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.

How many peaks were detected?

wc -l ENCFF045OAB.chr14.genrich.narrowPeak
1027 ENCFF045OAB.chr14.genrich.narrowPeak

ENCFF045OAB.chr14.genrich.narrowPeak

head ENCFF045OAB.chr14.genrich.narrowPeak
chr14   18332340        18333050        peak_0  490     .       347.820770      3.316712        -1      372
chr14   18583390        18584153        peak_1  1000    .       1267.254150     6.908389        -1      474
chr14   19061839        19062647        peak_2  732     .       591.112671      4.484559        -1      472
chr14   19337360        19337831        peak_3  1000    .       517.304626      4.484559        -1      373
chr14   19488517        19489231        peak_4  393     .       280.354828      2.916375        -1      210
chr14   20216750        20217291        peak_5  1000    .       625.121826      4.537151        -1      441

MACS2

MACS2 is widely used for peak calling in ATAC-seq, as evidenced by literature and many data processing pipelines. Several different peak calling protocols / commands have been encountered in various sources (and more combinations of parameters exist):MACS2 广泛用于 ATAC-seq 中的峰值调用,文献和许多数据处理管道证明了这一点。在各种来源中遇到了几种不同的峰值调用协议/命令(并且存在更多参数组合):

  1. macs narrow peak (default for callpeak), --nomodel, shifted reads using PE reads as SE (BED file);
  2. macs narrow peak, --nomodel, shifted reads, using PE reads (BEDPE file);
  3. macs narrow peak, --nomodel, using PE reads (BEDPE file); shift and extsize similar to Genrich;
  4. macs broad peak, using PE reads (BAMPE file) as in nf-core;
  5. macs narrow peak, --nomodel, using PE reads as SE (BAM file) as in Encode, shift and extsize similar to Genrich;
  6. macs narrow peak, unshifted reads in BAMPE file
  7. macs broad peak, BAM (PE reads used as SE reads)

The peaks obtained by these above commands have been compared to peaks detected by Genrich, and examples are shown on Figures 3 - 6. 通过上述命令获得的峰已与Genrich检测到的峰进行比较,示例如图3-6所示。

Figure 3 depicts large genomic region. In general Genrich detects less peaks (shown in green) than MACS2 (navy). MACS2 commands 1, 2 and 3 result in many peaks in regions where Genrich detects none. MACS commands 4, 5, 6 and 7 produce less peaks which are somewhat similar to the result of Genrich. If we zoom in, we can see that commands 1, 2 and 3 detect spurious peaks which do not have strong evidence in alignment pipeups (Figures 4 to 6). 图3描绘了大的基因组区域。一般来说,Grich检测到的峰值(以绿色显示)比MACS2(海军)少。MACS2 命令 1、2 和 3 在 Grich 未检测到的区域产生许多峰值。MACS 命令 4、5、6 和 7 产生的峰值较少,这与 Genrich 的结果有些相似。如果我们放大,我们可以看到命令1,2和3检测到在对齐管道中没有有力证据的杂散峰(图4至图6)。

Figure 3. Comparison of peaks detected by different algorithms. Overview of a large genomic region.

Figure 4. Comparison of peaks detected by different algorithms.

Figure 5. Comparison of peaks detected by different algorithms

Figure 6. Comparison of peaks detected by different algorithms

So which method to choose? You can test them all, for this tutorial we selected 4 and 5, which we found had most support in read coverage in regions we inspected yet still produced some spurious peaks. We’ll call the results broad and narrow, respectively. We use -g 107043718 for peak calling because this is the length of chr14, which is the only one included in the bam file. 那么选择哪种方法呢?您可以测试它们,在本教程中,我们选择了 4 和 5,我们发现在我们检查的区域的读取覆盖率方面支持最多,但仍会产生一些杂散峰。我们将结果分别称为“广泛”和“狭窄”。我们使用 '-g 107043718' 表示峰值调用,因为这是 chr14 的长度,这是 bam 文件中唯一包含的长度。

mkdir ../macs
cd ../macs

module load MACS/2.2.6

macs2 callpeak -t ../genrich/ENCFF045OAB.chr14.proc_rh.nsort.bam \
-n ENCFF045OAB.chr14.macs.broad --broad -f BAMPE \
-g 107043718 -q 0.1 --nomodel  --keep-dup all

macs2 callpeak -t ../genrich/ENCFF045OAB.chr14.proc_rh.nsort.bam \
-n ENCFF045OAB.chr14.macs.narrow -f BAM \
-g 107043718 -q 0.05 --nomodel --shift -75 --extsize 150 \
--call-summits --keep-dup all

How to shift reads in BED files

If you would like to test the effect of shifting reads, this how you do it on bed and bedpe files:

bedtools bamtobed -bedpe -i ENCFF045OAB.chr14.bam >ENCFF045OAB.chr14_pe.bed
bedtools bamtobed -i ENCFF045OAB.chr14.bam >ENCFF045OAB.chr14.bed

cat ENCFF045OAB.chr14.bed | awk -F $'\t' 'BEGIN {OFS = FS}{ if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} print $0}' >| ENCFF045OAB.chr14.proc.tn5.bed

cat ENCFF045OAB.chr14_pe.bed | awk -F $'\t' 'BEGIN {OFS = FS} {$2 = $2 + 4; $6 = $6 - 5; print $0}' >| ENCFF045OAB.chr14.proc.tn5.pe.bed

How many peaks were detected?

wc -l *Peak
 2011 ENCFF045OAB.chr14.macs.broad_peaks.broadPeak
 2428 ENCFF045OAB.chr14.macs.narrow_peaks.narrowPeak

Quite similar number of peaks for both methods, and double than what Genrich has detected. 两种方法的峰数非常相似,是Grich检测到的两倍。

ENCFF045OAB.chr14.macs.broad_peaks.broadPeak

head ENCFF045OAB.chr14.macs.broad_peaks.broadPeak
chr14   18674026        18674550        ENCFF045OAB.chr14.macs.broad_peak_1     21      .       3.15137 4.25082 2.12270
chr14   19096643        19097148        ENCFF045OAB.chr14.macs.broad_peak_2     31      .       3.61047 5.39085 3.19165
chr14   19098499        19098851        ENCFF045OAB.chr14.macs.broad_peak_3     16      .       2.95556 3.77788 1.68306
chr14   19105556        19105809        ENCFF045OAB.chr14.macs.broad_peak_4     27      .       3.42266 4.93515 2.76765
chr14   19161075        19162012        ENCFF045OAB.chr14.macs.broad_peak_5     24      .       3.29408 4.58492 2.42759
chr14   19172872        19173211        ENCFF045OAB.chr14.macs.broad_peak_6     22      .       3.19527 4.34825 2.20892

ENCFF045OAB.chr14.macs.narrow_peaks.narrowPeak

cd ..

module load BEDTools/2.25.0

bedtools intersect -a macs/ENCFF045OAB.chr14.macs.broad_peaks.broadPeak  -b genrich/ENCFF045OAB.chr14.genrich.narrowPeak  -f 0.50 -r >peaks_overlap.macs_broad.genrich.bed


bedtools intersect -a macs/ENCFF045OAB.chr14.macs.narrow_peaks.narrowPeak -b genrich/ENCFF045OAB.chr14.genrich.narrowPeak  -f 0.50 -r >peaks_overlap.macs_narrow.genrich.bed

wc -l *bed
   747 peaks_overlap.macs_broad.genrich.bed
 1613 peaks_overlap.macs_narrow.genrich.bed

Comparing results of MACS and Genrich

How many peaks actually overlap?

cd ..

module load BEDTools/2.25.0

bedtools intersect -a macs/ENCFF045OAB.chr14.macs.broad_peaks.broadPeak  -b genrich/ENCFF045OAB.chr14.genrich.narrowPeak  -f 0.50 -r >peaks_overlap.macs_broad.genrich.bed


bedtools intersect -a macs/ENCFF045OAB.chr14.macs.narrow_peaks.narrowPeak -b genrich/ENCFF045OAB.chr14.genrich.narrowPeak  -f 0.50 -r >peaks_overlap.macs_narrow.genrich.bed

wc -l *bed
   747 peaks_overlap.macs_broad.genrich.bed
 1613 peaks_overlap.macs_narrow.genrich.bed

Fraction of Reads in Peaks

Fraction of Reads in Peaks (FRiP) is one of key QC metrics of ATAC-seq data. According to ENCODE ATACseq data standards acceptable FRiP is >0.2. This value of course depends on the peak calling protocol, and as we have seen in the previous section, the results may vary …a lot. However, it is worth to keep in mind that any samples which show different value for this (and other) metric may be problematic in the analysis. 峰中读数分数(FRiP)是ATAC-seq数据的关键QC指标之一。根据ENCODE ATACseq数据标准可接受的FRiP为>0.2。这个值当然取决于峰值调用协议,正如我们在上一节中看到的,结果可能会有所不同......好多。但是,值得记住的是,任何显示此(和其他)指标不同值的样本在分析中都可能存在问题。

To calculate FRiP we need alignment file (bam) and peak file (narrowPeak, bed). 要计算FRiP,我们需要对齐文件(bam)和峰值文件(窄峰,床)。

Assuming we are in peaks:

mkdir frip
cd frip

We will use a tool called featureCounts from package Subread. This tool accepts genomic intervals in formats gtf/gff and saf. Let’s convert narrow/ broadPeak to saf:

awk -F $'\t' 'BEGIN {OFS = FS}{ $2=$2+1; peakid="macsBroadPeak_"++nr;  print peakid,$1,$2,$3,"."}' ENCFF045OAB.chr14.macs.broad_peaks.broadPeak > ENCFF045OAB.chr14.macs.broad.saf

ENCFF045OAB.chr14.macs.broad.saf

macsBroadPeak_1 chr14   18674027        18674550        .
macsBroadPeak_2 chr14   19096644        19097148        .
macsBroadPeak_3 chr14   19098500        19098851        .
macsBroadPeak_4 chr14   19105557        19105809        .
macsBroadPeak_5 chr14   19161076        19162012        .
macsBroadPeak_6 chr14   19172873        19173211        .

We can now summarise reads:

ln -s ../genrich/ENCFF045OAB.chr14.proc_rh.nsort.bam

module load subread/2.0.0
featureCounts -p -F SAF -a ENCFF045OAB.chr14.macs.broad.saf --fracOverlap 0.2 -o ENCFF045OAB.peaks_macs_broad.counts ENCFF045OAB.chr14.proc_rh.nsort.bam

This command has produced reads summarised within each peak (which we won’t use) and a summary file ENCFF045OAB.peaks_macs_broad.counts.summary which contains values we are interested in:

Status  ENCFF045OAB.chr14.proc_rh.nsort.bam
Assigned        409482
Unassigned_Unmapped     0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality       0
Unassigned_Chimera      0
Unassigned_FragmentLength       0
Unassigned_Duplicate    0
Unassigned_MultiMapping 0
Unassigned_Secondary    0
Unassigned_NonSplit     0
Unassigned_NoFeatures   1247672
Unassigned_Overlapping_Length   6742
Unassigned_Ambiguity    1

409482/(1247672+6742+1+409482) = 0.246 (i.e. 24.6%)

featureCounts` reported in the output to the screen (STDOUT) that 24.6% reads fall within peaks, and this is FRiP for sample ENCFF045OAB.

Consensus Peaks

As our experiment has been replicated, we can select the peaks which were detected in both replicates. This removes non-reproducible peaks in regions of low coverage and other artifacts.由于我们的实验已经重复,我们可以选择在两个重复中检测到的峰。这样可以消除低覆盖率区域和其他伪影区域中不可重现的峰。

In this section we will work on peaks detected earlier using son-subset data.在本节中,我们将使用子集数据处理之前检测到的峰。
First we link necessary files:首先,我们链接必要的文件:

mkdir consensus
cd consensus

ln -s ../../../results/peaks/ENCFF045OAB.macs.broad_peaks.broadPeak
ln -s ../../../results/peaks/ENCFF363HBZ.macs.broad_peaks.broadPeak
ln -s ../../../results/peaks/ENCFF398QLV.macs.broad_peaks.broadPeak
ln -s ../../../results/peaks/ENCFF828ZPN.macs.broad_peaks.broadPeak

To recap, ENCFF398QLV and ENCFF363HBZ are untreated and ENCFF045OAB and ENCFF828ZPN are stimulated NK cells.

Let’s select peaks which overlap at their 50% length in both replicates (assumind we are in consensus):

module load BEDTools/2.25.0

bedtools intersect -a ENCFF398QLV.macs.broad_peaks.broadPeak -b ENCFF363HBZ.macs.broad_peaks.broadPeak  -f 0.50 -r >nk_peaks.bed
bedtools intersect -a ENCFF045OAB.macs.broad_peaks.broadPeak -b ENCFF828ZPN.macs.broad_peaks.broadPeak  -f 0.50 -r >nk_stim_peaks.bed

How many peaks?

wc -l *Peak
   51425 ENCFF045OAB.macs.broad_peaks.broadPeak
   54258 ENCFF363HBZ.macs.broad_peaks.broadPeak
   54691 ENCFF398QLV.macs.broad_peaks.broadPeak
   72067 ENCFF828ZPN.macs.broad_peaks.broadPeak

How many overlap?

wc -l *bed
  47156 nk_peaks.bed
  36606 nk_stim_peaks.bed

Merged Peaks

We can now merge the consensus peaks into peak sets used for downstream analyses.

bedops -m nk_peaks.bed nk_stim_peaks.bed > nk_merged_peaks.bed

The format of nk_merged_peaks.bed is very simple, 3-field BED file. Let’s add peaks ids and convert it to saf:

awk -F $'\t' 'BEGIN {OFS = FS}{ $2=$2+1; peakid="nk_merged_macsBroadPeak_"++nr;  print peakid,$1,$2,$3,"."}' nk_merged_peaks.bed > nk_merged_peaks.saf

awk -F $'\t' 'BEGIN {OFS = FS}{ $2=$2+1; peakid="nk_merged_macsBroadPeak_"++nr;  print $1,$2,$3,peakid,"0","."}' nk_merged_peaks.bed > nk_merged_peaksid.bed

These files can now be used in peak annotation and in comparative analyses, for example differential accessibility analysis.

We can now follow with downstream analyses: Peak Annotation, Differential Accessibility and TF footprinting.

Peak Annotation

Learning outcomes

Using ChIPseeker package

to profile ATAC-seq signal by genomics location and by proximity to TSS regions

to annotate peaks with the nearest gene

to run and compare functional enrichment

Introduction

In this tutorial we use an R / Bioconductor package ChIPseeker, to have a look at the ATAC / ChIP profiles, annotate peaks and visualise annotations. We will also perform functional annotation of peaks using clusterProfiler.

The tutorial is based on the ChIPseeker package tutorial so feel free to have this open alongside to read and experiment more.

Data & Methods

We will build upon the main labs:

ATAC-seq: all detected peaks (merged consensus peaks);

ATAC-seq: differentially accessible peaks;

Setting-up

You can continue working in the analysis/peaks/consensus directory.

We need access to file nk_merged_peaksid.bed and annotation libraries, whcih are preinstalled. We access them via:

module load R_packages/4.1.1

We activate R console upon typing R in the terminal.

We begin by loading necessary libraries:

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

library(GenomicAlignments)
library(GenomicFeatures)

library(clusterProfiler)
library(biomaRt)
library(org.Hs.eg.db)
library(ReactomePA)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

Peaks Coverage Plot

After peak calling one may want to visualise distribution of peaks locations over the whole genome. Function covplot calculates coverage of peaks regions over chromosomes. 在峰调用之后,人们可能希望可视化整个基因组中峰位置的分布。函数covplot计算染色体上峰区域的覆盖率。

Let’s load in the data (peaks called on nun subset data) and transform the data frame to GenomicRanges object (GRanges):让我们加载数据(在 nun 子集数据上调用的峰值)并将数据框转换为 GenomicRange 对象 (GRanges):

pth2peaks_bed="nk_merged_peaksid.bed"

peaks.bed=read.table(pth2peaks_bed, sep="\t", header=FALSE, blank.lines.skip=TRUE)
rownames(peaks.bed)=peaks.bed[,4]

peaks.gr <- GRanges(seqnames=peaks.bed[,1], ranges=IRanges(peaks.bed[,2], peaks.bed[,3]), strand="*", mcols=data.frame(peakID=peaks.bed[,4]))

If you are not familiar with GRanges objects, this is how the structure is:

> peaks.gr
GRanges object with 55108 ranges and 1 metadata column:
          seqnames              ranges strand |           mcols.peakID
             <Rle>           <IRanges>  <Rle> |            <character>
      [1]     chr1         10004-10442      * | nk_merged_macsBroadP..
      [2]     chr1         28945-29419      * | nk_merged_macsBroadP..
      [3]     chr1       180755-181858      * | nk_merged_macsBroadP..
      [4]     chr1       191246-191984      * | nk_merged_macsBroadP..
      [5]     chr1       778381-779290      * | nk_merged_macsBroadP..
      ...      ...                 ...    ... .                    ...
  [55104]     chrX 155880949-155881831      * | nk_merged_macsBroadP..
  [55105]     chrX 155997337-155997938      * | nk_merged_macsBroadP..
  [55106]     chrX 156016628-156016865      * | nk_merged_macsBroadP..
  [55107]     chrX 156025099-156025486      * | nk_merged_macsBroadP..
  [55108]     chrX 156030206-156030747      * | nk_merged_macsBroadP..
  -------
  seqinfo: 82 sequences from an unspecified genome; no seqlengths

To inspect peak coverage along the chromosomes:

covplot(peaks.gr, chrs=c("chr14", "chr15"))

#to save the image to file
pdf("PeakCoverage.pdf")
covplot(peaks.gr, chrs=c("chr14", "chr15"))
dev.off()

Distribution of ATAC peaks along chromosomes.

Peak Annotation

To annotate peaks with closest genomic features:

bed.annot = annotatePeak(peaks.gr, tssRegion=c(-3000, 3000),TxDb=txdb, annoDb="org.Hs.eg.db")

Let’s inspect the results:

> bed.annot
Annotated peaks generated by ChIPseeker
54970/55108  peaks were annotated
Genomic Annotation Summary:
              Feature   Frequency
9    Promoter (<=1kb) 32.22484992
10   Promoter (1-2kb)  3.95488448
11   Promoter (2-3kb)  3.41095143
4              5' UTR  0.34018556
3              3' UTR  2.07931599
1            1st Exon  1.87011097
7          Other Exon  2.72330362
2          1st Intron 11.74458796
8        Other Intron 21.09150446
6  Downstream (<=300)  0.09459705
5   Distal Intergenic 20.46570857

Over 30% of peaks localise to TSS, as expected in an ATAC-seq experiment.

Let’s see peak annotations:

annot_peaks=as.data.frame(bed.annot)

This is the resulting data frame:

 head(annot_peaks)
  seqnames  start    end width strand              mcols.peakID
1     chr1  10004  10442   439      * nk_merged_macsBroadPeak_1
2     chr1  28945  29419   475      * nk_merged_macsBroadPeak_2
3     chr1 180755 181858  1104      * nk_merged_macsBroadPeak_3
4     chr1 191246 191984   739      * nk_merged_macsBroadPeak_4
5     chr1 778381 779290   910      * nk_merged_macsBroadPeak_5
6     chr1 817270 817490   221      * nk_merged_macsBroadPeak_6
                                                   annotation geneChr geneStart
1                                            Promoter (1-2kb)       1     11869
2                                            Promoter (<=1kb)       1     14404
3                                            Promoter (<=1kb)       1    182696
4 Intron (ENST00000623083.4/ENST00000623083.4, intron 1 of 9)       1    187891
5                                            Promoter (<=1kb)       1    725885
6                                            Promoter (<=1kb)       1    817371
  geneEnd geneLength geneStrand    geneId      transcriptId distanceToTSS
1   14409       2541          1 100287102 ENST00000456328.2         -1427
2   29570      15167          2    653635 ENST00000488147.1           151
3  184174       1479          1 102725121 ENST00000624431.2          -838
4  187958         68          2 102466751 ENST00000612080.1         -3288
5  778626      52742          2 100288069 ENST00000506640.2             0
6  819837       2467          1    400728 ENST00000326734.2             0
          ENSEMBL       SYMBOL                                    GENENAME
1 ENSG00000223972      DDX11L1  DEAD/H-box helicase 11 like 1 (pseudogene)
2 ENSG00000227232       WASH7P           WASP family homolog 7, pseudogene
3 ENSG00000223972     DDX11L17 DEAD/H-box helicase 11 like 17 (pseudogene)
4 ENSG00000278267    MIR6859-1                             microRNA 6859-1
5            <NA> LOC100288069                uncharacterized LOC100288069
6 ENSG00000177757       FAM87B family with sequence similarity 87 member B

It can be saved to a file:

write.table(annot_peaks, "nk_merged_annotated.txt",
        append = FALSE,
        quote = FALSE,
        sep = "\t",
        row.names = FALSE,
        col.names = TRUE,
        fileEncoding = "")

We can also visualise the annotation summary:

pdf("AnnotVis.pdf")
upsetplot(bed.annot, vennpie=TRUE)
dev.off()

Visualisation ofATACpeaks annotations.

Distribution of loci with respect to TSS:

pdf("TSSdist.pdf")
plotDistToTSS(bed.annot, title="Distribution of ATAC-seq peaks loci\nrelative to TSS")
dev.off()

Summary of ATAC-seq peaks relative to TSS.

Functional Analysis

Having obtained annotations to nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating knowledge provided by biological ontologies, e.g. GO (Gene Ontology, Ashburner et al. 2000) and Reactome (Croft et al. 2013).在获得最接近基因的注释后,我们可以进行功能富集分析,通过结合生物本体提供的知识来识别这些基因中的主要生物学主题,例如GO(Gene Ontology,Ashburner等人,2000)和Reactome(Croft等人,2013)。

In this tutorial we use the merged consensus peaks set. This analysis can also be performed on results of differential accessibility / occupancy.在本教程中,我们使用合并的共识峰集。该分析也可以对差异可达性/占用率的结果进行。

Let’s first annotate the peaks with Reactome.让我们首先用Reactome注释峰值。

Reactome pathway enrichment of genes defined as the nearest feature to the peaks: 被定义为最接近峰的特征的基因的反应组途径富集:

#finding enriched Reactome pathways using chromosome 1 and 2 genes as a background
pathway.reac <- enrichPathway(as.data.frame(annot_peaks)$geneId)

#previewing enriched Reactome pathways
head(pathway.reac)

This is the result (we skip column 8, as it is very broad - contains the gene IDs in set):

> colnames(as.data.frame(pathway.reac))
[1] "ID"          "Description" "GeneRatio"   "BgRatio"     "pvalue"
[6] "p.adjust"    "qvalue"      "geneID"      "Count"

> pathway.reac[1:10,c(1:7,9)]
                         ID                    Description GeneRatio   BgRatio
R-HSA-9012999 R-HSA-9012999               RHO GTPase cycle  407/8448 443/10856
R-HSA-9013149 R-HSA-9013149              RAC1 GTPase cycle  176/8448 185/10856
R-HSA-9013148 R-HSA-9013148             CDC42 GTPase cycle  151/8448 159/10856
R-HSA-4420097 R-HSA-4420097           VEGFA-VEGFR2 Pathway   97/8448  99/10856
R-HSA-194138   R-HSA-194138              Signaling by VEGF  105/8448 108/10856
R-HSA-9013404 R-HSA-9013404              RAC2 GTPase cycle   86/8448  88/10856
R-HSA-9013106 R-HSA-9013106              RHOC GTPase cycle   73/8448  74/10856
R-HSA-449147   R-HSA-449147      Signaling by Interleukins  402/8448 462/10856
R-HSA-8980692 R-HSA-8980692              RHOA GTPase cycle  137/8448 147/10856
R-HSA-5683057 R-HSA-5683057 MAPK family signaling cascades  287/8448 325/10856
                    pvalue     p.adjust       qvalue Count
R-HSA-9012999 7.645956e-16 1.161421e-12 8.893455e-13   407
R-HSA-9013149 4.636645e-11 3.521532e-08 2.696575e-08   176
R-HSA-9013148 1.771414e-09 8.969257e-07 6.868112e-07   151
R-HSA-4420097 6.278792e-09 2.384371e-06 1.825807e-06    97
R-HSA-194138  8.026320e-09 2.438396e-06 1.867176e-06   105
R-HSA-9013404 8.084405e-08 2.046702e-05 1.567240e-05    86
R-HSA-9013106 1.807641e-07 3.647964e-05 2.793389e-05    73
R-HSA-449147  1.921245e-07 3.647964e-05 2.793389e-05   402
R-HSA-8980692 3.786856e-07 6.391371e-05 4.894123e-05   137
R-HSA-5683057 5.815485e-07 8.479004e-05 6.492706e-05   287

We can see familar terms which can be connected to sample biology: Signaling by Interleukins, MAPK family signaling cascades.我们可以看到可以与样本生物学相关的熟悉术语:白细胞介素的信号传导,MAPK家族信号级联。

Let’s search for enriched GO terms:

pathway.GO <- enrichGO(as.data.frame(annot_peaks)$geneId, org.Hs.eg.db, ont = "MF")

These results look reasonable as well:

> pathway.GO[1:11,c(1,2,7)]
                   ID                                Description       qvalue
GO:0004674 GO:0004674   protein serine/threonine kinase activity 2.306759e-17
GO:0051020 GO:0051020                             GTPase binding 4.894211e-13
GO:0060090 GO:0060090                 molecular adaptor activity 5.215027e-13
GO:0030695 GO:0030695                  GTPase regulator activity 3.765106e-12
GO:0031267 GO:0031267                       small GTPase binding 8.221800e-12
GO:0030674 GO:0030674     protein-macromolecule adaptor activity 1.046226e-11
GO:0045296 GO:0045296                           cadherin binding 6.869553e-11
GO:0015631 GO:0015631                            tubulin binding 7.472641e-11
GO:0005085 GO:0005085 guanyl-nucleotide exchange factor activity 2.073955e-10
GO:0008017 GO:0008017                        microtubule binding 5.602393e-08
GO:0003712 GO:0003712         transcription coregulator activity 2.294003e-07

Differential Accessibility in ATAC-seq

Learning outcomes

to perform GC-aware normalisation of ATAC-seq data using EDASeq

to detect differentially accessible regions using edgeR

Introduction

In this tutorial we use an R / Bioconductor packages EDAseq and edgeR to perform normalisation and analysis of differential accessibility in ATAC-seq data.

Data & Methods

We will build upon the main lab ATACseq data analysis: - first we will summarise reads to detected peaks using subset data as an example to obtain a counts table; - we will use the counts table encompassing complete data for differential accessibility analysis;

Setting-up

We need access to file nk_merged_peaks.saf which we created in the ATACseq data analysis, part “Merged Peaks” and bam files with alignments.

Assuming we start at analysis:

mkdir counts_table
cd counts_table

ln -s ../peaks/consensus/nk_merged_peaks.saf
ln -s ../../data_proc/* .

Data Summarisation

We can now summarise the reads:

module load subread/2.0.0
featureCounts -p -F SAF -a nk_merged_peaks.saf --fracOverlap 0.2 -o nk_merged_peaks_macs_broad.counts ENCFF363HBZ.chr14.proc.bam ENCFF398QLV.chr14.proc.bam ENCFF828ZPN.chr14.proc.bam ENCFF045OAB.chr14.proc.bam

Let’s take a look inside the counts table using head nk_merged_peaks_macs_broad.counts.

# Program:featureCounts v2.0.0; Command:"featureCounts" "-p" "-F" "SAF" "-a" "nk_merged_peaks.saf" "--fracOverlap" "0.2" "-o" "nk_merged_peaks_macs_broad.counts" "ENCFF363HBZ.chr14.proc.bam" "ENCFF398QLV.chr14.proc.bam" "ENCFF828ZPN.chr14.proc.bam" "ENCFF045OAB.chr14.proc.bam"
Geneid  Chr     Start   End     Strand  Length  ENCFF363HBZ.chr14.proc.bam      ENCFF398QLV.chr14.proc.bam      ENCFF828ZPN.chr14.proc.bam      ENCFF045OAB.chr14.proc.bam
nk_merged_macsBroadPeak_1       chr1    10004   10442   .       439     5       10      15      20
nk_merged_macsBroadPeak_2       chr1    28945   29419   .       475     9       7       4       2
nk_merged_macsBroadPeak_3       chr1    180755  181858  .       1104    7       8       5       2
nk_merged_macsBroadPeak_4       chr1    191246  191984  .       739     0       3       1       0
nk_merged_macsBroadPeak_5       chr1    778381  779290  .       910     78      88      34      30
nk_merged_macsBroadPeak_6       chr1    817270  817490  .       221     0       0       0       0
nk_merged_macsBroadPeak_7       chr1    826976  827974  .       999     63      40      28      23
nk_merged_macsBroadPeak_8       chr1    838077  838576  .       500     0       1       0       0

We should remove the first line, as it can interfere with the way R reads in data:

awk '(NR>1)' nk_merged_peaks_macs_broad.counts > nk_merged_peaks_macs_broad.counts.tsv

Differential Accessibility

We activate R console upon typing R in the terminal.

We begin by loading necessary libraries:

library(edgeR)
library(EDASeq)

library(GenomicAlignments)
library(GenomicFeatures)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(wesanderson)

library(Hmisc)
library(dplyr)

txdb = TxDb.Hsapiens.UCSC.hg38.knownGene

ff = FaFile("/proj/epi2022/atacseq_proc/hg38ucsc/hg38.fa")


# We can read in the data, format it and define experimental groups:
cnt_table = read.table("../counts/nk_merged.macs_broad.counts", sep="\t", header=TRUE, blank.lines.skip=TRUE)
rownames(cnt_table)=cnt_table$Geneid

groups = factor(c(rep("NK",2),rep("NKstim",2)))

#this data frame contains only read counts to peaks on assembled chromosomes
reads.peak = cnt_table[,c(7:10)]


# We now prepare data with GC content of the peak regions for GC-aware normalisation.
gr = GRanges(seqnames=cnt_table$Chr, ranges=IRanges(cnt_table$Start, cnt_table$End), strand="*", mcols=data.frame(peakID=cnt_table$Geneid))

peakSeqs = getSeq(x=ff, gr)

gcContentPeaks = letterFrequency(peakSeqs, "GC",as.prob=TRUE)[,1]

#divide into 20 bins by GC content
gcGroups = Hmisc::cut2(gcContentPeaks, g=20)
mcols(gr)$gc = gcContentPeaks

Figure below shows that the accessibility measure of a particular genomic region is associated with its GC content. However, the slope and shape of the curves may differ between samples, which indicates that GC content effects are sample–specific and can therefore bias between–sample comparisons. 下图显示,特定基因组区域的可及性测量与其GC含量相关。然而,曲线的斜率和形状可能因样品而异,这表明GC含量效应是特定于样品的,因此可能会在样品比较之间产生偏差。

To visualise GC bias in peaks:

lowListGC = list()
for(kk in 1:ncol(reads.peak)){
  set.seed(kk)
  lowListGC[[kk]] = lowess(x=gcContentPeaks, y=log1p(reads.peak[,kk]), f=1/10)
}

names(lowListGC)=colnames(reads.peak)

dfList = list()
for(ss in 1:length(lowListGC)){
  oox = order(lowListGC[[ss]]$x)
  dfList[[ss]] = data.frame(x=lowListGC[[ss]]$x[oox], y=lowListGC[[ss]]$y[oox], sample=names(lowListGC)[[ss]])
}
dfAll = do.call(rbind, dfList)
dfAll$sample = factor(dfAll$sample)

p1.1 = ggplot(dfAll, aes(x=x, y=y, group=sample, color=sample)) +
  geom_line(size = 1) +
  xlab("GC-content") +
  ylab("log(count + 1)") +
  theme_classic()

pdf("GCcontent_peaks.pdf")
## plot just the average GC content
p1.1
dev.off()

Counts vs GC contents in ATAC-seq peaks. We can see that GC content has an effect on counts within the peaks.

We have seen from analyses presented on lecture slides (https://www.biorxiv.org/content/10.1101/2021.01.26.428252v2) that full quantile normalisation (FQ-FQ) implemented in package EDASeq is one of the methods which can mitigate the GC bias in detection of DA regions.

We’ll detect differentially accessible regions using edgeR. We will input the normalised GC content as offset to edgeR. 我们现在使用 edgeR 的统计框架。我们不像往常一样执行内部归一化 (TMM),而是提供由 EDASeq 计算的偏移量。

To calculate the offsets, which correct for library size as well as GC content (full quantile normalisation in both cases): 要计算偏移量,以校正文库大小和GC含量(两种情况下均为完全分位数归一化):

reads.peak=as.matrix(reads.peak)

dataOffset = withinLaneNormalization(reads.peak,y=gcContentPeaks,num.bins=20,which="full",offset=TRUE)
dataOffset = betweenLaneNormalization(reads.peak,which="full",offset=TRUE)

We now use the statistical framework of edgeR. We do not perform the internal normalisation (TMM) as usually, and instead we provide the offsets calculated by EDASeq. 我们现在使用 edgeR 的统计框架。我们不像往常一样执行内部归一化 (TMM),而是提供由 EDASeq 计算的偏移量。

design = model.matrix(~groups)

d = DGEList(counts=reads.peak, group=groups)

keep = filterByExpr(d)

> summary(keep)
        Mode   FALSE    TRUE
logical      21   54743


d=d[keep,,keep.lib.sizes=FALSE]

d$offset = -dataOffset[keep,]
d.eda = estimateGLMCommonDisp(d, design = design)
d.eda = estimateGLMCommonDisp(d, design = design)
fit = glmFit(d.eda, design = design)
lrt.EDASeq = glmLRT(fit, coef = 2)

DA_res=as.data.frame(topTags(lrt.EDASeq, nrow(lrt.EDASeq$table)))

The top DA peaks in stimulated vs non-stimulated NK cells:

> head(DA_res)

                                 logFC   logCPM       LR PValue FDR
nk_merged_macsBroadPeak_29593 7.743577 4.014714 1648.016      0   0
nk_merged_macsBroadPeak_9796  6.501470 4.527986 2801.485      0   0
nk_merged_macsBroadPeak_20351 6.490681 4.934009 3551.762      0   0
nk_merged_macsBroadPeak_12067 6.260759 4.441109 2593.194      0   0
nk_merged_macsBroadPeak_11203 6.165875 4.511952 2684.820      0   0
nk_merged_macsBroadPeak_53036 6.153595 4.023089 1922.240      0   0

Let’s add more peak information:

DA_res$Geneid = rownames(DA_res)
DA.res.coords = left_join(DA_res,cnt_table[1:4],by="Geneid")

We can check how well the GC correction worked:

gcGroups.sub=gcGroups[keep]
dfEdgeR = data.frame(logFC=log(2^lrt.EDASeq$table$logFC), gc=gcGroups.sub)

pedgeR = ggplot(dfEdgeR) +
  aes(x=gc, y=logFC, color=gc) +
  geom_violin() +
  geom_boxplot(width=0.1) +
  scale_color_manual(values=wesanderson::wes_palette("Zissou1", nlevels(gcGroups), "continuous")) +
  geom_abline(intercept = 0, slope = 0, col="black", lty=2) +
  ylim(c(-1,1)) +
  ggtitle("log2FCs in bins by GC content") +
  xlab("GC-content bin") +
  theme_bw()+
  theme(aspect.ratio = 1)+
  theme(axis.text.x = element_text(angle = 45, vjust = .5),
        legend.position = "none",
        axis.title = element_text(size=16))

ggsave(filename="log2FC_vs_GCcontent.pdf",plot=pedgeR ,path=".",device="pdf")

Dependence of log2FC on GC content in ATAC-seq.

It seems that FQ-FQ normalisation did not completely remove the effect of GC content on log2FC in thie dataset. However, these effects are somewhat mitigated, you can compare this plot to one obtained by using the standard TMM normalisation.

The reason why the GC effects are not completely removed in this case may be that the DA analysis is not performed on properly replicated data; we should have at least 3 replicates per condition, and we only have two.

Transcription Factor Footprinting

Learning outcomes

detect transcription factor binding signatures in ATAC-seq data

Figure 1. Overview of TF footprinting in ATAC-seq data.

Introduction

While ATAC-seq can uncover accessible regions where transcription factors (TFs) might bind, reliable identification of specific TF binding sites (TFBS) still relies on chromatin immunoprecipitation methods such as ChIP-seq. ChIP-seq methods require high input cell numbers, are limited to one TF per assay, and are further restricted to TFs for which antibodies are available. Therefore, it remains costly, or even impossible, to study the binding of multiple TFs in one experiment. 虽然 ATAC-seq 可以发现转录因子 (TF) 可能结合的可接近区域,但特定 TF 结合位点 (TFBS) 的可靠识别仍然依赖于染色质免疫沉淀方法,例如 ChIP-seq。 ChIP-seq 方法需要高输入细胞数,每次检测仅限于一个 TF,并且进一步仅限于可使用抗体的 TF。因此,在一项实验中研究多个转录因子的结合仍然是昂贵的,甚至是不可能的。

Similarly to nucleosomes, bound TFs hinder cleavage of DNA, resulting in footprints: defined regions of decreased signal strength within larger regions of high signal (Figure 1.).与核小体类似,结合的 TF 会阻碍 DNA 的裂解,从而产生足迹:在较大的高信号区域内确定信号强度降低的区域(图 1)。

Despite its compelling potential, a number of issues have rendered footprinting of ATAC-seq data cumbersome. It has been described that enzymes used in chromatin accessibility assays (e.g., DNase-I, Tn5) are biased towards certain sequence compositions. If unaccounted for, this impairs the discovery of true TF footprints.尽管其潜力巨大,但一些问题使得 ATAC-seq 数据的足迹变得很麻烦。据描述,染色质可及性测定中使用的酶(例如 DNase-I、Tn5)偏向于某些序列组成。如果下落不明,就会影响真正的 TF 足迹的发现。

In this tutorial we use an R / Bioconductor packages ATACseqQC and MotifDb to detect TF binding signatures in ATAC-seq data. Please note this tutorial is merely an early attempt to determine whether TF bindng sites can be identified in ATAC-seq data, and not a statistical framework for TF footrpinting. It can be used as a QC step to detect expected TF sites in the data rather than for discovery of novel binding patterns.在本教程中,我们使用 R / Bioconductor 包 ATACseqQC 和 MotifDb 来检测 ATAC-seq 数据中的 TF 结合特征。请注意,本教程只是确定是否可以在 ATAC-seq 数据中识别 TF 结合位点的早期尝试,而不是 TF 足迹分析的统计框架。它可以用作 QC 步骤来检测数据中预期的 TF 位点,而不是用于发现新的结合模式。

Data & Methods

We will build upon the lab ATACseq specifc QC and use the same data as for other ATAC-seq labs.我们将以实验室ATACseq特定的QC为基础,并使用与其他ATAC-seq实验室相同的数据。

Setting-up

We need access to bam file with shifted alignments shifted.bam which we created in the ATACseq specifc QC. This file contains alignments shifted +4 bps on the + strand and -5 bps on the - strand, to account for Tn5 transposition.我们需要访问我们在 ATACseq 特定 QC 中创建的具有移位对齐方式的 bam 文件。该文件包含 + 链上移位 +4 bps 和 - 链上移位 -5 bps 的对齐,以考虑 Tn5 转置。

Assuming we start at analysis:

mkdir TF_footprnt
cd TF_footprnt

ln -s ../QC/splitBam/shifted.bam .
ln -s ../QC/splitBam/shifted.bam.bai .

module load R_packages/4.1.1

Please check first that file shifted.bam exists in this location: ls ../QC/splitBam/shifted.bam. If it does, the output of this command is the path; if it does not you get “file does not exist”. Depending on the directory structure, you may need to link the file like this:請先檢查文件shifted.bam是否存在於這個位置:ls▁../QC/splitBam/shifted.bam。如果存在,此命令的輸出是路徑;如果不存在,則會得到「file▁does▁not▁exist」。根據目錄結構,您可能需要像這樣鏈接文件:

ln -s ../../QC/splitBam/shifted.bam .
ln -s ../../QC/splitBam/shifted.bam.bai .

Detection of TF Binding Signatures

We begin by loading necessary libraries:

library(ATACseqQC)
library(MotifDb)

library(BSgenome.Hsapiens.UCSC.hg38)
genome <- Hsapiens

Let’s first check signatures for a general TF CTCF. This is its motif as position weight matrix (PWM):

CTCF <- query(MotifDb, c("CTCF"))
CTCF <- as.list(CTCF)
print(CTCF[[1]], digits=2)

We now summarise the signal in the vicinity of CTCF motifs (100 bps up- and down-stream):

ctcf <- factorFootprints(shifted.bamFile, pfm=CTCF[[1]],
                         genome=genome,
                         min.score="90%", seqlev=seqlev,
                         upstream=100, downstream=100)

This function outputs signal mean values of coverage for positive strand and negative strand in feature regions, and other information which you can inspect using str(ctcf):此函数输出特征区域中正链和负链覆盖率的信号平均值,以及您可以使用str(ctcf)检查的其他信息: - spearman.correlation spearman correlations of cleavage counts in the highest 10-nucleotide-window斯皮尔曼相关性 最高10核苷酸窗口中切割计数的斯皮尔曼相关性 - bindingSites - GRanges object with detected bindng sites绑定站点 - 具有检测到的绑定站点的 GRanges 对象 - Profile.segmentation配置文件.分段

Let’s inspect the statistics ctcf$spearman.correlation:让我们检查一下统计数据ctcf$spearman.correlation:

> ctcf$spearman.correlation
$`+`

        Spearman's rank correlation rho

data:  predictedBindingSiteScore and highest.sig.windows
S = 3761558200, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.2653951


$`-`

        Spearman's rank correlation rho

data:  predictedBindingSiteScore and highest.sig.windows
S = 3801340300, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.2576259

The plot produced by this function is of signal mean values of coverage for positive strand and negative strand in feature regions.

pdf("ctcf_footprnt.pdf")
sigs <- factorFootprints(shifted.bamFile, pfm=CTCF[[1]],
                         genome=genome,
                         min.score="90%", seqlev=seqlev,
                         upstream=100, downstream=100)
dev.off()

We can generate similar plots for other TFs.

RFX5

RFX5 <- query(MotifDb, c("RFX5"))
RFX5 <- as.list(RFX5)


rfx5 <- factorFootprints(shifted.bamFile, pfm=RFX5[[1]],
                         genome=genome,
                         min.score="90%", seqlev=seqlev,
                         upstream=100, downstream=100)

 rfx5$spearman.correlation$`+`$estimate
 rfx5$spearman.correlation$`+`$p.value


pdf("rfx5_footprnt.pdf")
rfx5 <- factorFootprints(shifted.bamFile, pfm=RFX5[[1]],
                         genome=genome,
                         min.score="90%", seqlev=seqlev,
                         upstream=100, downstream=100)
dev.off()

STAT3

STAT3 <- query(MotifDb, c("STAT3"))
STAT3 <- as.list(STAT3)


stat3 <- factorFootprints(shifted.bamFile, pfm=STAT3[[1]],
                                 genome=genome,
                                 min.score="90%", seqlev=seqlev,
                                 upstream=100, downstream=100)

stat3$spearman.correlation$`+`$estimate
stat3$spearman.correlation$`+`$p.value

stat3$Profile.segmentation

pdf("stat3_footprnt.pdf")
stat3 <- factorFootprints(shifted.bamFile, pfm=STAT3[[1]],
                                 genome=genome,
                                 min.score="90%", seqlev=seqlev,
                                 upstream=100, downstream=100)
dev.off()

Which factors show evidence of binding enrichment in this data set?

Figure 2. Examples of TF footprints.

CTCF RFX5 STAT3

image source: https://doi.org/10.1038/s41467-020-18035-1 (Figure 1.)

⭐️scATAC-seq