Question

Counting single-cell ATAC fragments at peaks on BED file

1

Entering edit mode

4.4 years ago

Nico ▴ 10

I'm currently trying to calculate the number of single-cell ATAC fragments that lie at peaks defined by a bed file.

I have a tsv file containing fragments with the following columns: chromosome, start, end, cell barcode. This file is around 10gb in size (around 250 million rows).

I also have another tsv file containing peaks with the following columns: chromosome, start, end.

What I want to do is: 1. Assign fragment to each peak on bed file 2. Output a data frame containing cell barcode, region where the fragments are assigned (chr, start, end), and also the number of fragments assigned to this region.

Here is what I tried, however R ran out of memory (I am running a desktop with 64gb memory and an i7 9700k). It also seems very slow. Maybe R is not suited for this kind of task, but currently I am most familiar with R.

In summary what I did was: loop over fragments and filter the bed file if the fragment lies in between the region of the bed file, and then append another column to the fragments so that I have the region identifier, and finally count the unique identifier for each cell.

fragments$region <- "none"

for(i in 1:nrow(fragments)){

chr_fragment <- fragments[i,] %>% pull(chr)
start_fragment <- fragments[i,] %>% pull(start)
end_fragment <- fragments[i,] %>% pull(end)

# region containing start of fragment 
extracted_region <- bed_file %>% filter(chr == chr_fragment) %>% filter(start <= start_fragment & start_fragment <= 
end) %>% paste(collapse = "_") 

extracted_region <- paste0("chr", extracted_region)

# append to initial fragments tsv file
fragments[i,5] <- extracted_region

paste0("finished assigning ", i, "out of ", nrow(fragments), " fragments") %>% print()
}


fragments <- rename(fragments, region = 5)

# remove fragments not in peak
fragments <- fragments %>% filter(region != "none")

# count number of fragments in region per cell
fragments <- fragments %>% mutate(dplyr::count(fragments, barcode, region))

Is it possible to manipulate this in a way that it won't take too much memory? Or is it impossible to achieve using R? Or if there is any tool that can do this (I tried to look but did not find any that suits this). It also seems to take awfully long time to perform this loop, which I don't even think will finish in weeks. Any help is greatly appreciated!

R atac-seq • 5.4k views

ADD COMMENT • link updated 4.4 years ago by GouthamAtla 12k • written 4.4 years ago by Nico ▴ 10

0

Entering edit mode

Why can't you just use your bed file with your alignment files (BAMs) by using featureCounts software to count the number of fragments in each peak region?

ADD REPLY • link 4.4 years ago by tiago211287 ★ 1.5k

0

Entering edit mode

In scATAC, there is no bam file per cell. Its barcode based. FeatureCounts needs individual bam files per cell.

ADD REPLY • link 4.4 years ago by GouthamAtla 12k

score 6 · Accepted Answer · 2020-07-07

6

Entering edit mode

4.4 years ago

GouthamAtla 12k

I am not sure if you have looked up for any tools for scATAC analysis. Signac, part of Seurat framework is developed for chromatin data set analysis, especially scATAC.

The FeatureMatrix function does exactly what you are looking for.

FeatureMatrix(
  fragments,
  features,
  cells = NULL,
  chunk = 50,
  sep = c("-", "-"),
  verbose = TRUE
)

fragments are your fragment file and features are your peaks as a GRanges object.

ADD COMMENT • link 4.4 years ago by GouthamAtla 12k

0

Entering edit mode

Hi, thanks a lot, that is exactly what I was looking for. I actually used Signac before but was unaware of this function. I just did not know what and where to search for this.

ADD REPLY • link 4.4 years ago by Nico ▴ 10

0

Entering edit mode

If an answer was helpful to you, please indicate that by upvoting it

ADD REPLY • link 4.4 years ago by Friederike 9.0k

0

Entering edit mode

Accept the answer so that it wont bump in the future as unanswered.

ADD REPLY • link 4.4 years ago by GouthamAtla 12k