I'm currently trying to calculate the number of single-cell ATAC fragments that lie at peaks defined by a bed file.
I have a tsv file containing fragments with the following columns: chromosome, start, end, cell barcode. This file is around 10gb in size (around 250 million rows).
I also have another tsv file containing peaks with the following columns: chromosome, start, end.
What I want to do is: 1. Assign fragment to each peak on bed file 2. Output a data frame containing cell barcode, region where the fragments are assigned (chr, start, end), and also the number of fragments assigned to this region.
Here is what I tried, however R ran out of memory (I am running a desktop with 64gb memory and an i7 9700k). It also seems very slow. Maybe R is not suited for this kind of task, but currently I am most familiar with R.
In summary what I did was: loop over fragments and filter the bed file if the fragment lies in between the region of the bed file, and then append another column to the fragments so that I have the region identifier, and finally count the unique identifier for each cell.
fragments$region <- "none"
for(i in 1:nrow(fragments)){
chr_fragment <- fragments[i,] %>% pull(chr)
start_fragment <- fragments[i,] %>% pull(start)
end_fragment <- fragments[i,] %>% pull(end)
# region containing start of fragment
extracted_region <- bed_file %>% filter(chr == chr_fragment) %>% filter(start <= start_fragment & start_fragment <=
end) %>% paste(collapse = "_")
extracted_region <- paste0("chr", extracted_region)
# append to initial fragments tsv file
fragments[i,5] <- extracted_region
paste0("finished assigning ", i, "out of ", nrow(fragments), " fragments") %>% print()
}
fragments <- rename(fragments, region = 5)
# remove fragments not in peak
fragments <- fragments %>% filter(region != "none")
# count number of fragments in region per cell
fragments <- fragments %>% mutate(dplyr::count(fragments, barcode, region))
Is it possible to manipulate this in a way that it won't take too much memory? Or is it impossible to achieve using R? Or if there is any tool that can do this (I tried to look but did not find any that suits this). It also seems to take awfully long time to perform this loop, which I don't even think will finish in weeks. Any help is greatly appreciated!
Why can't you just use your bed file with your alignment files (BAMs) by using featureCounts software to count the number of fragments in each peak region?
In scATAC, there is no bam file per cell. Its barcode based. FeatureCounts needs individual bam files per cell.