Entering edit mode
10.3 years ago
ifreecell
▴
220
Hi there,
I have a lot of duplicate reads in my RNA-seq, which is great from my view. My question is is there any way to get the coordinates of these duplicate reads (> 10 reads) from a BAM/SAM file? It will be perfect to be able to annotate these regions with overlapping or flanking genes and output the results in a bed format. Any suggestions are appreciated, but solutions using R packages are more preferred.
Here are some artificial example datasets (might be deleted after one year) to be tested.
If you are only interested in retrieving list of genes which have lots of duplicate reads then just compare the read count files generated using bam file before and after removal or flagging of duplicate reads. Read count softwares like HTseq will consider overlapping reads towards the counts. You can then compare these two count files and extract genes that show reduction of more than 10 counts between these two files. Retrieving the intergenic regions that show duplication is not that straightforward but can be accomplished using bedtools.
What have you tried? It'd be possible to do this in any language with convenient access to BAM files (C, python, R, java, etc.).
Could you just give some clue?
The general structure would be to read in a coordinate sorted BAM file and then create a buffer in a language of your choosing to hold one alignment. You'd then iterate through the alignments checking if a given alignment matches that in the buffer. If it does, increment a counter. If the alignment doesn't match that in the buffer, then look at the counter and see if there are enough duplicates to note the location. This would then be repeated. This is easy enough with single-end reads, but can be a bit more complicated with paired-end reads, depending on how you want to define duplicates.