I have some BAM files with excessive coverage in certain regions that I want to cap to a given coverage. In regions with lower coverage, I do not want the coverage to change. Here's an illustration of what I want:
I want read pairs to be intact in the output file.
Does anyone have a solution to this? I want to output a BAM file for other tools do use. Google has not been my friend.
You can achieve a very similar effect by normalizing the fastq data prior to mapping. This randomly discards reads with high kmer coverage to achieve a specified limit. For example, with BBNorm:
This will flatten peaks with over 100x coverage, resulting in capped coverage as in your diagram. The output will start to diverge from what you would get by reducing coverage based on actual mapping results if you have a highly repetitive genome, but on a mammalian or bacterial genomes it should be very similar. As a result, it's particularly useful if you don't have a reference, or if you have too much data to map - normalization is faster than mapping to a large reference.
Thanks for the heads up with the image - it now lives on imgur.
Excessive coverage in certain regions causes some downstream tools to be slow and increase RAM usage. I'd like to fix this by capping the coverage in the file, rather than downsampling, since I'm happy with the coverage in most parts of the genome.
ADD REPLY
• link
updated 5.1 years ago by
Ram
44k
•
written 9.4 years ago by
Danielk
▴
640
The second reads the BAM above. For each set of reads having the same name, it checks if after adding those reads, the depth would be lower than the expected depth. If true, those reads are added to the BAM. Edit: doc is https://github.com/lindenb/jvarkit/wiki/Biostar154220
(what are those 21 and 22 ? I suppose it's because I don't consider supplementary, secondary, duplicate reads )(bug fixed). Anyway, can you tell me if that's what you need?
This works great. Here's a region where the bam in the lower panel was capped to 200x. (It's a bit hard to see, but the range in the upper panel is 0-2000, and the lower 0-200).
An interesting effect in that in the two slightly lower peaks in input.bam, the original coverage is 1000+, but in output it's not 200, but 160 in the left peak, and around 90 in the right peak (rather than the set cap of 200x). I'm assuming that the reason is related to the read pairs insert size (here around 450), since they are added as pairs. I've read the source but haven't been able to understand the implications of the algorithm completely.
[SEVERE][Launcher]There was an error in the command line arguments. Please check your parameters : Expected one or zero argument but got 3 : [-T, bam, /input.bam]
So fast, thanks! It was basic, I was tired. Now I have a more relevant question. Do I have to create a sequence diccionary for this to work?
Ignoring SAM validation error due to lenient parsing:
Error parsing text SAM file. RG ID on SAMRecord not found in header: 10x; File tmp.bam; Line 6
Line: J00137:124:HV5G7BBXX:4:1121:15016:43902 147 1 5 0 120S27M4S = 5 -27 GGTGGTGGGTGGTTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTTGTGTTTAGATGTGGCGGTGTGTTTATGCTCTGGGTCACATTTAAAGGTCAAATGTGACCCAGAGCATAAA 7)--)))-----F-JJ7FA-JJJFFJAJJJJFFFFJAAJJJFJJJJFJJFFFJJJJJJJ77JAJFJJF7AAJJJFJFAFAAA MD:Z:27 PG:Z:MarkDuplicates RG:Z:10x NM:i:0 AS:i:27 XS:i:27
[SEVERE][SortSamRefName]Reference index for '1' not found in sequence dictionary.
java.lang.IllegalArgumentException: Reference index for '1' not found in sequence dictionary.
I imagine you're working with a bam without header. Most/All mapping tools include a sequence dictionary. This read is mapped on contig '1' but there is no '1' in the header.
You can achieve a very similar effect by normalizing the fastq data prior to mapping. This randomly discards reads with high kmer coverage to achieve a specified limit. For example, with BBNorm:
This will flatten peaks with over 100x coverage, resulting in capped coverage as in your diagram. The output will start to diverge from what you would get by reducing coverage based on actual mapping results if you have a highly repetitive genome, but on a mammalian or bacterial genomes it should be very similar. As a result, it's particularly useful if you don't have a reference, or if you have too much data to map - normalization is faster than mapping to a large reference.
We cannot see the picture, please use something else than dropbox(?), e.g., http://imgur.com/
Why is "excessive coverage" a problem?
Thanks for the heads up with the image - it now lives on imgur.
Excessive coverage in certain regions causes some downstream tools to be slow and increase RAM usage. I'd like to fix this by capping the coverage in the file, rather than downsampling, since I'm happy with the coverage in most parts of the genome.
Do you have a BED with the regions of interest?
I guess I could generate a BED of the regions with high coverage for each BAM if that helps. So, sure-ish. :)
How does it help?