Question

Capping coverage in bam file

10

Entering edit mode

9.7 years ago

Danielk ▴ 640

Colleagues,

I have some BAM files with excessive coverage in certain regions that I want to cap to a given coverage. In regions with lower coverage, I do not want the coverage to change. Here's an illustration of what I want:

I want read pairs to be intact in the output file.

Does anyone have a solution to this? I want to output a BAM file for other tools do use. Google has not been my friend.

(note: GATK PrintReads using -dcov looked promising until a realised that "For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position" from https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--downsample_to_coverage. Close but no CIGAR (pun intended)).

downsample bam coverage • 12k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Danielk ▴ 640

1

Entering edit mode

You can achieve a very similar effect by normalizing the fastq data prior to mapping. This randomly discards reads with high kmer coverage to achieve a specified limit. For example, with BBNorm:

bbnorm.sh in=reads.fq out=normalized.fq min=0 target=100

This will flatten peaks with over 100x coverage, resulting in capped coverage as in your diagram. The output will start to diverge from what you would get by reducing coverage based on actual mapping results if you have a highly repetitive genome, but on a mammalian or bacterial genomes it should be very similar. As a result, it's particularly useful if you don't have a reference, or if you have too much data to map - normalization is faster than mapping to a large reference.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Brian Bushnell 20k

0

Entering edit mode

We cannot see the picture, please use something else than dropbox(?), e.g., http://imgur.com/

Why is "excessive coverage" a problem?

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks for the heads up with the image - it now lives on imgur.

Excessive coverage in certain regions causes some downstream tools to be slow and increase RAM usage. I'd like to fix this by capping the coverage in the file, rather than downsampling, since I'm happy with the coverage in most parts of the genome.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.7 years ago by Danielk ▴ 640

0

Entering edit mode

Do you have a BED with the regions of interest?

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I guess I could generate a BED of the regions with high coverage for each BAM if that helps. So, sure-ish. :)

How does it help?

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.7 years ago by Danielk ▴ 640

1

Entering edit mode

9.7 years ago

Christian ★ 3.1k

You could also use "digital normalization" to address this problem https://arxiv.org/abs/1203.4802.

ADD COMMENT • link updated 8.3 years ago by Ian 6.1k • written 9.7 years ago by Christian ★ 3.1k

Ram · Accepted Answer · 2015-08-12

8

Entering edit mode

9.7 years ago

Pierre Lindenbaum 166k

I wrote two tools:

The first sorts a BAM on REF/read-name https://github.com/lindenb/jvarkit/wiki/SortSamRefName
The second reads the BAM above. For each set of reads having the same name, it checks if after adding those reads, the depth would be lower than the expected depth. If true, those reads are added to the BAM. Edit: doc is https://github.com/lindenb/jvarkit/wiki/Biostar154220

$ java -jar dist/sortsamrefname.jar -T bam /input.bam  |\
  java -jar dist/biostar154220.jar  -n 20 -T bam |\
  samtools sort - output

$ samtools mpileup output.bam  | cut -f 4 | sort | uniq -c
  12692 0
 596893 1
  94956 10
  56715 11
  76947 12
  57912 13
  66585 14
  51961 15
  63184 16
  47360 17
  65189 18
  65014 19
 364524 2
 169064 20
  72078 3
 118288 4
  54802 5
  82555 6
  53175 7
  78474 8
  54052 9

~~(what are those 21 and 22 ? I suppose it's because I don't consider supplementary, secondary, duplicate reads )~~ (bug fixed). Anyway, can you tell me if that's what you need?

Pierre

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Brilliant! I'll give it a spin with some real data and get back to you.

ADD REPLY • link 9.7 years ago by Danielk ▴ 640

1

Entering edit mode

small update: reads with MAPQ=0 are ignored now. https://github.com/lindenb/jvarkit/commit/484ae4fc0e3333893aa6f343db8774f0ede40dda#diff-98f5c184931195eb78ac6dbdaa58f976R165

ADD REPLY • link 9.7 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

This works great. Here's a region where the bam in the lower panel was capped to 200x. (It's a bit hard to see, but the range in the upper panel is 0-2000, and the lower 0-200).

An interesting effect in that in the two slightly lower peaks in input.bam, the original coverage is 1000+, but in output it's not 200, but 160 in the left peak, and around 90 in the right peak (rather than the set cap of 200x). I'm assuming that the reason is related to the read pairs insert size (here around 450), since they are added as pairs. I've read the source but haven't been able to understand the implications of the algorithm completely.

input.bam: https://dl.dropboxusercontent.com/u/1769569/input.bam
output.bam: https://dl.dropboxusercontent.com/u/1769569/output.bam

Commands:

~/bin/jvarkit/dist-1.133/sortsamrefname -T bam /proj/b2010040/input.bam |\
  ~/bin/jvarkit/dist-1.133/biostar154220 -n 200 -T bam |\
  ~/bin/autoseq/tools/bin/samtools sort /dev/stdin /proj/b2010040/output

Edit: For future reference, input.bam contains reads mapped to M on GRCh37.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Danielk ▴ 640

1

Entering edit mode

Thanks for letting me know that it works ( i want to be the first author of any paper ! :-P )

ADD REPLY • link 9.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hey! You have amazing tools, thank you! But I'm struggling to do this exactly as you say:

$ java -jar ~/Programs/jvarkit/dist/sortsamrefname.jar -T bam /input.bam

[SEVERE][Launcher]There was an error in the command line arguments. Please check your parameters : Expected one or zero argument but got 3 : [-T, bam, /input.bam]

Should be something basic! Cheers,

ADD REPLY • link 6.5 years ago by ricardoguerreiro2121 ▴ 80

1

Entering edit mode

the tool has changed since this post: http://lindenb.github.io/jvarkit/SortSamRefName.html

try:

 java -jar ~/Programs/jvarkit/dist/sortsamrefname.jar  input.bam

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

So fast, thanks! It was basic, I was tired. Now I have a more relevant question. Do I have to create a sequence diccionary for this to work?

Ignoring SAM validation error due to lenient parsing: Error parsing text SAM file. RG ID on SAMRecord not found in header: 10x; File tmp.bam; Line 6 Line: J00137:124:HV5G7BBXX:4:1121:15016:43902 147 1 5 0 120S27M4S = 5 -27 GGTGGTGGGTGGTTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTTGTGTTTAGATGTGGCGGTGTGTTTATGCTCTGGGTCACATTTAAAGGTCAAATGTGACCCAGAGCATAAA 7)--)))-----F-JJ7FA-JJJFFJAJJJJFFFFJAAJJJFJJJJFJJFFFJJJJJJJ77JAJFJJF7AAJJJFJFAFAAA MD:Z:27 PG:Z:MarkDuplicates RG:Z:10x NM:i:0 AS:i:27 XS:i:27 [SEVERE][SortSamRefName]Reference index for '1' not found in sequence dictionary. java.lang.IllegalArgumentException: Reference index for '1' not found in sequence dictionary.

ADD REPLY • link 6.5 years ago by ricardoguerreiro2121 ▴ 80

1

Entering edit mode

I imagine you're working with a bam without header. Most/All mapping tools include a sequence dictionary. This read is mapped on contig '1' but there is no '1' in the header.

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Once again simple, merci bien!

Keep up the helpfulness, it really makes a difference for starting researchers like me.

ADD REPLY • link 6.5 years ago by ricardoguerreiro2121 ▴ 80