Capping coverage in bam file
2
10
Entering edit mode
9.3 years ago
Danielk ▴ 640

Colleagues,

I have some BAM files with excessive coverage in certain regions that I want to cap to a given coverage. In regions with lower coverage, I do not want the coverage to change. Here's an illustration of what I want:

I want read pairs to be intact in the output file.

Does anyone have a solution to this? I want to output a BAM file for other tools do use. Google has not been my friend.

(note: GATK PrintReads using -dcov looked promising until a realised that "For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position" from https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--downsample_to_coverage. Close but no CIGAR (pun intended)).

downsample bam coverage • 12k views
ADD COMMENT
1
Entering edit mode

You can achieve a very similar effect by normalizing the fastq data prior to mapping. This randomly discards reads with high kmer coverage to achieve a specified limit. For example, with BBNorm:

bbnorm.sh in=reads.fq out=normalized.fq min=0 target=100

This will flatten peaks with over 100x coverage, resulting in capped coverage as in your diagram. The output will start to diverge from what you would get by reducing coverage based on actual mapping results if you have a highly repetitive genome, but on a mammalian or bacterial genomes it should be very similar. As a result, it's particularly useful if you don't have a reference, or if you have too much data to map - normalization is faster than mapping to a large reference.

ADD REPLY
0
Entering edit mode

We cannot see the picture, please use something else than dropbox(?), e.g., http://imgur.com/

Why is "excessive coverage" a problem?

ADD REPLY
0
Entering edit mode

Thanks for the heads up with the image - it now lives on imgur.

Excessive coverage in certain regions causes some downstream tools to be slow and increase RAM usage. I'd like to fix this by capping the coverage in the file, rather than downsampling, since I'm happy with the coverage in most parts of the genome.

ADD REPLY
0
Entering edit mode

Do you have a BED with the regions of interest?

ADD REPLY
0
Entering edit mode

I guess I could generate a BED of the regions with high coverage for each BAM if that helps. So, sure-ish. :)

How does it help?

ADD REPLY
8
Entering edit mode
9.3 years ago

I wrote two tools:

$ java -jar dist/sortsamrefname.jar -T bam /input.bam  |\
  java -jar dist/biostar154220.jar  -n 20 -T bam |\
  samtools sort - output
$ samtools mpileup output.bam  | cut -f 4 | sort | uniq -c
  12692 0
 596893 1
  94956 10
  56715 11
  76947 12
  57912 13
  66585 14
  51961 15
  63184 16
  47360 17
  65189 18
  65014 19
 364524 2
 169064 20
  72078 3
 118288 4
  54802 5
  82555 6
  53175 7
  78474 8
  54052 9

(what are those 21 and 22 ? I suppose it's because I don't consider supplementary, secondary, duplicate reads ) (bug fixed). Anyway, can you tell me if that's what you need?

Pierre

ADD COMMENT
0
Entering edit mode

Brilliant! I'll give it a spin with some real data and get back to you.

ADD REPLY
1
Entering edit mode

This works great. Here's a region where the bam in the lower panel was capped to 200x. (It's a bit hard to see, but the range in the upper panel is 0-2000, and the lower 0-200).

An interesting effect in that in the two slightly lower peaks in input.bam, the original coverage is 1000+, but in output it's not 200, but 160 in the left peak, and around 90 in the right peak (rather than the set cap of 200x). I'm assuming that the reason is related to the read pairs insert size (here around 450), since they are added as pairs. I've read the source but haven't been able to understand the implications of the algorithm completely.

Commands:

~/bin/jvarkit/dist-1.133/sortsamrefname -T bam /proj/b2010040/input.bam |\
  ~/bin/jvarkit/dist-1.133/biostar154220 -n 200 -T bam |\
  ~/bin/autoseq/tools/bin/samtools sort /dev/stdin /proj/b2010040/output

Edit: For future reference, input.bam contains reads mapped to M on GRCh37.

ADD REPLY
1
Entering edit mode

Thanks for letting me know that it works ( i want to be the first author of any paper ! :-P )

ADD REPLY
0
Entering edit mode

Hey! You have amazing tools, thank you! But I'm struggling to do this exactly as you say:

$ java -jar ~/Programs/jvarkit/dist/sortsamrefname.jar -T bam /input.bam

[SEVERE][Launcher]There was an error in the command line arguments. Please check your parameters : Expected one or zero argument but got 3 : [-T, bam, /input.bam]

Should be something basic! Cheers,

ADD REPLY
1
Entering edit mode

the tool has changed since this post: http://lindenb.github.io/jvarkit/SortSamRefName.html

try:

 java -jar ~/Programs/jvarkit/dist/sortsamrefname.jar  input.bam
ADD REPLY
0
Entering edit mode

So fast, thanks! It was basic, I was tired. Now I have a more relevant question. Do I have to create a sequence diccionary for this to work?

Ignoring SAM validation error due to lenient parsing: Error parsing text SAM file. RG ID on SAMRecord not found in header: 10x; File tmp.bam; Line 6 Line: J00137:124:HV5G7BBXX:4:1121:15016:43902 147 1 5 0 120S27M4S = 5 -27 GGTGGTGGGTGGTTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTTGTGTTTAGATGTGGCGGTGTGTTTATGCTCTGGGTCACATTTAAAGGTCAAATGTGACCCAGAGCATAAA 7)--)))-----F-JJ7FA-JJJFFJAJJJJFFFFJAAJJJFJJJJFJJFFFJJJJJJJ77JAJFJJF7AAJJJFJFAFAAA MD:Z:27 PG:Z:MarkDuplicates RG:Z:10x NM:i:0 AS:i:27 XS:i:27 [SEVERE][SortSamRefName]Reference index for '1' not found in sequence dictionary. java.lang.IllegalArgumentException: Reference index for '1' not found in sequence dictionary.

ADD REPLY
1
Entering edit mode

I imagine you're working with a bam without header. Most/All mapping tools include a sequence dictionary. This read is mapped on contig '1' but there is no '1' in the header.

ADD REPLY
0
Entering edit mode

Once again simple, merci bien!

Keep up the helpfulness, it really makes a difference for starting researchers like me.

ADD REPLY
1
Entering edit mode
9.3 years ago
Christian ★ 3.1k

You could also use "digital normalization" to address this problem https://arxiv.org/abs/1203.4802.

ADD COMMENT

Login before adding your answer.

Traffic: 2592 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6