Hidden reads in IGV
0
0
Entering edit mode
6.0 years ago

Dear all,

I have aligned reads to a reference genome with BWA MEM and removed the duplicated reads with SAMBAMBA. I then used samtools to get a coverage distribution of the reads and used a script in R to cluster the coverage. The samtools command was:

samtools depth <aligned_deduplicated.bam> > <file.tsv>

which gives a column with the genomic position and one with the related coverage. By pooling the data with R, I got, for example for the stretch 367,853,177 to 367,853,698, an average of 124,391 reads per base. If I plot this region on IGV though, I get this: enter image description here

The threshold values for the quality of alignment and mapping are both set to 50, but the figure does not change if I use 0, 1 or 30.

Am I right in assuming that the problem is about the quality of the reads provided by samtools? That is: samtools depth provides all the read coverage for a given position regardless of the quality (both alignment and mapping) whereas IGV filters this information? Maybe the reads are hard clipped, that is why are not shown -- but then, how to show them?

And how can I then used samtools depth with an associated quality value? That is, how can I add a third column to the depth's output that reports the alignment (or mapping or better both) score?

Thank you

mapping IGV visualization read quality • 6.5k views
ADD COMMENT
1
Entering edit mode

See if Alignment or Filtering Prefs help https://software.broadinstitute.org/software/igv/Preferences

ADD REPLY
0
Entering edit mode

Could you separate the reads in the problematic region and load them on IGV to dissect the problem?

samtools view -h my.bam chr1:10000-100000 |samtools view -bS - > region.bam

and check that 1) This contains as many reads as you expect 2) Load only this "small" bam on IGV to check if it shows there 3) If nothing explains, check the alignment inside the BAMfile and post a some lines which are in BAM but not reported in IGV.

ADD REPLY
1
Entering edit mode

Hello,

first of all have a look at one specific position in the output file of samtools depth and compare this position in igv with setting mapping quality and base quality threshold to 0 in the preferences. In the same preference tab you should also make sure that duplicates, supplementary alignment and other things are shown.

If you start samtools depth without any parameter you will get the help page:

$ samtools depth

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
   -a                  output all positions (including zero depth)
   -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
   -b <bed>            list of positions or regions
   -f <list>           list of input BAM filenames, one per line [null]
   -l <int>            read length threshold (ignore reads shorter than <int>) [0]
   -d/-m <int>         maximum coverage depth [8000]. If 0, depth is set to the maximum
                       integer value, effectively removing any depth limit.
   -q <int>            base quality threshold [0]
   -Q <int>            mapping quality threshold [0]
   -r <chr:from-to>    region
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name,
position, and coverage depth.  Note that positions with zero coverage may be
omitted by default; see the -a option.

As you can see there option to include mapping and base quality into calculation.

fin swimmer

ADD REPLY
0
Entering edit mode

Thank you, by using the -Q flag I was able to remove many of the records previously found; for instance, there is nothing in the region 367,853,177 to 367,853,698 with a quality of mapping of 50. However, the depth file still has two columns only. Would be possible to create a file with position, coverage and quality? I set the thresholds to 0 and removed the filter for the duplications, but still I could not see the 124 000 reads in position 367,853,177 to 367,853,698. However, in position 46480796 I can see the reported 2 reads of coverage: enter image description here and at position 278016720 a more realistic coverage of 8000 that this time DOES show in IGV. enter image description here The answer must be therefore that IGV DOES filter the reads. What would be a good threshold for mapping? 30? 50? more?

ADD REPLY
0
Entering edit mode

Hello again,

could you please make a screenshot of the preferences in the alignment tab?

Have you checked what samtools depth reported for this position? Looking there we can bypass any problems with your R script. 124 000 reads in one position would be a very, very high depth. Is this value possible in your data?

Instead of asking the question "What mean mapping quality do I have on a given position?" you should asked "How many high quality data do I have on a given position"? In often used cut-off for mapping quality is 20 and for base quality 13. So you could run samtools depth -a -Q 20 -q 13 input.bam > depth.csv .

fin swimmer

ADD REPLY
0
Entering edit mode

sure, here it is: enter image description here To note that I set the quality thresholds to 0 and filter duplicates is off. IS there a paper with a consensus quality threshold reported? Thanks

ADD REPLY
0
Entering edit mode

Ok, the settings looks good.

Please also answer my answers questions:

Have you checked what samtools depth reported for this position? Looking there we can bypass any problems with your R script. 124 000 reads in one position would be a very, very high depth. Is this value possible in your data?

fin swimmer

ADD REPLY
0
Entering edit mode

There was indeed a flaw in the R script I wrote, it was not flushing data properly. Now everything looks good. I am filtering the reads by quality and the coverage is again in more reasonable ranges. Tx

ADD REPLY
0
Entering edit mode

You cannot add a third column to samtools depth it is just not what it was designed to do.

I will also say that for high coverage data (like the one you have with a coverage of 124,000x) that also may contain a reads with multiple alignments, duplicates, secondary and supplementary alignments and all kinds of manner of qualities I have never been able to reconcile the three methods that I could use to compute depth:

samtools depth
bedtools genomecov

and visual inspection with IGV.

The three methods will almost never produce the exact same values for depths, sometimes disagree enormously. Coping strategies include filtering your SAM file so that the tools do not have to apply additional corrections. Out of all methods IGV applies the most number of "tacit" assumptions that one may not even normally think about.

Example: Suppose a base that is "covered" by a deletion in the read. Should one increment the coverage for that coordinate?

ADD REPLY
3
Entering edit mode

The difference between the tools that calculate "read depth" or "coverage" are results of what is count and how is count.

Here's a little example and a comparison between bedtools genomecov, samtools depth and picard CollectHsMetric with the PER_BASE_COVERAGE parameter.

Let's assume these 4 reads:

POS 0123456789
F1  ------>
R1     <------ 
F2  ------>
R2     <------

F1 and R1 are a read pair. F2 and R2 are a read paire and marked as duplicates of F1 and R1.

bedtools genomecov transfer the read coordinates to bed intervals and doesn't care about any relation information we have. It than just counts the overlap on each position. So the result will be a read depth of 2 for the positions 0-2 and 7-9 and a read depth of 4 for the positions 3-6.

samtools depth take care of the information that F2/R2 are duplicates. So it will ignor these reads when counting. This results in a read depth of 1 for the positions 0-2 and 7-9 and a read depth 2 for the position 3-6.

picard CollectHsMetrics also ignores duplicates when counting. The difference to samtools depth is, that it doesn't count reads. It is counting fragments. This means if a read pair has an overlap at a given position, this is counted only once and not twice. This results in a coverage (I explicit avoid to call it "read depth") of 1 for the positions 0-9.

Both - samtools depth and picard CollectHsMetrics - can also exclude bases or whole reads depending on their quality.

The most accordance to igv you will get with samtools depth, because igv also counts reads and not fragments and you can regulate which reads are shown and thus counted.

fin swimmer

ADD REPLY
0
Entering edit mode

that is a nice summary that covers cases that haven't occurred to me. It is bioinformatics alright ... even simple concepts like read depth and coverage may have many competing definitions.

ADD REPLY
0
Entering edit mode

You are right in calling those 120x coverage bits as artifacts in fact when I filter the data by quality, they simply disappear; now I only have few spots with a coverage of 2 and one of 8000 (position 278016720).

ADD REPLY

Login before adding your answer.

Traffic: 2152 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6