Should Samtools Pileup Be Performed On Uniquely Mapped Reads Or All The Reads?
4
6
Entering edit mode
14.1 years ago
Doctoroots ▴ 800

I'm currently trying to infer the read depth in an area of interest from a bwa alignment sam file on an illumina run.

I am using the samtools pileup tool to get the per base coverage, and it seems to be working great. the problem is that I seem to get different read coverage when I perform the pileup on the original bwa sam file (that includes unmapped and non uniquely mapped reads) and when I perform the pileup on the same sam file only filtered for uniquely mapped reads.

The coverage is slightly higher when I use the unfiltered sam file, but if it includes non relevant reads, maybe I should stick to the uniquely mapped reads.

Your opinion?

Thanks.

pileup short-read samtools • 18k views
ADD COMMENT
5
Entering edit mode
14.1 years ago
Ian 6.1k

I would work out coverage from the mapped read. Although not necessarily uniquely mapped reads, for example BFAST can output the best uniquely scoring alignment from a read that maps to multiple places, or all equally best scoring alignments of a read.

I found using BedTools useful in determining coverage. The genomeCoverageBed tools can input ordered BAM files via the -ibam flag. The -d flag reports coverage at each base.

ADD COMMENT
1
Entering edit mode

a very nice tool, looks promising.

ADD REPLY
4
Entering edit mode
14.1 years ago

As the previous posts elude to: think of the biological question and calculate what is needed based on that.

For example I do the following (mainly using my own perl script)

  • non-normalised coverage per base
  • coverage per base where each sequence counts towards 1/n per base
  • where n = the number of times the sequence maps in the genome.
  • remove identical sequences to remove PCR amplification error
  • combination of the previous two points

Each technique has problems: e.g. when looking at repeats, are all repeats in the assembly and hence can normalisation be trusted. Or are all short reads coming from one repeat only and the rest are non-expressing. ... you get the idea. Looking at combination of the above allows me to build up a picture of expression: especially in context with the rest of the reads in a browser such as IGB later.

ADD COMMENT
1
Entering edit mode

@Alastair - I agree with your considerations and approach. Does your script work from a BAM/SAM file? Would you be able to post this script somewhere?

ADD REPLY
0
Entering edit mode

I use sam2interval on galaxy and it works on the output of that. I'll try and comment it up so it can be used by others. I'll post back here once its done (probably next week)

ADD REPLY
0
Entering edit mode

I've put the tools here. Check the README file for more information

ADD REPLY
2
Entering edit mode
14.1 years ago
Michael 55k

Most publications I have seen so far (for RNA-seq data) discarded non-uniquely mapped reads. That could become a problem mainly for duplicated regions which then will have little to no coverage. So if you see large regions with very low coverage, you could also need to use the unfiltered data again. Depends a bit on your application, e.g. I am not sure if I would filter for copy-number variation data, too.

ADD COMMENT
0
Entering edit mode

is that true? most RNA-Seq papers i've seen do "probabilistic" alignment of reads. because, for example, you can't be sure which transcript a particular read came from but don't want to discard it. cufflinks, splicemap, and supersplat all do some sort of statistical model to assign ambiguously mapped reads.

ADD REPLY
0
Entering edit mode

if the reads are paired, with the help of mapped insert size, one could tell if they are PCR duplicates. It would be probabilistically very unlikely. I would be interested to know how much of this could be said for single end reads and do people still mark duplicates in SE reads as well?

ADD REPLY
0
Entering edit mode

Michael, On the other hand, when you want to determine copy number variation, aren't you interested in "unique reads" (=> all fragments per molecule counted once)?

ADD REPLY
2
Entering edit mode
14.1 years ago
lh3 33k

SAMtools pileup discards unmapped reads, secondary alignments and duplicates. It uses non-unique reads. To show unique mappings only:

samtools view -uq1 aln.bam | samtools pileup -

or

samtools mpileup -q1 aln.bam

For RNA-seq, neither using unique reads only nor using all reads is optimal. I know sophisticated tools redistribute non-unique mappings, or even model read counting to get better estimate.

ADD COMMENT
0
Entering edit mode

That is a great way to pick unique alinments usually. But how to interpret the coexists of MAPQ 0 and XT:A:U in one alignment of bwa output?

ADD REPLY

Login before adding your answer.

Traffic: 1683 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6