Question

Should Samtools Pileup Be Performed On Uniquely Mapped Reads Or All The Reads?

6

Entering edit mode

14.5 years ago

Doctoroots ▴ 810

I'm currently trying to infer the read depth in an area of interest from a bwa alignment sam file on an illumina run.

I am using the samtools pileup tool to get the per base coverage, and it seems to be working great. the problem is that I seem to get different read coverage when I perform the pileup on the original bwa sam file (that includes unmapped and non uniquely mapped reads) and when I perform the pileup on the same sam file only filtered for uniquely mapped reads.

The coverage is slightly higher when I use the unfiltered sam file, but if it includes non relevant reads, maybe I should stick to the uniquely mapped reads.

Your opinion?

Thanks.

pileup short-read samtools • 18k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 14.5 years ago by Doctoroots ▴ 810

Ram · Answer 1 · 2010-11-04

5

Entering edit mode

14.5 years ago

Ian 6.1k

I would work out coverage from the mapped read. Although not necessarily uniquely mapped reads, for example BFAST can output the best uniquely scoring alignment from a read that maps to multiple places, or all equally best scoring alignments of a read.

I found using BedTools useful in determining coverage. The genomeCoverageBed tools can input ordered BAM files via the -ibam flag. The -d flag reports coverage at each base.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.5 years ago by Ian 6.1k

1

Entering edit mode

a very nice tool, looks promising.

ADD REPLY • link 14.5 years ago by Doctoroots ▴ 810

Ram · Answer 2 · 2010-11-04

4

Entering edit mode

14.5 years ago

Alastair Kerr 5.3k

As the previous posts elude to: think of the biological question and calculate what is needed based on that.

For example I do the following (mainly using my own perl script)

non-normalised coverage per base
coverage per base where each sequence counts towards 1/n per base
where n = the number of times the sequence maps in the genome.
remove identical sequences to remove PCR amplification error
combination of the previous two points

Each technique has problems: e.g. when looking at repeats, are all repeats in the assembly and hence can normalisation be trusted. Or are all short reads coming from one repeat only and the rest are non-expressing. ... you get the idea. Looking at combination of the above allows me to build up a picture of expression: especially in context with the rest of the reads in a browser such as IGB later.

ADD COMMENT • link 14.5 years ago by Alastair Kerr 5.3k

1

Entering edit mode

@Alastair - I agree with your considerations and approach. Does your script work from a BAM/SAM file? Would you be able to post this script somewhere?

ADD REPLY • link 14.5 years ago by Casey Bergman 18k

0

Entering edit mode

I use sam2interval on galaxy and it works on the output of that. I'll try and comment it up so it can be used by others. I'll post back here once its done (probably next week)

ADD REPLY • link 14.5 years ago by Alastair Kerr 5.3k

0

Entering edit mode

I've put the tools here. Check the README file for more information

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 14.5 years ago by Alastair Kerr 5.3k

score 2 · Answer 3 · 2010-11-04

2

Entering edit mode

14.5 years ago

Michael 55k

Most publications I have seen so far (for RNA-seq data) discarded non-uniquely mapped reads. That could become a problem mainly for duplicated regions which then will have little to no coverage. So if you see large regions with very low coverage, you could also need to use the unfiltered data again. Depends a bit on your application, e.g. I am not sure if I would filter for copy-number variation data, too.

ADD COMMENT • link 14.5 years ago by Michael 55k

0

Entering edit mode

is that true? most RNA-Seq papers i've seen do "probabilistic" alignment of reads. because, for example, you can't be sure which transcript a particular read came from but don't want to discard it. cufflinks, splicemap, and supersplat all do some sort of statistical model to assign ambiguously mapped reads.

ADD REPLY • link 14.5 years ago by brentp 24k

0

Entering edit mode

if the reads are paired, with the help of mapped insert size, one could tell if they are PCR duplicates. It would be probabilistically very unlikely. I would be interested to know how much of this could be said for single end reads and do people still mark duplicates in SE reads as well?

ADD REPLY • link 13.6 years ago by Arun 2.4k

0

Entering edit mode

Michael, On the other hand, when you want to determine copy number variation, aren't you interested in "unique reads" (=> all fragments per molecule counted once)?

ADD REPLY • link 13.6 years ago by Arun 2.4k

Ram · Answer 4 · 2010-11-04

2

Entering edit mode

14.5 years ago

lh3 33k

SAMtools pileup discards unmapped reads, secondary alignments and duplicates. It uses non-unique reads. To show unique mappings only:

samtools view -uq1 aln.bam | samtools pileup -

or

samtools mpileup -q1 aln.bam

For RNA-seq, neither using unique reads only nor using all reads is optimal. I know sophisticated tools redistribute non-unique mappings, or even model read counting to get better estimate.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.5 years ago by lh3 33k

0

Entering edit mode

That is a great way to pick unique alinments usually. But how to interpret the coexists of MAPQ 0 and XT:A:U in one alignment of bwa output?

ADD REPLY • link 13.2 years ago by Ct586 ▴ 630