I have a bam file in which some reads are mapped uniquely and some are mapped to multiple genomic positions (because of gene duplications or pseudogenes or whatever). I want to be able to calculate the coverage for any base in the genome. I know how to calculate coverage in general, but I have two questions regarding the treatment of reads that map to multiple locations.
If a read maps to N locations, should I count it at each location with a depth of 1/N? In other words, should I "split" the coverage equally across the N locations? Should I use mapping quality or some other measure to do a weighted split? Or should I just count it as a depth of 1 at all locations where it maps?
When reading from a bam file, how can I easily tell how many positions a read has been mapped to? Is there a field in the bam file that gives me this information? Is the read simply listed multiple times? Can I easily preprocess the bam file to add a "map count" in a custom field?
Edit
I want to calculate coverage mostly for QC-stats. I am doing exome sequencing and I want to be able to answer questions such as "What fraction of exonic nucleotides were covered at at least 5x?"
FWIW, I think counting depth as 1/N makes the most sense. Otherwise, reads from repetitive regions get more weight then non-repetitive ones - i.e. the sum contribution from each read should be 1.
For point 1, it may be worth reading Alkan et al, 2009. http://www.nature.com/ng/journal/v41/n10/abs/ng.437.html