Entering edit mode
9.0 years ago
Pei
▴
220
Hi all:
I am working on pair-end RNA-seq data to get read counts by running "bedtools multicov" on the bam file generated by Tophat.
Should I perform some kinds of filtering of reads before bedtools? (such as remove duplicated reads by samtools or remove poor-quality reads)
Your opinion must be most valuable.
Thanks in advance!
Best
Thanks Devon.
But could you give some more explanation for not using bedtools multicov?
It gives incorrect counts for statistics. The statistics you're trying to do need unique counts, which multicov won't produce, since that isn't its purpose.
Hi Devon,
Can you please tell what is the purpose of the bedtools multivcov ? And what kind of statistics you are talking.? I have used bedtools multicov many times to get read counts for downstream differential expression analysis
Thanks a lot
Chirag
Say you have DNAseq samples and want to know what the coverage in an area is, then multicov would be quite useful. If you're trying to get counts for DESeq2 or edgeR or something similar then your counts will tend to be wrong.
Hi Devon,
Thanks for your response. Can you please tell me why bedtools multicov would be wrong for DESeq2 and edgeR? What about samtools idxstats? is it useful for DESeq and edgeR?
Thanks
Chirag
This boils two to how one should deal with the following:
In the case of #1, multicov will generally include these, though there's likely a flag to prevent that. These should never be included in the counts given to DESeq2 and the others since it's breaks the statistical assumptions.
In the case of #2, my understanding was always that multicov will count a read for all features that it overlaps. This also violates the statistical assumptions used in DEseq2 and the others.
For these reasons I think most people use featureCounts these days (it's MUCH faster than htseq-count).
If you've aligned to the transcriptome and are trying to get transcript level counts, then yes you can use samtools idxstats, but note that you first need to remove multimappers. In this case I would strongly encourage you to instead use the BAM file with Salmon and then to use Sleuth instead of edgeR/DESeq2/etc. It's very likely that you'll get better results (in fact, you'll probably see this becoming "the standard method" that everyone uses over the next couple years).