When we align reads to reference genome with gff annotation file, how can we know that ..% of total reads got aligned to annotated regions of the reference genome. Is there any tool to check this?
When we align reads to reference genome with gff annotation file, how can we know that ..% of total reads got aligned to annotated regions of the reference genome. Is there any tool to check this?
Something similar has been covered previously:
Extract Reads From A Bam File That Fall Within A Given Region
Short answer, Samtools has the functionality to extract a few known positions quite easily.
For the larger, more scalable solution, look to:
Extracting regions from .bam file using a .gff or .gtf file
Brent Wilson, PhD | Project Scientist | Cofactor Genomics
4044 Clayton Ave. | St. Louis, MO 63110 | tel. 314.531.4647
Catch the latest from Cofactor on our blog.
With default settings, most aligners allow multi-mappings for a read, and hence requires more caution with percentage wise statistics.
When exactly only one alignment per read is being considered ( the statistics is with respect to the primary alignment only or so), the below should suffice.
1) Get primary alignments only. samtools view -bh -F 256 all_aligned.bam >only_primary_aligned.bam
If the bam file has unaligned as well (as in bowtie2 or bwa) use 260 instead of 256.
2) Get primary alignment count
samtools view -c -F 256 all_aligned.bam >only_primaryAligned.bam.count (count of primary aligned reads)
3) Get primary alignments that has alignment with at least one gff feature (Please see -u documentation for bedtools intersect)
bedtools intersect -abam only_primaryAligned.bam -b annotation.gff3 -u -bed >annotations_intersection.bed
4) Take count of annotations_intersections.bed using wc -l or so (--> annotation_intersections.count)
(annotation_intersections.count/only_primaryAligned.bam.count)*100 should give percentage based on aligned reads.
(annotation_intersections.count/total_read_count)*100 should give percentage based on total reads.
But in the situation where you do not limit to primary alignments or uniquely mapped only, a read may have one or more alignments in genic regions as well as one or more alignments in inter-genic regions and so on.Hence appropriate consideration need to be given for these possible multiple alignments while calculating.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks, second link worked: