How to find % alignment of total reads to genic region?
2
0
Entering edit mode
8.3 years ago
sun.nation ▴ 140

When we align reads to reference genome with gff annotation file, how can we know that ..% of total reads got aligned to annotated regions of the reference genome. Is there any tool to check this?

alignment gff bowtie • 2.5k views
ADD COMMENT
2
Entering edit mode
8.3 years ago
brent_wilson ▴ 140

Something similar has been covered previously:

Extract Reads From A Bam File That Fall Within A Given Region

Short answer, Samtools has the functionality to extract a few known positions quite easily.

For the larger, more scalable solution, look to:

Extracting regions from .bam file using a .gff or .gtf file

Brent Wilson, PhD | Project Scientist | Cofactor Genomics

4044 Clayton Ave. | St. Louis, MO 63110 | tel. 314.531.4647

Catch the latest from Cofactor on our blog.

ADD COMMENT
0
Entering edit mode

Thanks, second link worked:

gff2bed < Corca1_all_genes_20140615.gff > Corca.bed
cut -f1,2,3,8 Corca.bed > Corca-selected.bed
egrep -v "stop_codon|start_codon" Corca-selected.bed > corca-final.bed
bgzip corca-final.bed
tabix corca-final.bed.gz
bgzip after-snps_hardFilter_DPlt10_Nocall.vcf-ADgt10-biallelic.vcf
tabix after-snps_hardFilter_DPlt10_Nocall.vcf-ADgt10-biallelic.vcf.gz
bcftools annotate -a corca-final.bed.gz after-snps_hardFilter_DPlt10_Nocall.vcf-ADgt10-biallelic.vcf.gz -c CHROM,FROM,TO,ID > c1_c2_soy_annotate.vcf
grep -v "#" c1_c2_soy_annptate.vcf  > woheader.c1_c2_soy_annptate.vcf
ADD REPLY
0
Entering edit mode
8.3 years ago
Jeffin Rockey ★ 1.3k

With default settings, most aligners allow multi-mappings for a read, and hence requires more caution with percentage wise statistics.

When exactly only one alignment per read is being considered ( the statistics is with respect to the primary alignment only or so), the below should suffice.

1) Get primary alignments only. samtools view -bh -F 256 all_aligned.bam >only_primary_aligned.bam

If the bam file has unaligned as well (as in bowtie2 or bwa) use 260 instead of 256.

2) Get primary alignment count

samtools view -c -F 256 all_aligned.bam >only_primaryAligned.bam.count (count of primary aligned reads)

3) Get primary alignments that has alignment with at least one gff feature (Please see -u documentation for bedtools intersect)

bedtools intersect -abam only_primaryAligned.bam -b annotation.gff3 -u -bed >annotations_intersection.bed

4) Take count of annotations_intersections.bed using wc -l or so (--> annotation_intersections.count)

(annotation_intersections.count/only_primaryAligned.bam.count)*100 should give percentage based on aligned reads.

(annotation_intersections.count/total_read_count)*100 should give percentage based on total reads.

But in the situation where you do not limit to primary alignments or uniquely mapped only, a read may have one or more alignments in genic regions as well as one or more alignments in inter-genic regions and so on.Hence appropriate consideration need to be given for these possible multiple alignments while calculating.

ADD COMMENT

Login before adding your answer.

Traffic: 2040 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6