Entering edit mode
3 days ago
rajdeepboral00
▴
10
This is currently the featureCounts command I am using to get the counts, but only 39.8% is getting successfully assigned. Can anyone please suggest that how can I improve my counts?
featureCounts -p -T 8 \
--countReadPairs \
-a /path/to/GRCm39/gencode.vM36.chr_patch_hapl_scaff.annotation.gtf \
-s2 \
-t exon \
-o counts_CT1.txt sorted_CT1.bam
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v2.0.8
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| ||
|| sorted_CT1.bam ||
|| ||
|| Output file : counts_CT1.txt ||
|| Summary : counts_CT1.txt.summary ||
|| Paired-end : yes ||
|| Count read pairs : yes ||
|| Annotation : gencode.vM36.chr_patch_hapl_scaff.annotation ... ||
|| Dir for temp files : ./ ||
|| ||
|| Threads : 8 ||
|| Level : meta-feature level ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file gencode.vM36.chr_patch_hapl_scaff.annotation.gtf ... ||
|| Features : 1821019 ||
|| Meta-features : 78298 ||
|| Chromosomes/contigs : 38 ||
|| ||
|| Process BAM file sorted_CT1.bam... ||
|| Strand specific : reversely stranded ||
|| Paired-end reads are included. ||
|| Total alignments : 84936948 ||
|| Successfully assigned alignments : 33798384 (39.8%) ||
|| Running time : 0.72 minutes
There is no way to "improve" counts, if that is what you data has (and you have done the analysis properly). You seem to be counting at the exon level (and not summarizing at gene level), is that being done on purpose?
No, i want to get the gene counts. I though this is equivalent to gene counts only. Can you please tell me how to get the gene count? what to include in the code? Actually i am very new to this data analysis and i dont know much.
If I remember correctly you will need to add the
-g
parameter (eg.-g gene_id
) where the value of that parameter is what you want to summarize on.In my example it will summarize all matches of reads to exons onto the level denoted by gene_id (might be something else in the gff you are using though, but gene_id is the most likely one) In essence you will need to use the tag that groups all exons to a unique feature (gene_id, transcript_id ... )
The command looks fine to me. The
-g
parameter should begene_id
by default. The-t exon
parameter is also default. These two options work together to specify, for each gene id, calculate the number of reads over their exons.Adding the
-f
flag would make it count each exon seperately. As is (without-f
), it counts reads over each exon, then adds the counts from exons with the same gene_id.The percentage of mapped reads assigned to features in your annotation file will depend on what annotations you are mapping them to and the nature of your data.
For example, for ribo-depleted RNA-seq this is roughly the rate I see when mapping only to gene exons, especially when excluding multi-mappers and overlapping reads. This is because, for my data, many RNA molecules come from repetitive (many copies in the genome) regions, and I also detect a significant amount of intronic signal, which can add up considering how large introns are.
I have added the -g gene_id and -t exon options too but still the %assigned in featureCounts is same only...so is that normal to have such low percentage? or there maybe chance that i am doing some kind of error? The protocol used ribo depletion.Kindly answer.
Also inspect your alignments visually in a browser like IGV to see where the reads are piling up. If they are aligning outside gene models then that is a characteristic of your data.
Can you confirm that you are using the correct GTF file linked to the genome you used for doing the mapping (species, version, ...) ?
One way to check why there is a low read count is to also count the multimappings. (you might get some pointers from this post for inspiration: featureCounts --primary tag counting uniquely mapped reads and excludes multi-mapped reads? ). If the "missing percentages" are in that part you at least know what is going on.
This is the summary of my alignment for one of my samples and i had used the fasta and gtf files from https://www.gencodegenes.org/mouse/ ...For fasta i have taken the "whole genome" and for the gtf i have used the "Comprehensive gene annotation ALL It contains the comprehensive gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes)" of GRCm39 Should i take into account the ambiguous and multimapping reads?
Looks like a large fraction of your reads do not map to known features (or could be mapping to rRNA/introns, which you are likely not counting).
Can there be a chance that i am making any mistakes? or this is what the data is.
There's no obvious mistake I see, but I agree with GenoMax that you should check your RNA signal on IGV to see how it aligns to genes.
I tend to use the "primary" GTFs instead of the "All" gtfs, but it seems that it shouldn't matter that much.