I used HISAT2 to align my reads to the mouse genome, and most samples had alignment rates above 90%. I then converted the output to BAM, sorted it, and indexed it. But when I ran featureCounts to generate counts, only around 35-65% of the reads were assigned to features. Kindly help me with this if possible. The counts are:
Status sorted_CC1.bam
Assigned 81909834
Unassigned_Unmapped 7465379
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 32127770
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 101236787
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 0
How can I improve it?
It seems that most of your reads fall in genomic regions that have no annotation. Have you tried exploring the alignments in a genome browser (e.g. IGV), or extracting the reads that do not overlap exons?
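To put numbers on that, here is a small sketch that turns the summary posted above into percentages. Only the non-zero rows are copied in, and the variable names are just for illustration. For this sample it shows roughly 37% assigned and roughly 45% with no overlapping feature.

```python
# Non-zero rows copied from the featureCounts summary for sorted_CC1.bam.
summary = {
    "Assigned": 81909834,
    "Unassigned_Unmapped": 7465379,
    "Unassigned_MultiMapping": 32127770,
    "Unassigned_NoFeatures": 101236787,
}

total = sum(summary.values())
percent = {status: 100 * count / total for status, count in summary.items()}

# Print the categories from largest to smallest share.
for status, share in sorted(percent.items(), key=lambda item: -item[1]):
    print(f"{status:26s}{share:6.2f}%")
```

Running the same tally on each sample's summary makes it easy to see whether the NoFeatures share is consistently the largest category.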
I am very new to this kind of data handling. Can you please guide me on how I should do that and how it will help me?
I haven't done it myself, but I can point you to the tools; you will have to do the heavy lifting yourself. I recommend checking out bedops to convert your annotation file to BED and to get a complement annotation (everything that is not an exon). Then start looking at samtools view to extract all the alignments within those regions before making any assumptions. There are plenty of answers that already cover this, so search the forum and you will find them.
As noted in this thread, use the Integrative Genomics Viewer (IGV). Quick start guide: https://igv.org/doc/desktop/#QuickStart/
There is a more detailed user guide linked in the left pane.
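To illustrate what a "complement annotation" means, here is a toy sketch in plain Python. It is not how bedops does it internally; it only shows the idea of merging overlapping exons and reporting the gaps between them. The exon coordinates and chromosome size are made up, and intervals are assumed to be 0-based, half-open as in BED.

```python
from collections import defaultdict


def complement(exons, chrom_sizes):
    """Return the regions of each chromosome not covered by any exon.

    exons: iterable of (chrom, start, end), 0-based half-open (BED style).
    chrom_sizes: dict mapping chromosome name to its length.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((start, end))

    gaps = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # end of the region covered so far
        for start, end in sorted(by_chrom[chrom]):
            if start > cursor:
                gaps.append((chrom, cursor, start))
            cursor = max(cursor, end)  # handles overlapping exons
        if cursor < size:
            gaps.append((chrom, cursor, size))
    return gaps


# Hypothetical toy annotation: two overlapping exons and one separate exon.
exons = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 500, 600)]
sizes = {"chr1": 1000}

for chrom, start, end in complement(exons, sizes):
    print(chrom, start, end, sep="\t")
```

On real data the dedicated tools are the better choice. A BED file of non-exonic regions produced by them could then be given to samtools view with its -L option to pull out the alignments overlapping those regions.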
This is the command I used. Are there any modifications I can make to get a better count?
With new versions of featureCounts you also need to add the following option when you have paired-end reads and use -p
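For reference, in Subread 2.0.2 and later, -p on its own only marks the input as paired-end, and --countReadPairs is needed for fragments (read pairs) to be counted instead of individual reads. That may be the option meant here, although the original command is not shown in this thread. A minimal sketch of assembling such a call follows; the helper name and the file names annotation.gtf and counts.txt are hypothetical placeholders.

```python
import shlex


def featurecounts_cmd(bam, gtf, out, paired=True, threads=4):
    """Build a featureCounts argument list (the command is not executed here)."""
    cmd = [
        "featureCounts",
        "-T", str(threads),   # number of threads
        "-t", "exon",         # feature type to count (featureCounts default)
        "-g", "gene_id",      # attribute used to group features (default)
        "-a", gtf,            # annotation file
        "-o", out,            # output counts table
    ]
    if paired:
        # Subread >= 2.0.2: -p declares paired-end input,
        # --countReadPairs counts fragments rather than reads.
        cmd += ["-p", "--countReadPairs"]
    cmd.append(bam)
    return cmd


# Placeholder file names; substitute your own annotation and output paths.
cmd = featurecounts_cmd("sorted_CC1.bam", "annotation.gtf", "counts.txt")
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal, or the list can be passed to subprocess.run. Note that this changes what is counted (pairs versus reads), so it would not by itself be expected to change the proportion of reads that fall outside annotated features.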
Even after adding that, there is no real improvement in the % assigned.
Since we can't access or see your data, you are going to need to diagnose the issue yourself or ask for local help. You could also use salmon ( https://salmon.readthedocs.io/en/latest/ ) with the latest mouse transcriptome to see if you get better results. Assuming all of your samples have similar assignment rates, you could move forward with the analysis and see what you get. If there is a real problem with the data (e.g. bad libraries, DNA contamination, etc.), no bioinformatics magic will fix that.