21 months ago
O.rka
This question is a follow-up to this previous question. I would very much appreciate help with the previous one if you have the expertise.
I need to determine the number of unique genes and unique transcripts. I have the following files:
lrwxrwxrwx 1 jespinoz jcl110 57 Mar 13 21:13 mapped.sorted.bam.coverage.tsv.gz -> ../intermediate/1__star/mapped.sorted.bam.coverage.tsv.gz
lrwxrwxrwx 1 jespinoz jcl110 45 Mar 13 21:13 mapped.sorted.bam.bai -> ../intermediate/1__star/mapped.sorted.bam.bai
lrwxrwxrwx 1 jespinoz jcl110 44 Mar 13 21:13 mapped.reads.list.gz -> ../intermediate/1__star/mapped.reads.list.gz
lrwxrwxrwx 1 jespinoz jcl110 41 Mar 13 21:13 mapped.sorted.bam -> ../intermediate/1__star/mapped.sorted.bam
I also have the GTF file for the genome, which is GRCh38.p13_GENCODE.39_ERCC92 (i.e., the human GENCODE build plus the ERCC92 spike-ins).
What tools can I use to get statistics such as the number of genes and transcripts detected from a BAM file using the GTF as a guide?
Looks like you have already used featureCounts, which is what would be recommended here. If you want to do transcript-level estimation, then should you not be using salmon or kallisto instead?
I probably should have used salmon in the first place, but I wanted to kill two birds with one stone by creating a wrapper for STAR for another project. Do you know why there would be more transcripts detected than genes? I'm not sure if there are some quality parameters set to their defaults that I might be missing.
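For anyone wanting to tally "detected" features from a featureCounts table, here is a minimal sketch. It assumes the standard featureCounts layout (a leading comment line starting with #, then the columns Geneid, Chr, Start, End, Strand, Length, followed by one count column per BAM). The function name count_detected and the min_count threshold are hypothetical, the toy table is made up, and what counts as "detected" is a judgment call rather than anything featureCounts defines.

```python
import csv
import io


def count_detected(table_text, min_count=1):
    """Count features with at least min_count reads in any sample column.

    min_count is an illustrative threshold, not a featureCounts setting.
    """
    # Skip the '#' comment line that featureCounts writes at the top.
    data_lines = (
        line for line in io.StringIO(table_text) if not line.startswith("#")
    )
    rows = csv.reader(data_lines, delimiter="\t")
    next(rows)  # header: Geneid, Chr, Start, End, Strand, Length, <samples>
    first_sample = 6  # count columns start after the six annotation columns
    detected = 0
    for row in rows:
        # float() also tolerates fractional counts
        if any(float(value) >= min_count for value in row[first_sample:]):
            detected += 1
    return detected


# Toy table (invented values, not real data)
example = (
    "# Program:featureCounts (toy example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tmapped.sorted.bam\n"
    "GENE_A\tchr1\t100\t900\t+\t801\t12\n"
    "GENE_B\tchr1\t2000\t2600\t-\t601\t0\n"
    "GENE_C\tchr2\t50\t450\t+\t401\t3\n"
)
print(count_detected(example))
```

Running the same function on a gene-level table and on a transcript-level table would give the two numbers being compared, provided the same threshold is used for both.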
You have this tagged as single cell, so assuming that is what the data is, perhaps you should have been using STARsolo or alevin? Those are the right tools for single cell data.
As for this from the other thread:
Isn't that what the biology does? Each gene is capable of producing a large number of transcripts (when there are multiple exons), so the number of unique genes is always going to be smaller than the possible set of transcripts they can potentially produce.
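To put a number on the "possible set" in a given annotation, one could count the distinct gene_id and transcript_id values in the GTF. The sketch below assumes the usual nine-column GTF layout with attributes written as key "value"; pairs; the function name annotation_sizes is hypothetical and the toy records are invented, not taken from GENCODE.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def annotation_sizes(gtf_lines):
    """Return (number of distinct gene_ids, number of distinct transcript_ids)."""
    genes, transcripts = set(), set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attributes = fields[8]
        gene = GENE_ID.search(attributes)
        if gene:
            genes.add(gene.group(1))
        transcript = TRANSCRIPT_ID.search(attributes)
        if transcript:
            transcripts.add(transcript.group(1))
    return len(genes), len(transcripts)


# Toy annotation (invented): G1 has two isoforms, G2 has one.
toy_gtf = [
    "##description: toy annotation\n",
    'chr1\tTOY\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tTOY\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tTOY\ttranscript\t100\t700\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n',
    'chr2\tTOY\tgene\t50\t450\t.\t-\t.\tgene_id "G2";\n',
    'chr2\tTOY\ttranscript\t50\t450\t.\t-\t.\tgene_id "G2"; transcript_id "T3";\n',
]
n_genes, n_transcripts = annotation_sizes(toy_gtf)
print(n_genes, n_transcripts)
```

This only describes what the annotation allows; how many of those genes and transcripts are actually detected in a given BAM is a separate question.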
That's what I was expecting, but I got the opposite: there were more detected genes than detected transcripts. It is kind of visible on the plot if you look at the range on the y-axis. I didn't use STARsolo, but the sequencing core demultiplexed the reads into their individual cells, so I had a FASTQ for each single cell.
If this is single cell data, then this is likely because of the 3' end detection bias. In general you are also detecting fewer genes because of the technical limitations of current methods.