Question

How can I determine the number of detected genes and detected transcripts/isoforms in a BAM file + GTF file?

21 months ago

O.rka ▴ 740

This question is a follow-up to a previous question. I would very much appreciate help with that one as well if you have the expertise.

I need to determine the number of unique genes and unique transcripts. I have the following files:

lrwxrwxrwx 1 jespinoz jcl110 57 Mar 13 21:13 mapped.sorted.bam.coverage.tsv.gz -> ../intermediate/1__star/mapped.sorted.bam.coverage.tsv.gz
lrwxrwxrwx 1 jespinoz jcl110 45 Mar 13 21:13 mapped.sorted.bam.bai -> ../intermediate/1__star/mapped.sorted.bam.bai
lrwxrwxrwx 1 jespinoz jcl110 44 Mar 13 21:13 mapped.reads.list.gz -> ../intermediate/1__star/mapped.reads.list.gz
lrwxrwxrwx 1 jespinoz jcl110 41 Mar 13 21:13 mapped.sorted.bam -> ../intermediate/1__star/mapped.sorted.bam

I also have the GTF file for the genome, which is GRCh38.p13_GENCODE.39_ERCC92 (i.e., the human GENCODE build plus the ERCC92 spike-ins).

What tools can I use to get statistics such as the number of genes and transcripts detected from a BAM file using the GTF as a guide?

rnaseq scRNA-seq single-cell genomics ngs • 1.1k views

Written 21 months ago by O.rka • updated 21 months ago by GenoMax

It looks like you have already used featureCounts, which is what I would recommend here.
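As a minimal sketch of how "detected genes" could be tallied from a featureCounts table: the function below assumes the standard featureCounts output layout (a leading `#` comment line, then a header of `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, followed by one count column per BAM). The name `count_detected` and the `min_count` threshold are hypothetical and chosen only for illustration; what counts as "detected" is a judgment call, and the table contents are made up.

```python
import csv
import io


def count_detected(table, min_count=1):
    """Return {sample: number of features with count >= min_count}.

    `table` is any iterable of lines in featureCounts output format.
    `min_count` is an illustrative threshold, not a recommended value.
    """
    # Skip the leading "# Program:featureCounts ..." comment line(s).
    rows = csv.reader((ln for ln in table if not ln.startswith("#")), delimiter="\t")
    header = next(rows)
    samples = header[6:]  # columns after Geneid, Chr, Start, End, Strand, Length
    detected = dict.fromkeys(samples, 0)
    for row in rows:
        for sample, value in zip(samples, row[6:]):
            # float() also tolerates fractional counts.
            if float(value) >= min_count:
                detected[sample] += 1
    return detected


# Tiny made-up table for illustration only.
example = io.StringIO(
    "# Program:featureCounts (illustrative header)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcell_1.bam\tcell_2.bam\n"
    "geneA\tchr1\t1\t100\t+\t100\t5\t0\n"
    "geneB\tchr1\t200\t300\t-\t101\t0\t0\n"
    "geneC\tchr2\t1\t50\t+\t50\t2\t7\n"
)
print(count_detected(example))  # {'cell_1.bam': 2, 'cell_2.bam': 1}
```

Raising `min_count` gives a stricter definition of detection, which matters for sparse single-cell libraries.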

If you want transcript-level estimates, shouldn't you be using salmon or kallisto instead?
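For the salmon route, a rough sketch of counting detected transcripts and their parent genes from a `quant.sf` file could look like the following. It assumes salmon's usual `quant.sf` columns (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`); the function name `detected_from_quant`, the `tx2gene` dictionary, and the `min_reads` cutoff are hypothetical, and the rows are made up.

```python
import csv
import io


def detected_from_quant(quant_lines, tx2gene, min_reads=1.0):
    """Return (n_detected_transcripts, n_detected_genes).

    A transcript is called detected when NumReads >= min_reads
    (an illustrative cutoff). `tx2gene` maps transcript ID -> gene ID.
    """
    transcripts, genes = set(), set()
    for row in csv.DictReader(quant_lines, delimiter="\t"):
        if float(row["NumReads"]) >= min_reads:
            transcripts.add(row["Name"])
            # Transcripts missing from the mapping are skipped at gene level.
            if row["Name"] in tx2gene:
                genes.add(tx2gene[row["Name"]])
    return len(transcripts), len(genes)


# Made-up quantification rows and mapping for illustration only.
quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1000\t850.0\t12.5\t30.0\n"
    "tx2\t1200\t1050.0\t0.0\t0.0\n"
    "tx3\t900\t750.0\t4.2\t8.0\n"
)
mapping = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA"}
print(detected_from_quant(quant, mapping))  # (2, 1)
```

Here two isoforms of the same gene are detected, so the transcript count exceeds the gene count.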

Reply by GenoMax, 21 months ago

I probably should have used salmon in the first place, but I wanted to kill two birds with one stone by creating a wrapper for STAR for another project. Do you know why there would be more transcripts detected than genes? I'm not sure whether I'm missing some quality parameters that are left at their defaults.

Reply by O.rka, 21 months ago

You have this tagged as single-cell, so, assuming that is what the data is, perhaps you should have been using STARsolo or alevin? Those are the right tools for single-cell data.

As for this from the other thread:

"In every case I'm getting more unique genes than transcripts, which doesn't make any sense to me since transcripts are a subset of genes"

Isn't that what the biology does? Each gene is capable of producing a large number of transcripts (when there are multiple exons), so the number of unique genes is always going to be smaller than the set of transcripts they can potentially produce.
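The genes-versus-transcripts relationship in the annotation itself can be checked directly from the GTF. The sketch below assumes GENCODE-style attributes (`gene_id "..."; transcript_id "...";` in column 9); the function name `annotation_sizes` and the example records are hypothetical and made up for illustration.

```python
import re

GENE_RE = re.compile(r'gene_id "([^"]+)"')
TX_RE = re.compile(r'transcript_id "([^"]+)"')


def annotation_sizes(gtf_lines):
    """Return (n_unique_genes, n_unique_transcripts) annotated in a GTF."""
    genes, transcripts = set(), set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        gene = GENE_RE.search(fields[8])
        tx = TX_RE.search(fields[8])
        if gene:
            genes.add(gene.group(1))
        if tx:
            transcripts.add(tx.group(1))
    return len(genes), len(transcripts)


def _row(feature, attrs):
    # Helper that builds one made-up GTF record for the demo below.
    return "\t".join(["chr1", "demo", feature, "1", "100", ".", "+", ".", attrs]) + "\n"


demo = [
    "##description: made-up records\n",
    _row("gene", 'gene_id "G1";'),
    _row("transcript", 'gene_id "G1"; transcript_id "T1";'),
    _row("transcript", 'gene_id "G1"; transcript_id "T2";'),
    _row("exon", 'gene_id "G1"; transcript_id "T2";'),
    _row("gene", 'gene_id "G2";'),
    _row("transcript", 'gene_id "G2"; transcript_id "T3";'),
]
print(annotation_sizes(demo))  # (2, 3)
```

Collecting IDs from every record, rather than only `gene` or `transcript` features, is a deliberate choice here so that entries annotated with `exon` lines alone are still counted.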

Reply by GenoMax, 21 months ago

That's what I was expecting, but I got the opposite: there were more detected genes than detected transcripts. It is somewhat visible on the plot if you look at the range of the y-axis. I didn't use STARsolo, but the sequencing core demultiplexed the reads into individual cells, so I had a FASTQ file for each single cell.

Reply by O.rka, 21 months ago

If this is single-cell data, then this is likely because of the 3' end detection bias. In general, you are also detecting fewer genes because of the technical limitations of current methods.

Reply by GenoMax, 21 months ago