How many genes in RNA seq?
6.5 years ago
Arindam Ghosh ▴ 530

After running StringTie/featureCounts, the output contains expression values for all the genes present in the annotation file; to be specific, 60,675 in GRCh38.p5. For finding differentially expressed genes, is it good to work with all the genes? Is there any chance of duplicate genes in that number? There are no duplicated Ensembl IDs.

P.S.: Could you suggest some good papers that have done DEG studies with the new Tuxedo suite?

RNA-Seq stringtie • 4.4k views

What is your question?


60K genes in Hg38, are you sure?


@OP: did you use only known genes while aligning/calling transcripts? Did you check for novel transcripts?


Haven't touched novel transcripts yet. Looking to find DEG among known. The problem is DESeq2 reports ~18k DEG and Ballgown ~900 at p<0.05.


When you say "new tuxedo suite", I believe you did the mapping with HISAT2. If yes, please share the mapping summary.


The overall alignment rate was >90% for all samples.

6.5 years ago

Are you saying there are 60,675 genes in human? :P That number is only around 20k; check your stats once again. How did you get to this number, by the way?

Anyway, what set of genes to consider or not largely depends on your objective. Literature reviews on similar studies will help you focus on the set of genes you should look at but you may also find something new.


If you include non-coding RNAs, I guess 20k is an underestimate. 20k is the number of protein-coding genes, which are not the only relevant part of the genome.


Even adding the non-coding genes would not lead to 60K.


That's right. For annotation summary info, check this page from Ensembl.


As @WouterDeCoster points out, 60K is about right. It's about what we see from a StringTie/Cufflinks assembly and is pretty similar to the count of all transcribed loci from Ensembl.

As to whether you want to use them all or not... That's more difficult. Whether under- or over-annotation is more of a problem in differential expression analysis is something to which there is not yet a good answer.


Actually, I wanted to know this because DESeq2 gave me 18k DEGs out of all 60k. Does the total number matter in the statistical analysis?


18k differentially expressed genes is an enormous number, which leads me to believe something is going wrong.


DESeq2 is pretty sensitive these days if you put no LFC threshold on it. If there was a strong perturbation between treatment and control and nice tight replicates, 30% of genes being DE without an effect size threshold doesn't surprise me that much. I bet it's not 18k protein-coding genes. If the pc/ncRNA split was even, then that would be 6,000 pc genes, which wouldn't be that weird.
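
To put numbers on this, here is a small Python sketch (not DESeq2 itself). The totals are the rounded figures quoted in this thread; the even spread of DE calls across biotypes and the example rows are assumptions for illustration. The column names `padj` and `log2FoldChange` follow DESeq2's results table. Filtering on fold change after the fact is not the same as testing against a threshold (DESeq2's `results()` has an `lfcThreshold` argument for that), so treat this only as a quick look at how much of a DE list is made up of small effects.

```python
# Rounded figures quoted in this thread.
total_genes = 60_000
de_genes = 18_000
protein_coding_genes = 20_000

de_fraction = de_genes / total_genes
# Assumption: DE calls are spread evenly across biotypes.
expected_pc_de = de_fraction * protein_coding_genes
print(f"{de_fraction:.0%} of genes DE, ~{expected_pc_de:.0f} protein-coding")


def filter_de(rows, alpha=0.05, min_abs_lfc=1.0):
    """Keep rows with padj < alpha and |log2FoldChange| >= min_abs_lfc."""
    kept = []
    for row in rows:
        padj = row["padj"]
        if padj is None:  # stands in for NA (genes without an adjusted p-value)
            continue
        if padj < alpha and abs(row["log2FoldChange"]) >= min_abs_lfc:
            kept.append(row)
    return kept


# Hypothetical rows, invented for illustration only.
example_rows = [
    {"gene": "geneA", "padj": 0.001, "log2FoldChange": 2.3},
    {"gene": "geneB", "padj": 0.01, "log2FoldChange": 0.2},
    {"gene": "geneC", "padj": None, "log2FoldChange": 1.5},
    {"gene": "geneD", "padj": 0.20, "log2FoldChange": -3.0},
]
print([row["gene"] for row in filter_de(example_rows)])
```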


ag1805x: You need to edit the original question and add this information there, assuming the question you wanted to ask was "is it OK to have 18K genes as DE". As it stands, the question and this trail of comments have become difficult to follow.


In a counts -> DESeq type analysis, the biggest effect of having too many genes present is that you reduce your power: every extra gene is an extra test in the multiple-testing correction, so you waste power on genes that don't exist.
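
A toy Benjamini-Hochberg calculation shows the effect. This is a plain-Python sketch, not what DESeq2 runs internally, and all p-values are invented: five "real" genes tested alone, then the same five tested alongside 45 extra genes with unremarkable p-values.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(pvalues, alpha=0.05):
    return sum(p < alpha for p in bh_adjust(pvalues))


# Invented p-values for illustration.
real_p = [0.0001, 0.0004, 0.012, 0.02, 0.03]
extra_null_p = [0.06 + 0.02 * k for k in range(45)]

print(count_significant(real_p))                 # five tests only
print(count_significant(real_p + extra_null_p))  # same genes, 50 tests
```

The raw p-values of the five genes are identical in both runs; only the number of tests changes, and fewer of them survive the correction in the second run.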

It's also possible that you are throwing off your normalisation. Your normalisation factor will be wrong if you either miss genes that exist or include genes that don't, although I wouldn't expect it to be very wrong unless there was something seriously weird about your samples.
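
The normalisation point can be illustrated with a toy median-of-ratios calculation, the approach DESeq2's size factors are based on. This is a simplified sketch, not DESeq2's code; the count matrix and the "spurious" rows are invented and deliberately exaggerated, so the shift is far larger than you would expect from real data.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of per-gene lists, one count per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue  # a zero count gives no finite geometric mean
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Invented counts: sample 2 is sequenced twice as deeply as sample 1.
real_genes = [[10, 20], [100, 200], [50, 100]]
# Invented rows standing in for annotated loci that behave very differently.
spurious_rows = [[5, 1], [7, 1], [9, 2], [6, 1]]

print(size_factors(real_genes))
print(size_factors(real_genes + spurious_rows))
```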


Well, 20K is protein-coding genes. 60k includes pseudogenes and non-coding genes. Please refer to http://mar2016.archive.ensembl.org/Homo_sapiens/Info/Annotation
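
If you want to check the breakdown in your own annotation file, a rough Python sketch like the following tallies gene records by biotype. It assumes a standard 9-column GTF in which gene records carry a `gene_biotype` attribute (Ensembl) or `gene_type` attribute (GENCODE); the example lines and IDs are made up.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def count_gene_biotypes(gtf_lines):
    """Count 'gene' records per biotype from an iterable of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # skip transcripts, exons, malformed lines
        attrs = dict(ATTR_RE.findall(fields[8]))
        biotype = attrs.get("gene_biotype") or attrs.get("gene_type") or "unknown"
        counts[biotype] += 1
    return counts


# Made-up lines for illustration; IDs are hypothetical.
example_gtf = [
    "#!hypothetical header",
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "protein_coding";',
    '1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "protein_coding";',
    '1\tsrc\tgene\t2000\t2500\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "lincRNA";',
    '2\tsrc\tgene\t50\t400\t.\t+\t.\tgene_id "GENE_C"; gene_type "processed_pseudogene";',
    '2\tsrc\tgene\t900\t1500\t.\t+\t.\tgene_id "GENE_D"; gene_biotype "protein_coding";',
]
print(count_gene_biotypes(example_gtf))
```

For a real file you would pass an open file handle instead of the list, since the function only needs an iterable of lines.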


Yes, now that makes sense! Well, what type of genes to use again depends on your objective. Obviously non-coding genes are also very important.


Including splice isoforms, GENCODE lists ~200,000 transcripts across all biotypes: https://www.gencodegenes.org
