Question

Different go terms for hg19 vs ensembl

2

Entering edit mode

7.7 years ago

gregory.l.stone ▴ 30

I am conducting a differential gene expression analysis to determine the difference in the change in gene expression in each sex due to a treatment in human samples. For this type of study would you recommend UCSC or ensembl? I am concerned by the significant difference in GO terms and kegg pathway results that I get from alignment and differential expression analysis using GRCh37 ensembl genome vs hg19 genome for human samples. There was a significant difference between the two all the way through the workflow, which was to be expected since the two reference genomes and annotations are different. I was wondering which one is most reliable? My results are so different between the two that I'm finding it hard to make sense of it all. Any help would be greatly appreciated.

RNA-Seq gene ontology kegg pathway ensembl UCSC • 2.1k views

ADD COMMENT • link 7.7 years ago by gregory.l.stone ▴ 30

score 2 · Accepted Answer · 2017-03-10

2

Entering edit mode

7.7 years ago

Petr Ponomarenko ★ 2.8k

Could you please be more specific with your workflow, tools used and input files. Maybe you can share some of the input files and intermediate results with us? Overal, huge difference in GO terms for same genes in different annotations is strange.

You obviously looked at GRCh37/38(NCBI) vs hg19/hg38(UCSC) , right?

ADD COMMENT • link 7.7 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Thank you for your response. I did look at that, but have also read other posts that suggest the results shouldnt be so different. I aligned my reads with STAR, and got dissimilar mapping. The following is samtools flagstat output for hg19 then ensembl aligned reads:

29991285 + 702821 in total (QC-passed reads + QC-failed reads)
2278391 + 64385 secondary
0 + 0 supplementary
0 + 0 duplicates
29991285 + 702821 mapped (100.00% : 100.00%)
27712894 + 638436 paired in sequencing
13871636 + 315364 read1
13841258 + 323072 read2
27665798 + 627952 properly paired (99.83% : 98.36%)
27665798 + 627952 with itself and mate mapped
47096 + 10484 singletons (0.17% : 1.64%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)



30213813 + 705544 in total (QC-passed reads + QC-failed reads)
2286418 + 64605 secondary
0 + 0 supplementary
0 + 0 duplicates
30213813 + 705544 mapped (100.00% : 100.00%)
27927395 + 640939 paired in sequencing
13978944 + 316580 read1
13948451 + 324359 read2
27880148 + 630366 properly paired (99.83% : 98.35%)
27880148 + 630366 with itself and mate mapped
47247 + 10573 singletons (0.17% : 1.65%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I then counted genes with featureCounts, and noticed that the ensembl annotation has way more info. The first is output from a couple read summaries with hg19 annotation and then ensembl annotation:

|| Load annotation file /groups/shared_databases/igenome/Homo_sapiens/UCS ... ||
||    Features : 392959                                                       ||
||    Meta-features : 23228                                                   ||
||    Chromosomes/contigs : 47                                                ||
||                                                                            ||
|| Process BAM file  ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 33961661                                              ||
||    Successfully assigned fragments : 14868327 (43.8%)                      ||
||    Running time : 3.29 minutes                                             ||
||                                                                            ||
|| Process BAM file  ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 33899360                                              ||
||    Successfully assigned fragments : 18155450 (53.6%)                      ||
||    Running time : 3.44 minutes                                             ||


|| Load annotation file /groups/shared_databases/igenome/Homo_sapiens/Ens ... ||
||    Features : 1309155                                                      ||
||    Meta-features : 62069                                                   ||
||    Chromosomes/contigs : 244                                               ||
||                                                                            ||
|| Process BAM file ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 34164741                                              ||
||    Successfully assigned fragments : 27688203 (81.0%)                      ||
||    Running time : 0.52 minutes                                             ||
||                                                                            ||
|| Process BAM file ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 34067700                                              ||
||    Successfully assigned fragments : 28410464 (83.4%)                      ||
||    Running time : 0.65 minutes                                             ||

The big difference is obviously that the ensembl annotation contains way more chromosomes/contigs and features. Are these added pieces of information obfuscating my results? Are they unnecessary, or added benefits?

ADD REPLY • link 7.7 years ago by gregory.l.stone ▴ 30

1

Entering edit mode

could you please use buttons for formatting on top of the message area like "101 010" to format your text? It is very hard to read it at the moment. Thank you

ADD REPLY • link 7.7 years ago by Petr Ponomarenko ★ 2.8k

1

Entering edit mode

Im sorry, this is my first post!I just made edits for clarity, sorry again

ADD REPLY • link 7.7 years ago by gregory.l.stone ▴ 30

1

Entering edit mode

Now much better =) Thank you

ADD REPLY • link 7.7 years ago by Petr Ponomarenko ★ 2.8k

1

Entering edit mode

What biological questions are you trying to answer with this analysis?

Ensebl has an annotation with many more features. This is well known and ok. You may open say RB1 gene in NSBI or UCSC and compare how many more data is in Ensembl for it. This is ok when you know what and why you are doing. Say for clinical genetics you are most interested only in one very particular transcript of RB1 and it is present in both annotations and almost identical. So all of other data is irrelevant for the first run of analysis in both annotations.

ADD REPLY • link 7.7 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

I am conducting a differential gene expression analysis to determine the difference in the change in gene expression in each sex due to a treatment. For this type of study would you recommend UCSC or ensembl?

ADD REPLY • link 7.7 years ago by gregory.l.stone ▴ 30

1

Entering edit mode

humans? If you want to publish with others able to use it in clinical research or settings, I would recommend reading about ACMG guidelines, about LRG and, personally, I would go with longest transcripts from RefSeq for each protein-coding genes first. If nothing interesting or meaningful will be found ask Biostars once again. How many reads you have? What is the treatment? What is the coverage? Be aware of repetitive and low-complexity regions.

ADD REPLY • link 7.7 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Yes, human samples. I have a paired study, with an average of 50 million paired end reads per sample. Thank you very much for your thoughtful answers and for being so generous with your time!

ADD REPLY • link 7.7 years ago by gregory.l.stone ▴ 30

0

Entering edit mode

you are welcome, gregory.l.stone =)

ADD REPLY • link 7.7 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

I think your question might be useful for many others. Could you please edit your question so it is easier for others to understand the scope of it and mark it one of the answers as accepted so others now that it was resolved. Maybe you can also summarize what you will find in ACMG guidelines and LRG later and add it here as well for others.

Also, I would love if others double-check my answers. My understanding can be not the best. But from my experience RefSeq and HGVS are very important for clinical people to understand any findings.

Thank you.

ADD REPLY • link 7.7 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Will do, thank you again for the help

ADD REPLY • link 7.7 years ago by gregory.l.stone ▴ 30