I am conducting a differential gene expression analysis to determine how the treatment-induced change in gene expression differs between the sexes in human samples. For this type of study, would you recommend UCSC or Ensembl? I am concerned by the large differences in the GO term and KEGG pathway results that I get from alignment and differential expression analysis using the Ensembl GRCh37 genome versus the UCSC hg19 genome. The two differed all the way through the workflow, which was to be expected since the reference genomes and annotations are different. Which one is more reliable? My results are so different between the two that I'm finding it hard to make sense of it all. Any help would be greatly appreciated.
Thank you for your response. I did look at that, but I have also read other posts that suggest the results shouldn't be so different. I aligned my reads with STAR and got dissimilar mapping results. The following is samtools flagstat output for the hg19-aligned reads, then the Ensembl-aligned reads:
I then counted genes with featureCounts and noticed that the Ensembl annotation has far more information. The first is output from a couple of read summaries with the hg19 annotation, and the second with the Ensembl annotation:
The big difference is obviously that the Ensembl annotation contains many more chromosomes/contigs and features. Are these added pieces of information obfuscating my results? Are they unnecessary, or added benefits?
Could you please use the formatting buttons above the message area, such as "101 010", to format your text? It is very hard to read at the moment. Thank you.
I'm sorry, this is my first post! I just made edits for clarity. Sorry again.
Now much better =) Thank you
What biological questions are you trying to answer with this analysis?
Ensembl has an annotation with many more features. This is well known and OK. You can open, say, the RB1 gene in NCBI or UCSC and compare how much more data Ensembl has for it. This is fine as long as you know what you are doing and why. For clinical genetics, say, you are mostly interested in one particular transcript of RB1, which is present in both annotations and almost identical. So all of the other data is irrelevant for the first run of the analysis, whichever annotation you use.
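To see where the extra features come from, it can help to tally each GTF by feature type, separately for the primary chromosomes and for everything else (patches, haplotypes, unplaced contigs). Below is a minimal sketch, not a definitive tool. It assumes standard 9-column tab-separated GTF lines, and it assumes that stripping the "chr" prefix and mapping chrM to MT is enough to compare UCSC-style and Ensembl-style contig names. The function names and the PRIMARY set are my own, for illustration only.

```python
from collections import Counter

# Primary human chromosomes in Ensembl-style naming (no "chr" prefix).
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}


def normalize_contig(name):
    """Map UCSC-style names (chr1, chrM) to Ensembl-style names (1, MT)."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name


def summarize_gtf(lines):
    """Count GTF records per feature type.

    Returns (primary_counts, other_counts, contigs), where contigs is the
    set of normalized contig names seen in the file.
    """
    primary, other = Counter(), Counter()
    contigs = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        contig = normalize_contig(fields[0])
        contigs.add(contig)
        target = primary if contig in PRIMARY else other
        target[fields[2]] += 1  # column 3 is the feature type
    return primary, other, contigs
```

You would call it once per annotation, e.g. `summarize_gtf(open(path))` with your own file paths, and compare the two results. If most of the extra Ensembl records fall in the "other" group, restricting both annotations to the primary chromosomes is one way to make the comparison fairer.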
I am conducting a differential gene expression analysis to determine how the treatment-induced change in gene expression differs between the sexes. For this type of study, would you recommend UCSC or Ensembl?
Human samples? If you want to publish in a way that others can use in clinical research or clinical settings, I would recommend reading about the ACMG guidelines and about LRG. Personally, I would first go with the longest RefSeq transcript for each protein-coding gene. If nothing interesting or meaningful is found, ask Biostars again. How many reads do you have? What is the treatment? What is the coverage? Be aware of repetitive and low-complexity regions.
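As a rough illustration of picking the longest transcript per gene, here is a minimal sketch. It makes several assumptions of my own: "longest" means the largest summed exon length, the input is a GTF whose exon records carry gene_id and transcript_id attributes, and transcript IDs are unique across genes. It does not filter for protein-coding genes, and the function name is only for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR = re.compile(r'(\S+) "([^"]*)"')


def longest_transcripts(lines):
    """Return {gene_id: (transcript_id, exonic_length)} for the transcript
    with the largest summed exon length in each gene."""
    lengths = defaultdict(int)  # transcript_id -> summed exon length
    gene_of = {}                # transcript_id -> gene_id
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to transcript length
        attrs = dict(ATTR.findall(fields[8]))
        gene, tx = attrs.get("gene_id"), attrs.get("transcript_id")
        if gene is None or tx is None:
            continue
        # GTF coordinates are 1-based and inclusive on both ends.
        lengths[tx] += int(fields[4]) - int(fields[3]) + 1
        gene_of[tx] = gene
    best = {}
    for tx, length in lengths.items():
        gene = gene_of[tx]
        # On ties, the transcript seen first in the file is kept.
        if gene not in best or length > best[gene][1]:
            best[gene] = (tx, length)
    return best
```

The selected transcript IDs could then be used to subset the GTF before counting. Whether the longest transcript is the right representative for a given gene is a judgment call, so treat this as a starting point.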
Yes, human samples. I have a paired study, with an average of 50 million paired-end reads per sample. Thank you very much for your thoughtful answers and for being so generous with your time!
you are welcome, gregory.l.stone =)
I think your question might be useful for many others. Could you please edit your question so that its scope is easier for others to understand, and mark one of the answers as accepted so others know that it was resolved? Maybe you can also summarize what you find in the ACMG guidelines and LRG later and add it here for others.
Also, I would love it if others double-checked my answers. My understanding may not be the best. But in my experience, RefSeq and HGVS are very important for clinical people to understand any findings.
Thank you.
Will do, thank you again for the help