Hello,
First, a brief note: this topic has been brought up in the past, but I would like some more specific feedback.
When comparing RNA-seq to ChIP-seq data, ideally one would use the same assembly and annotation files for both assays. For epigenetics work, I've found the UCSC genome and annotation files very valuable, given the wealth of information available from tools such as the Table Browser and their compatibility with several R packages that use UCSC genomes. However, one issue I have with using UCSC for RNA-seq is their GTF. UCSC usually relies more on transcript IDs than gene IDs, and I have noticed that the gene_id field in their GTF files (see below) usually contains the same value as transcript_id. Summarizing the data at the gene level can therefore be problematic with functions such as featureCounts in R, where meta-features are defined by the information present in the GTF, so each transcript ends up being treated as its own gene.
gene_id "ENSMUST00000169927.1"; transcript_id "ENSMUST00000169927.1"
(the above is also present in other tracks such as RefSeq)
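To make the problem concrete, here is a minimal Python sketch that flags a GTF attribute string whose gene_id just repeats the transcript_id, and rewrites gene_id from a transcript-to-gene lookup. The function names and the `tx2gene` dictionary are hypothetical, and "GENE_A" is a placeholder rather than the real gene for this transcript; in practice the mapping would have to come from some external table.

```python
import re

# Matches key "value" pairs in the 9th (attributes) column of a GTF line.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return the GTF attribute column as a dict of key -> value."""
    return dict(ATTR_RE.findall(attr_field))


def gene_id_is_transcript_id(attr_field):
    """True when gene_id is present and identical to transcript_id."""
    attrs = parse_attributes(attr_field)
    gene_id = attrs.get("gene_id")
    return gene_id is not None and gene_id == attrs.get("transcript_id")


def fix_gene_id(attr_field, tx2gene):
    """Replace gene_id using a transcript -> gene lookup (hypothetical table).

    The attribute string is returned unchanged if the transcript is unmapped.
    """
    attrs = parse_attributes(attr_field)
    gene = tx2gene.get(attrs.get("transcript_id"))
    if gene is None:
        return attr_field
    # A function replacement avoids backslash escapes being interpreted.
    return re.sub(r'gene_id "[^"]*"', lambda m: f'gene_id "{gene}"',
                  attr_field, count=1)


line = 'gene_id "ENSMUST00000169927.1"; transcript_id "ENSMUST00000169927.1"'
tx2gene = {"ENSMUST00000169927.1": "GENE_A"}  # placeholder mapping
print(gene_id_is_transcript_id(line))
print(fix_gene_id(line, tx2gene))
```

With a corrected gene_id, a gene-level summary would group transcripts of the same gene together instead of counting each one separately.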
This is a non-issue if the Ensembl assembly and annotation file are used for RNA-seq. If I were looking at RNA-seq and ChIP-seq separately, I would have no problem choosing Ensembl for RNA-seq and UCSC for ChIP-seq or any other epigenetics assay. However, if one wants to compare the results from RNA-seq to ChIP-seq, what is the best approach to resolve the conflicts between the databases? I find that these comparisons are not described in as much detail as they should be in manuscripts ("we aligned our data to the rn6 genome....(but from where etc...)"), and I am wondering whether people are actually choosing two different databases and simply using the gene names as the basis for the comparisons.
Any advice would help greatly!!
To clarify though: while the assembly version should indeed be the same, both Ensembl (I focus on Ensembl because GENCODE covers only human and mouse) and UCSC distribute their own genome files. How much discrepancy is there between these genomes when they are the same version, such as mm10 or rn6? I am aware of the chromosome-name differences and of potential differences in the extra sequences at the end of the FASTA files, but if everything else is the same, then I don't see the harm in aligning each dataset to the genome from the database whose annotation I want to use.
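As an illustration of the chromosome-name difference, here is a small Python sketch that converts Ensembl-style names for the primary chromosomes to UCSC style. The function name is hypothetical, and the sketch deliberately handles only numbered chromosomes, X, Y, and MT; scaffolds and other extra sequences return None because they would need an explicit lookup table.

```python
def ensembl_to_ucsc(name):
    """Convert an Ensembl-style primary chromosome name to UCSC style.

    Returns None for anything that is not a numbered chromosome, X, Y,
    or MT (for example scaffolds), since those need a lookup table.
    """
    if name == "MT":
        return "chrM"
    if name in {"X", "Y"} or name.isdigit():
        return "chr" + name
    return None


for ensembl_name in ["1", "19", "X", "MT", "GL456210.1"]:
    print(ensembl_name, "->", ensembl_to_ucsc(ensembl_name))
```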
I ask because the UCSC GTF still doesn't make sense to me for RNA-seq, given that the IDs are repeated. So if RNA-seq were aligned and annotated with Ensembl, and ChIP-seq were aligned with UCSC and annotated with the Ensembl track at UCSC (from which promoter, intron, and exon coordinates can be easily extracted), would this be an issue? Thanks for the clarification.
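For reference, this is roughly what is meant by extracting promoter coordinates from an annotation record: a Python sketch that derives a strand-aware window around the transcription start site. The function name and the window sizes (2000 bp upstream, 500 bp downstream) are arbitrary assumptions for illustration, and coordinates are taken to be 1-based and inclusive, as in GTF.

```python
def promoter_window(start, end, strand, upstream=2000, downstream=500):
    """Return a (start, end) promoter window around the TSS.

    Coordinates are 1-based and inclusive, as in GTF. The TSS is the
    feature start on the + strand and the feature end on the - strand.
    The default window sizes are arbitrary illustrative choices.
    """
    if strand == "+":
        tss = start
        lo, hi = tss - upstream, tss + downstream
    elif strand == "-":
        tss = end
        lo, hi = tss - downstream, tss + upstream
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    # Clamp at 1 so the window never runs off the start of the chromosome.
    return max(1, lo), hi


print(promoter_window(10000, 20000, "+"))
print(promoter_window(10000, 20000, "-"))
```

Windows computed this way from the same annotation could then be intersected with ChIP-seq peaks, provided both use the same chromosome naming.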
To complement my comment, I wanted to point out that I have indeed seen previous Biostars posts discussing this, such as this one
But in the end, there was no conclusion about how to deal with transcript variants in RNA-seq gene-level analysis, which is relevant to my original question if I wanted to use UCSC for both analyses.