I'm performing an RNA-seq analysis. I'm making a count matrix using bam files and a gtf file to make a count matrix.
I wanted to compare my results using a gtf file from the hg19 (ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/) build of the genome and the hg38 (ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/) build of the genome.
I used GenomicFeatures and GenomicAlignments in R to create a count matrix of raw reads using the following code.
filenames=list.files() bamfiles <- BamFileList(filenames, yieldSize=2000000) se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=TRUE) counts=assay(se)
My count matrix is completely different depending on which gtf file I use.
When I plot the statistics from DSS or DESeq results in the downstream analysis based on the gene id (i.e. ENSG00000223972), I get no correlation.
Why would I get completely different count matrices and results when I use different gtf files?
Can someone point me to good gtf files for hg19 and hg38 builds of the genome?
No wonder. You can't swap GTF files alone. You will need to realign the data against the other genome if you want to get the right results. Your chromosome names also need to match so make sure you use compatible genome builds and GTF files that go with them. You can find bundled sequence/annotation/indexes at iGenomes site.
Use featureCounts to do your counting.
I think I spotted your problem right here: you are using a GTF file.