Hello folks,
I wondered about the appropriate gene annotation for finding differential expression with Cufflinks in human RNA-Seq data without finding new splice junctions (using hg19).
My test data is from the SRA: http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP000698
As an annotation file, I downloaded Homo_sapiens.GRCh37.63.gtf (as seen in other tutorials) and as genome I use hg19, so no assembly was done (and I skipped the cufflinks step and started directly with cuffdiff). I ask this question because I got no differential expression with cuffdiff because all tests failed with NOTEST. But viewing the bam-files with IGV they are well aligend at exons.
This is my pipeline (ok, I pooled the samples, this seems not to be the best way, but anyway, I should get some results), based on http://vallandingham.me/RNA_seq_differential_expression.html
tophat -G /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf -p 4 -r 160 -o output/SRR0278/top_SRR027863-65_withannotation ../References/hg19/hg19 fastq/SRR0278/SRR027863_1.fastq,fastq/SRR0278/SRR027864_1.fastq,fastq/SRR0278/SRR027865_1.fastq fastq/SRR0278/SRR027863_2.fastq,fastq/SRR0278/SRR027864_2.fastq,fastq/SRR0278/SRR027865_2.fastq
tophat -G /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf -p 4 -r 160 -o output/SRR0278/top_SRR027866-67_withannotation ../References/hg19/hg19 fastq/SRR0278/SRR027866_1.fastq,fastq/SRR0278/SRR027867_1.fastq fastq/SRR0278/SRR027866_2.fastq,fastq/SRR0278/SRR027867_2.fastq
cuffdiff -p 4 -L actived,resting -o diff_SRR027863-65vs66-67_withannotation /ngs_process/projects/References/annotations/Homo_sapiens.GRCh37.63.gtf top_SRR027863-65_withannotation/accepted_hits.bam top_SRR027866-67_withannotation/accepted_hits.bam
When looking at the file geneexp.diff, I got useless results all the way (head geneexp.diff):
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 ln(fold_change) test_stat p_value q_value significant
ENSG00000000003 ENSG00000000003 TSPAN6 X:99883666-99894988 actived resting NOTEST 0 0 0 0 1 1 no
ENSG00000000005 ENSG00000000005 TNMD X:99839798-99854882 actived resting NOTEST 0 0 0 0 1 1 no
ENSG00000000419 ENSG00000000419 DPM1 20:49551403-49575092 actived resting NOTEST 0 0 0 0 1 1 no
ENSG00000000457 ENSG00000000457 SCYL3 1:169631244-169863408 actived resting NOTEST 0 0 0 0 1 1 no
ENSG00000000460 ENSG00000000460 C1orf112 1:169631244-169863408 actived resting NOTEST 0 0 0 0 1 1 no
ENSG00000000938 ENSG00000000938 FGR 1:27938574-27961788 actived resting NOTEST 0 0 0 0 1 1 no
ENSG00000000971 ENSG00000000971 CFH 1:196621007-196716634 actived resting NOTEST 0 0 0 0 1 1 no
ENSG00000001036 ENSG00000001036 FUCA2 6:143816613-143832827 actived resting NOTEST 0 0 0 0 1 1 no
ENSG00000001084 ENSG00000001084 GCLC 6:53362138-53481969 actived resting NOTEST 0 0 0 0 1 1 no
Using samtools flagstat topSRR027863-65withannotation/accepted_hits.bam:
28324268 in total
0 QC failure
0 duplicates
28324268 mapped (100.00%)
28324268 paired in sequencing
14290946 read1
14033322 read2
20901782 properly paired (73.79%)
23773098 with itself and mate mapped
4551170 singletons (16.07%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)
And samtools flagstat topSRR027866-67withannotation/accepted_hits.bam:
14750160 in total
0 QC failure
0 duplicates
14750160 mapped (100.00%)
14750160 paired in sequencing
7202896 read1
7547264 read2
10245492 properly paired (69.46%)
12095724 with itself and mate mapped
2654436 singletons (18.00%)
0 with mate mapped to a different chr
0 with mate mapped to a different chr (mapQ>=5)
So I wondered if this annotation file is the best to use in this case. Or why do I get no useful results? I provided tophat and cuffdiff the same annotation file, so even if its wrong, it should give me some results.
Are there any ideas what went wrong? Running without any annotation as described in the Cufflinks tutorial ( http://cufflinks.cbcb.umd.edu/tutorial.html ), I get some results with differential expression (my old pipeline: http://biostar.stackexchange.com/questions/11401/meaning-of-fpkm-value-used-by-cufflinks )
I also found http://cufflinks.cbcb.umd.edu/igenomes.html that provided an annotation package: ftp://igenome:G3nom3s4u@ftp.illumina.com/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz but I'm not sure if the gene.gtf file I found in the package is appropriate. Where is the difference to Homo_sapiens.GRCh37.63.gtf ?
I would be happy if anyone has any hints for me :-)
Thanks in advance, Oliver
Further details:
cuffdiff v1.0.3 (2403)
TopHat v1.3.1
Ok, actually I tried the gene.gtf file from iGenome and this works in some way since the exons are identified by the regions on their chromosomes and not by an scaffold id. I got useful test results for 66% of the genes in the file, the other 34% fail. Not that satisfying at all but better than before :-)
As for genes "failing", keep in mind that in any given differentiated tissue, not all genes will be expressed AT ALL. So, 65% of genes expressed does not sound all that unusual to me.
Thanks for that hint, Sean. Actually, both files are from UCSC. But indeed, the hg19 sequence names start with 'chr..' (and actually nothing follows despite the chromosome number ;-)) and the gtf-file entries contain in the first column something like 'GL000213.1' ... so this is no chromosome number. Could this be the problem? If so, how do I solve it?
The chromosome names need to match in the reference to which you aligned and the gtf file.