Tophat / Cufflinks Not Using Gene_Id Or Gene_Name From Gtf File
1
3
Entering edit mode
12.0 years ago
Owen S. ▴ 370

I am trying to get tophat / cufflinks to work with Ensembl annotations.

Although this looks like a long post, my problem is quite simple: I keep getting CUFF.1, CUFF.2, CUFF.3, etc, as my geneid, in the genes.fpkmtracking file:

$ head genes.fpkm_tracking 
tracking_id     class_code      nearest_ref_id  gene_id gene_short_name tss_id  locus   length  coverage        FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
CUFF.1  -       -       CUFF.1  -       -       chr1:568966-570302      -       -       1080.24 1019.13 1141.36 OK
CUFF.4  -       -       CUFF.4  -       -       chr1:979748-982084      -       -       19.74   10.1647 29.3153 OK
CUFF.2  -       -       CUFF.2  -       -       chr1:982297-984413      -       -       11.6648 5.19434 18.1353 OK
CUFF.3  -       -       CUFF.3  -       -       chr1:881909-892399      -       -       24.6448 15.3299 33.9596 OK

What I want is to have the Ensembl ENSG annotations, and possibly gene_name as well.

Yes, I realize the chromosome annotation (1st column) in the GTF/GFF file must match the FASTA header exactly.

I have tried the "native" Ensembl GTF file (chromosomes numbered 1, 2, 3..., X, Y, MT) and a bowtie2 genome index built from Ensembl's GRCh37.69 FASTA:

1       processed_transcript    exon    11869   12227   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002234944";
1       processed_transcript    exon    12613   12721   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002867822";
1       processed_transcript    exon    13221   14409   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";

I have tried the native GTF with "chr" added in, and with the FASTA file modified in the same way (and of course bt2 index rebuilt):

chr1    processed_transcript    exon    11869   12227   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002234944";
chr1    processed_transcript    exon    12613   12721   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002867822";
chr1    processed_transcript    exon    13221   14409   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";

I have also tried using cufflinks gffread utility, to make a GFF file from the above GTF file:

chr1    processed_transcript    gene    11869   14412   .       +       .       ID=ENSG00000223972;Name=DDX11L1;transcripts=ENST00000450305,ENST00000456328,ENST00000515242,ENST00000518655
chr1    processed_transcript    misc_RNA        11869   14409   .       +       .       ID=ENST00000456328;Parent=ENSG00000223972;geneID=ENSG00000223972;gene_name=DDX11L1;Name=DDX11L1-002;type=pseudogene
chr1    processed_transcript    exon    11869   12227   .       +       .       Parent=ENST00000456328
chr1    processed_transcript    exon    12613   12721   .       +       .       Parent=ENST00000456328
chr1    processed_transcript    exon    13221   14409   .       +       .       Parent=ENST00000456328

For which the corresponding bowtie2 index has these entries:

$ bowtie2-inspect -n /hulk/genomes/Ensembl/Homo_sapiens.GRCh37.69.dna.chromosome 
chr10
chr11
chr12
chr13
...

I have even tried using the iGenomes files, with the same results!

The way I am running tophat and cufflinks is like this:

tophat -G Homo_sapiens.GRCh37.69.gtf --transcriptome-index=Homo_sapiens.GRCh37.69.genes -p 24 -o output Homo_sapiens.GRCh37.69.dna Sample1_1.fq.gz Sample1_2.fq.gz
cufflinks -p 6 -o output output/accepted_hits.bam

It is almost like I am running it without annotations, but the tophat / cufflinks programs never complain about being unable to match annotations with the FASTA file or anything:

[2012-11-09 17:11:40] Beginning TopHat run (v2.0.6)
-----------------------------------------------
[2012-11-09 17:11:40] Checking for Bowtie
                  Bowtie version:        2.0.2.0
[2012-11-09 17:11:40] Checking for Samtools
                Samtools version:        0.1.18.0
[2012-11-09 17:11:41] Checking for Bowtie index files
[2012-11-09 17:11:41] Checking for reference FASTA file
[2012-11-09 17:11:41] Generating SAM header for /hulk/genomes/Ensembl/Homo_sapiens.GRCh37.69.dna.chromosome
        format:          fastq
        quality scale:   phred33 (default)
[2012-11-09 17:11:44] Reading known junctions from GTF file
[2012-11-09 17:12:02] Preparing reads
         left reads: min. length=30, max. length=101, 999911 kept reads (89 discarded)
        right reads: min. length=30, max. length=101, 999341 kept reads (659 discarded)
[2012-11-09 17:12:43] Creating transcriptome data files..
...snip...
[2012-11-09 18:10:58] Reporting output tracks
-----------------------------------------------
[2012-11-09 18:16:25] Run complete: 01:04:44 elapsed

You are using Cufflinks v2.0.2, which is the most recent release.
[18:16:26] Inspecting reads and determining fragment length distribution.
> Processed 120637 loci.                       [*************************] 100%
> Map Properties:
>       Normalized Map Mass: 985482.42
>       Raw Map Mass: 985482.42
>       Fragment Length Distribution: Empirical (learned)
>                     Estimated Mean: 163.12
>                  Estimated Std Dev: 50.10
[18:17:15] Assembling transcripts and estimating abundances.
> Processed 120951 loci.                       [*************************] 100%

Yes, this topic has been discussed online ad nauseum. Here, and here, and Cufflinks - Assigning Transcripts Or Exons, and many other places. I feel like I have read every possible internet page about this topic, but still the problem vexes me. In the past I once had this working just fine. In fact, I found that tophat / cufflinks worked very nicely with Ensembl annotations. However, it seems like something has changed -- either the software or the annotations? Am I the only one experiencing this? Can someone help me see what I am overlooking? Any help would be much appreciated.

Here are the software versions I am using:

$ bowtie2 --version
/usr/local/bin/bowtie2-2.0.2/bowtie2-align version 2.0.2
64-bit
Built on igm1
Wed Oct 31 23:16:47 EDT 2012
$ tophat --version
TopHat v2.0.6
$ cufflinks 
cufflinks v2.0.2
linked against Boost version 104900
tophat cufflinks gtf gff ensembl • 15k views
ADD COMMENT
3
Entering edit mode

Are you just trying to get tag counts? Have you tried running cufflinks with the -G option?

ADD REPLY
1
Entering edit mode

Damian, Thank you. I am slightly embarrassed by the question now but I will leave it here for others to see. I see I was trying to adhere too closely to the code shown in the authors' Nature Protocols paper (p.571) which does not show using -G or -g with cufflinks. In the past, I must have been using this option, because I know in the past I was getting proper carry-over of the reference annotations.

Thanks again for taking the time to answer.

ADD REPLY
0
Entering edit mode

could you please add [solved] to the title for people scanning unanswered questions

ADD REPLY
1
Entering edit mode
12.0 years ago
Nick Crawford ▴ 210

It doesn't appear that you're setting either the -G/--GTF or the -g/--GTF-guide flags when you run cufflinks. See comments above as well.

ADD COMMENT
0
Entering edit mode

Gss, totally saved my day this post, has been trying to figure out this for days now, i thought the first step in the tophat is going to inherit all of the gene_id...

ADD REPLY
0
Entering edit mode

Hi, Sorry to bother you guys I have take the argument -g gene.gtf along the way of tophat, cufflinks, cuffmerge. Why the name of my gene_exp.diff file still using gene names rather than gene_id, and why the gene_id was replaced by some automatically generated numbers like :XLOC_OOOOO1,,,XLOC_OOOOO15466...... Do i need to specify it in cuffdiff too? but i did not see this -g argument in the cuff.diff. Any way that i can specify to take gene_id rather than gene names? By the way, i was using the ensemble gene annotation file. Thank you so much.

tophat -p 4 -G genes.gtf -o tophat_results024 genome paired_024-3_L1_I005.R1.clean.fastq,paired_024-3_L1_I005.R2.clean.fastq

cufflinks -p 4 -g genes.gtf -o cufflink_results025_mt/ tophat_results025_mt/accepted_hits.bam

cuffmerge -g genes.gtf -s genome.fa -p 4 assemblies_mt.txt

cuffdiff -o diff_out_mt -b genome.fa -p 4 -L Y24,Y25 -u merged_asm/merged.gtf ./tophat_results024_mt/accepted_hits.bam ./tophat_results025_mt/accepted_hits.bam

ADD REPLY

Login before adding your answer.

Traffic: 2133 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6