Question

Tophat / Cufflinks Not Using Gene_Id Or Gene_Name From Gtf File

3

Entering edit mode

12.1 years ago

Owen S. ▴ 370

I am trying to get tophat / cufflinks to work with Ensembl annotations.

Although this looks like a long post, my problem is quite simple: I keep getting CUFF.1, CUFF.2, CUFF.3, etc, as my geneid, in the genes.fpkmtracking file:

$ head genes.fpkm_tracking 
tracking_id     class_code      nearest_ref_id  gene_id gene_short_name tss_id  locus   length  coverage        FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
CUFF.1  -       -       CUFF.1  -       -       chr1:568966-570302      -       -       1080.24 1019.13 1141.36 OK
CUFF.4  -       -       CUFF.4  -       -       chr1:979748-982084      -       -       19.74   10.1647 29.3153 OK
CUFF.2  -       -       CUFF.2  -       -       chr1:982297-984413      -       -       11.6648 5.19434 18.1353 OK
CUFF.3  -       -       CUFF.3  -       -       chr1:881909-892399      -       -       24.6448 15.3299 33.9596 OK

What I want is to have the Ensembl ENSG annotations, and possibly gene_name as well.

Yes, I realize the chromosome annotation (1st column) in the GTF/GFF file must match the FASTA header exactly.

I have tried the "native" Ensembl GTF file (chromosomes numbered 1, 2, 3..., X, Y, MT) and a bowtie2 genome index built from Ensembl's GRCh37.69 FASTA:

1       processed_transcript    exon    11869   12227   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002234944";
1       processed_transcript    exon    12613   12721   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002867822";
1       processed_transcript    exon    13221   14409   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";

I have tried the native GTF with "chr" added in, and with the FASTA file modified in the same way (and of course bt2 index rebuilt):

chr1    processed_transcript    exon    11869   12227   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002234944";
chr1    processed_transcript    exon    12613   12721   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002867822";
chr1    processed_transcript    exon    13221   14409   .       +       .        gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";

I have also tried using cufflinks gffread utility, to make a GFF file from the above GTF file:

chr1    processed_transcript    gene    11869   14412   .       +       .       ID=ENSG00000223972;Name=DDX11L1;transcripts=ENST00000450305,ENST00000456328,ENST00000515242,ENST00000518655
chr1    processed_transcript    misc_RNA        11869   14409   .       +       .       ID=ENST00000456328;Parent=ENSG00000223972;geneID=ENSG00000223972;gene_name=DDX11L1;Name=DDX11L1-002;type=pseudogene
chr1    processed_transcript    exon    11869   12227   .       +       .       Parent=ENST00000456328
chr1    processed_transcript    exon    12613   12721   .       +       .       Parent=ENST00000456328
chr1    processed_transcript    exon    13221   14409   .       +       .       Parent=ENST00000456328

For which the corresponding bowtie2 index has these entries:

$ bowtie2-inspect -n /hulk/genomes/Ensembl/Homo_sapiens.GRCh37.69.dna.chromosome 
chr10
chr11
chr12
chr13
...

I have even tried using the iGenomes files, with the same results!

The way I am running tophat and cufflinks is like this:

tophat -G Homo_sapiens.GRCh37.69.gtf --transcriptome-index=Homo_sapiens.GRCh37.69.genes -p 24 -o output Homo_sapiens.GRCh37.69.dna Sample1_1.fq.gz Sample1_2.fq.gz
cufflinks -p 6 -o output output/accepted_hits.bam

It is almost like I am running it without annotations, but the tophat / cufflinks programs never complain about being unable to match annotations with the FASTA file or anything:

[2012-11-09 17:11:40] Beginning TopHat run (v2.0.6)
-----------------------------------------------
[2012-11-09 17:11:40] Checking for Bowtie
                  Bowtie version:        2.0.2.0
[2012-11-09 17:11:40] Checking for Samtools
                Samtools version:        0.1.18.0
[2012-11-09 17:11:41] Checking for Bowtie index files
[2012-11-09 17:11:41] Checking for reference FASTA file
[2012-11-09 17:11:41] Generating SAM header for /hulk/genomes/Ensembl/Homo_sapiens.GRCh37.69.dna.chromosome
        format:          fastq
        quality scale:   phred33 (default)
[2012-11-09 17:11:44] Reading known junctions from GTF file
[2012-11-09 17:12:02] Preparing reads
         left reads: min. length=30, max. length=101, 999911 kept reads (89 discarded)
        right reads: min. length=30, max. length=101, 999341 kept reads (659 discarded)
[2012-11-09 17:12:43] Creating transcriptome data files..
...snip...
[2012-11-09 18:10:58] Reporting output tracks
-----------------------------------------------
[2012-11-09 18:16:25] Run complete: 01:04:44 elapsed

You are using Cufflinks v2.0.2, which is the most recent release.
[18:16:26] Inspecting reads and determining fragment length distribution.
> Processed 120637 loci.                       [*************************] 100%
> Map Properties:
>       Normalized Map Mass: 985482.42
>       Raw Map Mass: 985482.42
>       Fragment Length Distribution: Empirical (learned)
>                     Estimated Mean: 163.12
>                  Estimated Std Dev: 50.10
[18:17:15] Assembling transcripts and estimating abundances.
> Processed 120951 loci.                       [*************************] 100%

Yes, this topic has been discussed online ad nauseum. Here, and here, and Cufflinks - Assigning Transcripts Or Exons, and many other places. I feel like I have read every possible internet page about this topic, but still the problem vexes me. In the past I once had this working just fine. In fact, I found that tophat / cufflinks worked very nicely with Ensembl annotations. However, it seems like something has changed -- either the software or the annotations? Am I the only one experiencing this? Can someone help me see what I am overlooking? Any help would be much appreciated.

Here are the software versions I am using:

$ bowtie2 --version
/usr/local/bin/bowtie2-2.0.2/bowtie2-align version 2.0.2
64-bit
Built on igm1
Wed Oct 31 23:16:47 EDT 2012
$ tophat --version
TopHat v2.0.6
$ cufflinks 
cufflinks v2.0.2
linked against Boost version 104900

tophat cufflinks gtf gff ensembl • 15k views

ADD COMMENT • link updated 12.1 years ago by Nick Crawford ▴ 210 • written 12.1 years ago by Owen S. ▴ 370

3

Entering edit mode

Are you just trying to get tag counts? Have you tried running cufflinks with the -G option?

ADD REPLY • link 12.1 years ago by Damian Kao 16k

1

Entering edit mode

Damian, Thank you. I am slightly embarrassed by the question now but I will leave it here for others to see. I see I was trying to adhere too closely to the code shown in the authors' Nature Protocols paper (p.571) which does not show using -G or -g with cufflinks. In the past, I must have been using this option, because I know in the past I was getting proper carry-over of the reference annotations.

Thanks again for taking the time to answer.

ADD REPLY • link 12.1 years ago by Owen S. ▴ 370

0

Entering edit mode

could you please add [solved] to the title for people scanning unanswered questions

ADD REPLY • link 12.1 years ago by Ying W ★ 4.3k

score 1 · Accepted Answer · 2012-11-20

1

Entering edit mode

12.1 years ago

Nick Crawford ▴ 210

It doesn't appear that you're setting either the -G/--GTF or the -g/--GTF-guide flags when you run cufflinks. See comments above as well.

ADD COMMENT • link 12.1 years ago by Nick Crawford ▴ 210

0

Entering edit mode

Gss, totally saved my day this post, has been trying to figure out this for days now, i thought the first step in the tophat is going to inherit all of the gene_id...

ADD REPLY • link 11.5 years ago by mad.cichlids ▴ 140

0

Entering edit mode

Hi, Sorry to bother you guys I have take the argument -g gene.gtf along the way of tophat, cufflinks, cuffmerge. Why the name of my gene_exp.diff file still using gene names rather than gene_id, and why the gene_id was replaced by some automatically generated numbers like :XLOC_OOOOO1,,,XLOC_OOOOO15466...... Do i need to specify it in cuffdiff too? but i did not see this -g argument in the cuff.diff. Any way that i can specify to take gene_id rather than gene names? By the way, i was using the ensemble gene annotation file. Thank you so much.

tophat -p 4 -G genes.gtf -o tophat_results024 genome paired_024-3_L1_I005.R1.clean.fastq,paired_024-3_L1_I005.R2.clean.fastq

cufflinks -p 4 -g genes.gtf -o cufflink_results025_mt/ tophat_results025_mt/accepted_hits.bam

cuffmerge -g genes.gtf -s genome.fa -p 4 assemblies_mt.txt

cuffdiff -o diff_out_mt -b genome.fa -p 4 -L Y24,Y25 -u merged_asm/merged.gtf ./tophat_results024_mt/accepted_hits.bam ./tophat_results025_mt/accepted_hits.bam

ADD REPLY • link 11.5 years ago by mad.cichlids ▴ 140