Question

what should be the expected result of cuffmerge?

1

Entering edit mode

10.8 years ago

Pei ▴ 240

Hi:

I am working on Drosophila RNA-seq data as the Nat Protocal paper [1]

After running the program Cufflinks, I got expression value (FPKM) which were associated with IDs like "CUFF.2"

Then I run the program cuffmerge "to create a single merged transcriptome annotation" [1]

But what I got is a file merged.gtf with lines like

2L    Cufflinks    exon    74903    75018    .    +    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000004"; exon_number "3"; gene_name "galectin"; oId "CUFF.2.2"; nearest_ref "FBtr0078101"; class_code "j"; tss_id "TSS3";

This file merged.gtf did not provide FPKM values at all.

Have I got the correct/expected result?

Should Cuffmerge returns expression values associated with each gene (such as a matrix with gene symbol as row names and FPKMs in each row)

Thanks in advance!

Best

[1] Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. ; 7(3): 562-578. doi:10.1038/nprot.2012.016

cuffmerge • 7.1k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Pei ▴ 240

0

Entering edit mode

Thank you Devon!

Could you please also tell me how to generate a matrix with rows for gene expression and cols for samples which using gene symbol (or other identifiers) as row names?

That is, how to transform "CUFF.2.2" to a symbol making biological sense.

Should I write scripts for this purpose?

Many Thanks!

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Pei ▴ 240

0

Entering edit mode

cuffmerge (and cufflinks, for that matter) can be given a GTF or GFF annotation file that will contain things like gene names. You can get that from where ever you downloaded the reference fasta file. Regarding a matrix of expression values, I don't use cufflinks/cuffdiff enough to know of a simple method to do that, though I presume that cuffdiff would output such a file.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Devon Ryan 105k

0

Entering edit mode

Thank you Devon!

I have download all annotation file from http://ccb.jhu.edu/software/tophat/igenomes.shtml

and used these two file for Tophat:

Drosophila_melanogaster/Ensembl/BDGP5.25/Annotation/Archives/archive-2014-05-23-16-02-55/Genes/genes.gtf
Drosophila_melanogaster/Ensembl/BDGP5.25/Sequence/BowtieIndex/genome

The cuffmerge output file merged_asm/merged.gtf contains 3900 CUFF IDs while the cufflinks output transcripts.gtf contains > 10000 CUFF IDs.

Do you think there is something inconsistent, given that many cufflinks identified CUFFs were missed by cuffmerge?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Pei ▴ 240

0

Entering edit mode

Usually the files from igenomes work well together. Perhaps there's a problem in this case. Try to run cuffmerge again, ensuring that you specify the genes.gtf file. You might also look into the genes.gtf file to ensure that it contains gene names and IDs.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Devon Ryan 105k

0

Entering edit mode

Thank you!

Below are the first two lines of genes.gtf file:

2L    protein_coding    exon    7529    8116    .    +    .    exon_number "1"; gene_id "FBgn0031208"; gene_name "CG11023"; p_id "P9062"; transcript_id "FBtr0300689"; transcript_name "CG11023-RB"; tss_id "TSS8382";
2L    protein_coding    exon    7529    8116    .    +    .    exon_number "1"; gene_id "FBgn0031208"; gene_name "CG11023"; p_id "P8862"; transcript_id "FBtr0300690"; transcript_name "CG11023-RC"; tss_id "TSS8382";

It seemed OK. But I found that under the dir BowtieIndex, files actually do not provide the gene symbol and transcripts ID. Could this be the reason for showing CUFF.2.2 in transcripts.gtf files?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Pei ▴ 240

0

Entering edit mode

I assume that the files in BowtieIndex are just fasta files, so they wouldn't contain those names. That's expected. Try rerunning cuffmerge with the gtf file and see if that fixes things. If not then I don't know what the problem is.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks a lot!

When running cuffmerge I got such warnings:

Warning: couldn't find fasta record for '2LHet'!
Warning: couldn't find fasta record for '2RHet'!
Warning: couldn't find fasta record for '3LHet'!
Warning: couldn't find fasta record for '3RHet'!
Warning: couldn't find fasta record for 'U'!
Warning: couldn't find fasta record for 'XHet'!
Warning: couldn't find fasta record for 'YHet'!
Warning: couldn't find fasta record for 'dmel_mitochondrion_genome'!

Did you see thing like these?

Thanks!

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Pei ▴ 240

0

Entering edit mode

That suggests a mismatch between the files. I don't use drosophila for anything so I'm never used the files you're using.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.8 years ago by Devon Ryan 105k

Ram · Answer 1 · 2015-02-04

The purpose of cuffmerge isn't to produce FPKMs. What cufflinks does is (1) look for annotated and unannotated transcripts in a sample and (2) assign an FPKM value to them. The issue is, that different samples will typically result in somewhat different annotations, so the FPKM values aren't comparable (well, they wouldn't be anyway, but that's another issue). Consequently, what cuffmerge does is merge the annotations from all of the samples into a single master annotation that can then be used to estimate FPKM values for each of the samples.

I should also note that FPKM values can't reliably be estimated one sample at a time. The reason is that with a single sample, you have no way to control for library differences other than simply dividing by the library size, which is known to be quite unreliable. Consequently, cuffdiff (or cuffnorm in the updated pipeline, I think) has to re-estimate everything with the merged annotation and then perform normalization of the results before it can produce FPKM values.

BTW, unless you actually need to look for novel isoforms/genes, you're best off just not using cufflinks/cuffdiff. Those programs are very slow and not particularly flexible.