I want to convert Cufflinks output from the transcript level to the gene level, keeping the exon info, for gene expression analysis.
Is there an existing method to collapse Cufflinks transcripts to genes? I thought I would check before writing code.
Thanks! -Abhi
If you used the -G option and a reference GTF file with Cufflinks, then you should have an expression value for each transcript in this GTF file. The original GTF file should contain transcript-to-gene relationships allowing you to merge multiple transcripts to a single gene. The GTF file should also contain strand info. You may also be able to use the GTF file that is generated during the Cufflinks run and stored in the output directory. Since Cufflinks does merge from the transcript level to the gene level for you, perhaps you can combine the genes.fpkm_tracking file with the GTF to get a single file with expression, strand, and exon level info.
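Here is a minimal sketch of that combination in Python. It assumes both files are tab-delimited, as Cufflinks writes them, and that every exon line in the GTF carries a gene_id attribute. The function names `parse_gtf_exons` and `join_gene_fpkm` are made up for illustration and are not part of Cufflinks.

```python
import csv
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_exons(gtf_lines):
    """Collect the strand and the set of exon (start, end) pairs per gene_id."""
    genes = defaultdict(lambda: {"strand": None, "exons": set()})
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only exon records are needed; transcript records are skipped.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene = genes[attrs["gene_id"]]
        gene["strand"] = fields[6]  # column 7 of a GTF line is the strand
        gene["exons"].add((int(fields[3]), int(fields[4])))
    return genes


def join_gene_fpkm(tracking_lines, genes):
    """Yield one record per gene: id, locus, strand, FPKM, and sorted exons."""
    reader = csv.DictReader(tracking_lines, delimiter="\t")
    for row in reader:
        # Genes absent from the GTF get an unknown strand and no exons.
        info = genes.get(row["gene_id"], {"strand": ".", "exons": set()})
        yield {
            "gene_id": row["gene_id"],
            "locus": row["locus"],
            "strand": info["strand"],
            "FPKM": float(row["FPKM"]),
            "exons": sorted(info["exons"]),
        }
```

Both functions take any iterable of lines, so open file handles for transcripts.gtf and genes.fpkm_tracking can be passed in directly. Exons shared by several transcripts are stored once per gene because they are kept in a set.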
As for exon level info ... This is not clear. Each gene may have multiple transcripts and each transcript has multiple exons. The exons from each transcript may be unique to that transcript, redundant with an exon in one or more additional transcripts, or partially overlapping with an exon in one or more additional transcripts. If you merge to the gene level, what do you mean by maintaining exon level info? Perhaps you can further describe exactly what you want your output file to look like. Do you want one row per exon? Or one row per gene? If so, how would exon information be represented in this file?
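One possible reading of "gene level with exon info" is a union model, where overlapping exons from all transcripts of a gene are merged into non-overlapping intervals. This is only one interpretation and may not be what is wanted. A small Python sketch, assuming 1-based inclusive coordinates as in GTF; `merge_exons` is a hypothetical helper name:

```python
def merge_exons(exons):
    """Merge overlapping or directly adjacent (start, end) intervals.

    Coordinates are assumed to be 1-based and inclusive, as in GTF.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Exons of the two ENSG00000236601 transcripts from a transcripts.gtf file.
exons = [
    (453633, 454166), (460408, 460480),  # ENST00000450983
    (453827, 454166), (460380, 460465),  # ENST00000412666
]
print(merge_exons(exons))
# [(453633, 454166), (460380, 460480)]
```

Note that merging discards which transcript each exon came from, so if per-transcript structure matters, a one-row-per-exon output with a transcript column would be a better fit.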
For purposes of discussion, here is some sample Cufflinks output:
genes.fpkm_tracking
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
ENSG00000236601 - - ENSG00000236601 ENSG00000236601 - 1:453632-460480 - - 0 0 0 OK
ENSG00000224813 - - ENSG00000224813 ENSG00000224813 - 1:329783-453948 - - 0.00976477 0 0.0663024 OK
isoforms.fpkm_tracking
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
ENST00000450983 - - ENSG00000236601 ENSG00000236601 - 1:453632-460480 607 0 0 0 0 OK
ENST00000412666 - - ENSG00000236601 ENSG00000236601 - 1:453826-460465 426 0 0 0 0 OK
ENST00000431812 - - ENSG00000224813 ENSG00000224813 - 1:329783-334271 336 0.190769 0.00976477 0 0.0292943 OK
ENST00000445840 - - ENSG00000224813 ENSG00000224813 - 1:334125-334305 180 1.02218e-07 5.23219e-09 0 0.0413242 OK
ENST00000455207 - - ENSG00000224813 ENSG00000224813 - 1:334128-446155 413 3.98904e-15 2.04184e-16 0 0.0167376 OK
ENST00000455464 - - ENSG00000224813 ENSG00000224813 - 1:334139-342806 573 1.4205e-13 7.27101e-15 0 0.028762 OK
ENST00000440163 - - ENSG00000224813 ENSG00000224813 - 1:439364-453722 462 0 0 0 0 OK
ENST00000453935 - - ENSG00000224813 ENSG00000224813 - 1:450886-453942 498 0 0 0 0 OK
ENST00000431321 - - ENSG00000224813 ENSG00000224813 - 1:453216-453948 406 0 0 0 0 OK
transcripts.gtf
1 Cufflinks transcript 453633 460480 1 - . gene_id "ENSG00000236601"; transcript_id "ENST00000450983"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 453633 454166 1 - . gene_id "ENSG00000236601"; transcript_id "ENST00000450983"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 460408 460480 1 - . gene_id "ENSG00000236601"; transcript_id "ENST00000450983"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks transcript 453827 460465 1 - . gene_id "ENSG00000236601"; transcript_id "ENST00000412666"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 453827 454166 1 - . gene_id "ENSG00000236601"; transcript_id "ENST00000412666"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 460380 460465 1 - . gene_id "ENSG00000236601"; transcript_id "ENST00000412666"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks transcript 329784 334271 1000 + . gene_id "ENSG00000224813"; transcript_id "ENST00000431812"; FPKM "0.0097647655"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.029294"; cov "0.190769";
1 Cufflinks exon 329784 329976 1000 + . gene_id "ENSG00000224813"; transcript_id "ENST00000431812"; exon_number "1"; FPKM "0.0097647655"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.029294"; cov "0.190769";
1 Cufflinks exon 334129 334271 1000 + . gene_id "ENSG00000224813"; transcript_id "ENST00000431812"; exon_number "2"; FPKM "0.0097647655"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.029294"; cov "0.190769";
1 Cufflinks transcript 334126 334305 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000445840"; FPKM "0.0000000052"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.041324"; cov "0.000000";
1 Cufflinks exon 334126 334305 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000445840"; exon_number "1"; FPKM "0.0000000052"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.041324"; cov "0.000000";
1 Cufflinks transcript 334129 446155 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000455207"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.016738"; cov "0.000000";
1 Cufflinks exon 334129 334297 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000455207"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.016738"; cov "0.000000";
1 Cufflinks exon 439467 439568 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000455207"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.016738"; cov "0.000000";
1 Cufflinks exon 446014 446155 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000455207"; exon_number "3"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.016738"; cov "0.000000";
1 Cufflinks transcript 334140 342806 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000455464"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.028762"; cov "0.000000";
1 Cufflinks exon 334140 334297 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000455464"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.028762"; cov "0.000000";
1 Cufflinks exon 342392 342806 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000455464"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.028762"; cov "0.000000";
1 Cufflinks transcript 439365 453722 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000440163"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 439365 439568 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000440163"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 446014 446193 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000440163"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 453645 453722 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000440163"; exon_number "3"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks transcript 450887 453942 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000453935"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 450887 451086 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000453935"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 453645 453942 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000453935"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks transcript 453217 453948 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000431321"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 453217 453318 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000431321"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
1 Cufflinks exon 453645 453948 1 + . gene_id "ENSG00000224813"; transcript_id "ENST00000431321"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
The genes.fpkm_tracking file should have gene-level information.
@RM : I probably should have mentioned it, but the information in the genes.fpkm_tracking file is not sufficient. It has neither the exon info nor the strand of the gene; it contains only the gene start and end coordinates.
@Abhi: can you give an example input and the desired output so that the question is clearer?
You should also indicate how you ran Cufflinks. In particular, did you supply a GTF file of known transcripts using the -G or -g option?
I did supply my GTF file of known transcripts, but I am not sure if that will make any difference to the question. What I am looking for is a way to collapse the Cufflinks transcripts into genes while retaining the exon and strand info. Thanks!