Problem with using a GTF file in Cufflinks in Galaxy
8.6 years ago

Dear all,

I am using Cufflinks in Galaxy, but I have run into a problem with the FPKM values.

When I use the hg19 GTF file (Shared Data > Data Libraries > iGenomes) together with the human reference genome available in Galaxy (hg19, selected as a TopHat option), the Cufflinks output lists the gene IDs, but most FPKMs are zero.

Can you help me find out the problem?

How can I choose a compatible reference genome for that GTF file?

Thank you in advance

Nazanin

RNA-Seq GTF Cufflinks FPKM
8.6 years ago
michael.ante ★ 3.9k

Hi Nazanin,

Can you check the chromosome names in the GTF and the BAM file? Do they start with "chr" in the GTF and without it in the BAM, or vice versa? Was the alignment performed against the hg19 reference?
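
In case it helps, here is a minimal Python sketch of that check. It assumes you have downloaded the GTF from Galaxy and dumped the BAM header to a text file yourself (for example with `samtools view -H`). The function names and the tiny inline inputs are made up for illustration.

```python
import io


def gtf_seqnames(handle):
    """Return the set of sequence names found in column 1 of a GTF file."""
    names = set()
    for line in handle:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def sam_header_seqnames(handle):
    """Return the set of SN: values from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in handle:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


# Made-up inputs for illustration only; in practice pass open("your.gtf")
# and open("your_header.txt") instead (both file names are hypothetical).
example_gtf = io.StringIO(
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";\n'
    'chr2\tsrc\texon\t300\t400\t.\t-\t.\tgene_id "geneB"; transcript_id "txB";\n'
)
example_header = io.StringIO(
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:2\tLN:243199373\n"
)

gtf_names = gtf_seqnames(example_gtf)
bam_names = sam_header_seqnames(example_header)
shared = gtf_names & bam_names

print("only in GTF:", sorted(gtf_names - bam_names))
print("only in BAM:", sorted(bam_names - gtf_names))
print("shared:", sorted(shared))
```

If `shared` comes back empty, as in this toy case, the annotation and the alignment use different naming schemes, and that kind of mismatch is a common reason for FPKMs that are all zero.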

Cheers,

Michael


Thanks for your response.

In both the hg19 GTF and the Cufflinks output (assembled transcripts), the chromosome names start with "chr".

When I run TopHat, I also use the hg19 reference genome.

best

Nazanin

8.6 years ago

It's not surprising that some FPKMs are 0 (depending on your total number of reads and the tissue).

-Edited to replace 'a lot of' by 'some'-


If most (as stated by @Nazanin) are zero, then that may be an indication of an upstream problem :-), assuming public Galaxy has the right combination of reference and GTF files.


Right, edited my response to "some". I guess subjective quantitative terms can be misinterpreted easily ;)


Do you mean that it is OK if the FPKMs of "some" genes are zero, because of the total number of reads and the tissue being studied?


It may be ok since not every gene would be expressed/detected under all conditions. That said, can you be more specific? How many are zero (out of a total #) in your data?


Thanks for your comment.

Yes, you're right. The file has 4082 zeros out of 65536 rows (or maybe more, because my Excel cannot show more).

best

Nazanin


If you can, avoid Excel as much as possible in bioinformatics analysis. Besides automatic reformatting (turning floats or gene names into dates), you may see differences due to floating-point arithmetic.
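
A rough sketch of counting the zeros without Excel, in Python. It assumes the Cufflinks gene table is tab-delimited with a header row that contains an `FPKM` column (as in `genes.fpkm_tracking`). The function name and the inline rows are invented for illustration, and most columns are left out to keep it short.

```python
import csv
import io


def count_zero_fpkm(handle, column="FPKM"):
    """Return (number of rows with FPKM == 0, total number of rows)."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = 0
    zero = 0
    for row in reader:
        total += 1
        if float(row[column]) == 0.0:
            zero += 1
    return zero, total


# Made-up rows for illustration; a real table has more columns.
# In practice, pass an open file handle for the downloaded table instead.
example_table = io.StringIO(
    "tracking_id\tgene_id\tlocus\tFPKM\n"
    "geneA\tgeneA\tchr1:100-200\t0\n"
    "geneB\tgeneB\tchr1:300-400\t12.5\n"
    "geneC\tgeneC\tchr2:100-200\t0\n"
    "geneD\tgeneD\tchr2:300-400\t3.1\n"
)

zero, total = count_zero_fpkm(example_table)
print(f"{zero} of {total} genes have FPKM == 0")
```

A count done on the raw file is also not capped by the spreadsheet's row limit (65536 rows in the old .xls format), so the total it reports covers the whole table.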


That looks like nothing to worry about. These genes are either lowly expressed (and therefore not sequenced) or just tissue-specific.
