I followed steps similar to those in the procedure section of Trapnell et al., 2012 for an RNA-Seq analysis of Oryza sativa datasets. The problem I face is in the Cuffdiff output, where more than one FPKM value is reported for many genes, as shown below:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
XLOC_005901 XLOC_005901 LOC_Os01g46440 Chr1:26422826-26425093 wild mutant OK 2.48243 8.09808 1.70582 0.193218 0.35875 0.999974 no
XLOC_002129 XLOC_002129 LOC_Os01g46440 Chr1:26422826-26425093 wild mutant OK 26.4118 20.9721 -0.332716 -0.280221 0.63665 0.999974 no
XLOC_003921 XLOC_003921 LOC_Os01g03040 Chr1:1159160-1164635 wild mutant OK 72.5969 77.3011 0.09058 0.0824954 0.8934 0.999974 no
XLOC_003922 XLOC_003922 LOC_Os01g03040 Chr1:1159160-1164635 wild mutant NOTEST 0.40255 0.306853 -0.391618 0 1 1 no
I noticed that some threads like this one discussed similar issues. But this case is different, since 1) not all genes with alternatively spliced forms report multiple FPKM values, and 2) there is a big difference between the FPKM values reported for the same gene.
I downloaded the GFF3 file and the genome sequence from the RGAP MSU database FTP site. TopHat is version 2.
Can anyone suggest exactly what is going wrong here?
Or will I need to run the analysis again with annotation and genome index files from a source such as iGenomes?
Oh, you have different ids mapping to the same locus, chromosome and nucleotide positions (start and end).
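To see how widespread this is, one could list every gene name that Cuffdiff reports under more than one XLOC id. The snippet below is only a rough sketch: it assumes the output is tab-separated with the column names shown in the question, and it uses the four rows quoted above (trimmed to the first nine columns) in place of the real file. The function name genes_with_multiple_ids is made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the Cuffdiff output quoted in the question,
# trimmed to the first nine columns and written as tab-separated text.
SAMPLE = (
    "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\n"
    "XLOC_005901\tXLOC_005901\tLOC_Os01g46440\tChr1:26422826-26425093\twild\tmutant\tOK\t2.48243\t8.09808\n"
    "XLOC_002129\tXLOC_002129\tLOC_Os01g46440\tChr1:26422826-26425093\twild\tmutant\tOK\t26.4118\t20.9721\n"
    "XLOC_003921\tXLOC_003921\tLOC_Os01g03040\tChr1:1159160-1164635\twild\tmutant\tOK\t72.5969\t77.3011\n"
    "XLOC_003922\tXLOC_003922\tLOC_Os01g03040\tChr1:1159160-1164635\twild\tmutant\tNOTEST\t0.40255\t0.306853\n"
)


def genes_with_multiple_ids(rows):
    """Map (gene, locus) to its XLOC ids, keeping only pairs with more than one id."""
    ids_by_gene = defaultdict(list)
    for row in rows:
        ids_by_gene[(row["gene"], row["locus"])].append(row["gene_id"])
    return {key: ids for key, ids in ids_by_gene.items() if len(ids) > 1}


# For a real run, replace io.StringIO(SAMPLE) with an open file handle.
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
duplicates = genes_with_multiple_ids(rows)
for (gene, locus), ids in duplicates.items():
    print(gene, locus, ",".join(ids))
```

Comparing this list between the current run and the earlier run that gave unfragmented FPKM values might help narrow down where the two runs diverge.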
Could a failing processor result in such fragmentation? Does anyone have previous experience with this? BTW, I noticed that using similar commands with this annotation file gave me unfragmented FPKM values previously, which means it's not a problem with the annotation file. @theobromma22
It is well known that a single gene locus can give rise to more than one gene product, or mRNA, via alternative splicing, so it seems this is what is happening in your case. From the literature, you have several options: keep one of those genes, take the average of those genes to represent a single expression value for that locus, or keep them separate. Which option you choose depends on your overall research goal.
I forgot to mention that this can be done programmatically or manually; just remember to describe how you did this bit in your M&M section. Also, if you use the first option, you should choose the one that has the highest expression level or fold-change value.
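As a rough illustration of the programmatic route, here is a small sketch of the first two options (keep one entry, or average them). The FPKM values are the ones quoted in the question. The helper collapse_by_gene is hypothetical, and "highest expression" is assumed here to mean the largest summed FPKM across both samples, which is only one possible reading.

```python
from collections import defaultdict

# FPKM values copied from the rows quoted in the question.
ROWS = [
    {"gene_id": "XLOC_005901", "gene": "LOC_Os01g46440", "value_1": 2.48243, "value_2": 8.09808},
    {"gene_id": "XLOC_002129", "gene": "LOC_Os01g46440", "value_1": 26.4118, "value_2": 20.9721},
    {"gene_id": "XLOC_003921", "gene": "LOC_Os01g03040", "value_1": 72.5969, "value_2": 77.3011},
    {"gene_id": "XLOC_003922", "gene": "LOC_Os01g03040", "value_1": 0.40255, "value_2": 0.306853},
]


def collapse_by_gene(rows, how="max"):
    """Return {gene: (value_1, value_2)} with one entry per gene name.

    how="max"  keeps the entry with the largest value_1 + value_2 (an assumption
               about what "highest expression" means).
    how="mean" averages value_1 and value_2 over all entries of the gene.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["gene"]].append(row)

    collapsed = {}
    for gene, members in grouped.items():
        if how == "max":
            best = max(members, key=lambda r: r["value_1"] + r["value_2"])
            collapsed[gene] = (best["value_1"], best["value_2"])
        elif how == "mean":
            n = len(members)
            collapsed[gene] = (
                sum(r["value_1"] for r in members) / n,
                sum(r["value_2"] for r in members) / n,
            )
        else:
            raise ValueError("how must be 'max' or 'mean'")
    return collapsed


by_max = collapse_by_gene(ROWS, how="max")
by_mean = collapse_by_gene(ROWS, how="mean")
print(by_max)
print(by_mean)
```

Note that collapsing FPKM values this way does not recompute the test statistics or p-values, so those columns would no longer correspond to the collapsed numbers.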
Thanks for the direction, @theobroma22