I've performed RNA-Seq analysis for A. thaliana using the TopHat/Cufflinks pipeline for 2 conditions. How can I extract the sequences of all isoforms of the genes which are differentially alternatively spliced b/w the two conditions? (Significant genes as output by cuffdiff in the splicing.diff file)
While browsing through the splicing.diff output file, I found that the coordinates for genes mentioned in the file are nowhere to be found in the transcript.gtf file produced by Cufflinks.
Sample splicing.diff file:
TSS1 XLOC_000001 AT1G01010 Chr1:3630-5899 WT20 WT37 NOTEST 0 0 0 0 1 1 yes
TSS10 XLOC_000007 AT1G01180 Chr1:75582-76758 WT20 WT37 NOTEST 0 0 0 0 1 1 yes
Sample transcript.gtf file:
Chr1 Cufflinks transcript 11903 12897 1000 - . gene_id "WT_37.1"; transcript_id "WT_37.1.1"; FPKM "0.6185344748"; frac "1.000000"; conf_lo "0.341917"; conf_hi "0.895152"; cov "2.561616";
Chr1 Cufflinks exon 11903 12897 1000 - . gene_id "WT_37.1"; transcript_id "WT_37.1.1"; exon_number "1"; FPKM "0.6185344748"; frac "1.000000"; conf_lo "0.341917"; conf_hi "0.895152"; cov "2.561616";
As you can see, the gene_id
in the transcript.gtf
file is also not same as that in the splicing.diff
file, which further complicates extraction. The coordinates in splicing.diff
files are usually not present in the transcript.gtf
file. Either of the start or end coordinates is few always ~100nt apart while the other end is much further away.
I've tried gffreads, but it only extracts sequences for the transcript.gtf file for both conditions and not differentially spliced between the two conditions. The GeneID's in gtf and splicing.diff files are different, so it gets more confusing.
Convert the
splicing.diff
to bed format and usebedtools getfasta
?Thanks for the suggestion. I can do that, but it'll give me the entire gene sequence for all differentially spliced genes in
splicing.diff
and not all the isoforms of these genes per se. Any way to get the isoforms from both conditions for these differentially spliced genes?