7.1 years ago
qudrat
Hello all, I removed all single-exon transcripts from each of the transcriptome assemblies using the gffread tool (part of the Cufflinks package): gffread transcripts.gtf -T -U -o transcripts_multiexon.gtf. But when I analyzed the sequences individually, I still found some that were single-exonic. Can somebody shed light on this?
Hey again qudrat, have you looked at both the -U and -C command-line parameters?
Perhaps you could paste some examples of the sequences that remain in your GTF that you expected to be removed?
Hi Kevin, actually I did use the gffread -U -T command. I am pasting one such sequence below.
Hi qudrat,
Do you have the FASTA header and/or the GTF entries that you believe should have been removed?
Hi Kevin, yes, it has a FASTA header and the GTF entries as well.
No, I mean, can you share them here? Are they single base-pair exons?
Hi qudrat,
Can you also share the GTF file entries? - those are key.
I just tested it on my computer and the -U switch works fine with gffread, i.e., it does not include single-exon transcripts in my GTF file:
Hi Kevin,
So, that's the problem. There's no information that says that that is a single exon. That line relates to a 'transcript', which may or may not have a single exon - we're not to know how many exons are contained within it.
You may want to check the -C, -J, and -E command-line switches.
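One way to settle whether a transcript really is single-exon is to count its 'exon' lines directly in the GTF rather than relying on the 'transcript' line. The following is only a minimal sketch: the function names are made up for illustration, it assumes every exon line carries a transcript_id "..." attribute, and it falls back to whitespace splitting for lines whose tabs were lost when pasted.

```python
import re
from collections import Counter


def count_exons(gtf_lines):
    """Count 'exon' features per transcript_id (hypothetical helper)."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # Pasted text often has tabs converted to spaces; keep the
            # attribute column (9th field) in one piece.
            fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def single_exon_transcripts(gtf_lines):
    """Return the sorted IDs of transcripts with exactly one exon line."""
    return sorted(t for t, n in count_exons(gtf_lines).items() if n == 1)
```

A transcript that has a 'transcript' line but no 'exon' lines at all will not appear in the counts, so it is worth checking for those separately.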
Hi Kevin, actually this is not a single-exon transcript; it has three exons, but when I use gffread to get the FASTA sequence, it gives the sequence of only the third exon. There are a few more such transcripts.
I see, can you paste the entire original transcript from the GTF, i.e., the transcript that has the 3 exons?
Hi Kevin,
chr1 StringTie transcript 533488 630489 0 - 0 transcript_id "TCONS_00012328" gene_id "XLOC_002003" oId "STRG.11.1" tss_id "TSS6802"
chr1 StringTie exon 533488 533516 0 - 0 transcript_id "TCONS_00012328" gene_id "XLOC_002003" exon_number "1"
chr1 StringTie exon 591734 591751 0 - 0 transcript_id "TCONS_00012328" gene_id "XLOC_002003" exon_number "2"
chr1 StringTie exon 629345 630489 0 - 0 transcript_id "TCONS_00012328" gene_id "XLOC_002003" exon_number "3"
Are you sure that your file is formatted correctly? I am using gffread as part of cufflinks 2.2.1.
I have different outputs from my commands, but note that I had to edit the GTF/GFF file entries that you pasted above because they are not in the correct GTF/GFF format.
Note that my -U switch does not remove these entries because they are not single exons. They form a 3-exon gene.
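The thread does not say which edits to the pasted lines were needed. As a hedged illustration of the kind of check that can catch such problems, the sketch below tests a line against a few basic GTF conventions: nine tab-separated fields, integer coordinates, a valid strand, and a semicolon-terminated attribute column. The function name is hypothetical, and gffread itself may be more tolerant than this strict check.

```python
def gtf_line_problems(line):
    """Return a list of basic formatting problems for one GTF line."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        problems.append(
            f"expected 9 tab-separated fields, found {len(fields)}"
        )
        return problems  # the remaining checks need all nine columns
    start, end, strand = fields[3], fields[4], fields[6]
    if not (start.isdigit() and end.isdigit() and int(start) <= int(end)):
        problems.append("start/end must be integers with start <= end")
    if strand not in {"+", "-", "."}:
        problems.append(f"unexpected strand {strand!r}")
    attrs = fields[8].strip()
    if not attrs.endswith(";"):
        problems.append("attribute column does not end with ';'")
    if 'transcript_id "' not in attrs:
        problems.append("no transcript_id attribute")
    return problems
```

Running this over every non-comment line of a file and printing only the lines that return problems gives a quick picture of whether the file was damaged, for example by tabs being converted to spaces.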
It is correctly formatted, but I really do not understand why only a few such transcripts give the sequence of just one exon.
Is there any way to extract the sequence of just one transcript, like the aforementioned transcript?
Yes, you can always just edit the GTF/GFF file and only keep the regions that you want, and then extract FASTA sequence over these.
It's perfectly fine to do things like that if you know exactly what you're doing, obviously. It helps to think flexibly like that because bioinformatics tools don't always function as we would expect.
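A minimal sketch of the "keep only the regions that you want" step, assuming the lines of interest can be identified by their transcript_id attribute. The function name is hypothetical and not part of any tool mentioned in this thread.

```python
import re


def select_transcript(gtf_lines, transcript_id):
    """Keep only the GTF lines that belong to one transcript_id."""
    # Include the closing quote so that "T1" does not also match "T10".
    pattern = re.compile(r'transcript_id "' + re.escape(transcript_id) + r'"')
    return [
        line
        for line in gtf_lines
        if not line.startswith("#") and pattern.search(line)
    ]
```

The selected lines can then be written to a small GTF file and passed to gffread together with the genome FASTA (its -g option) to write the spliced transcript sequence (its -w option). Comparing that sequence's length with the summed exon lengths is a quick check that all exons were used.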
Thank you very much Kevin!