Cufflinks analysis and gffread
0
0
Entering edit mode
7.1 years ago
qudrat ▴ 100

Hello all, I removed all single single exon transcripts from each of the transcriptome assemblies using gffread tool (part of Cufflinks package): gffread transcripts.gtf -T -U -o transcripts_multiexon.gtf. But when I was analyzing sequences individually I still found some sequences that were single exonic. Can somebody shed light on it?

RNA-Seq Assembly • 3.1k views
ADD COMMENT
0
Entering edit mode

Hey again qudrat, have you looked at both the -U and -C command-line parmeters?

Perhaps you could paste some examples of the sequences that remain in your GTF that you expected to be removed?

ADD REPLY
0
Entering edit mode

Hi Kevin, actually I did use gffread -U -T command. I am pasting one such sequence below

seq TGGCTATGATGGTGGGGATGATGAGGCTATTGTTTTTTGTGAATTCTTCGATAATGGCCCATTTGGGCAA AAAGCCGGTTAGCGGGGGCAGGCCTCCTAGGGAGAGGAGGGTGGATGGAATTAAGGGTGTTAGTCATGTT AGCTTGTTTCAGGTGCGAGATAGTAGTAGGGTTGTGGTGCTGGAGTTTAAGTTGAGTAGTAGGAATGCGG TAGTAGTTAGGATAATATAAATAGTTAAATTAAGAATGGTTATGTTAGGGTTGTACGGTAGAACTGCTGT TATTCATCCTATGTGGGTAATTGAGGAGTATGCTAAGATTTTGCGTAGTTGGGTTTGGTTTAATCCACCT CAACTGCCTGCCATGATGGATAAGATTGAAAGAGTGAGGAGAAGGCTTACGTTTAATGAGGGAGAAATTT GGTATATGATTGAGATGGGGGCTAGTTTTTGTCATGTGAGAAGGAGCAGGCCGGATGTCAGAGGGGTGCC TTGGGTAACCTCTGGGACTCAGAAGTGAAAGGGGGCTATTCCTAGTTTTATTGCTATAGCCATTATGATT ATTAATGATGAGTATTGATTGGTGGTATTGGTTATGGTTCATTGTCCGGAGAGTATATTGTTGAAGAGGA TAGCTATTAGAAGGATTATGGATGCGGTTGCTTGCGTGAGGAAATACTTGATGGCAGCTTCTGTGGAACG AGGGTTTATTTTTTTGGTTAGAACTGGAATAAAAGCTAGCATGTTTATTTCTAGGCCTACTCAGGTAAAA AATCAGTGCGAGCTTAGCGCTGTGATGAGTGTGCCTGCAAAGATGGTAGAGTAGATGACGGGTTGGGCCA GGGGATTAATTAGTACGGGAAGGATATAACCAACATTTTCGGGGTATGGGCCCGATAGCTTATTTAGCTG ACCTTACTTtaggatggggtgtgataggtggcacggagaattttggattctcagggatgggttcgattct catagtcctagAAATAAGGGGATTTAAACTCCTATTATTTACTCTATCAAAGTAACTCTTTTATCAGACA TATTTCTTAGGTTTGAGGGGGAATGCTGGAGATTGTAATGGGTATGGAGACATATCATATAAGTAATGCT AGGGTGAGTGGTAGGAAGTTTTTTCttttttttttttttttttttttttttttttttttttttttttgag ac

ADD REPLY
0
Entering edit mode

Hi qudrat,

Do you have the FASTA header and/or the GTF entries that you believe should have been removed?

ADD REPLY
0
Entering edit mode

Hi Kevin, Yes It has fasta header and the GTF entries as well.

ADD REPLY
0
Entering edit mode

No, I mean, can you share them here? Are they single base-pair exons?

ADD REPLY
0
Entering edit mode
>TCONS_00012328 gene=XLOC_002003
TGGCTATGATGGTGGGGATGATGAGGCTATTGTTTTTTGTGAATTCTTCGATAATGGCCCATTTGGGCAA
AAAGCCGGTTAGCGGGGGCAGGCCTCCTAGGGAGAGGAGGGTGGATGGAATTAAGGGTGTTAGTCATGTT
AGCTTGTTTCAGGTGCGAGATAGTAGTAGGGTTGTGGTGCTGGAGTTTAAGTTGAGTAGTAGGAATGCGG
TAGTAGTTAGGATAATATAAATAGTTAAATTAAGAATGGTTATGTTAGGGTTGTACGGTAGAACTGCTGT
TATTCATCCTATGTGGGTAATTGAGGAGTATGCTAAGATTTTGCGTAGTTGGGTTTGGTTTAATCCACCT
CAACTGCCTGCCATGATGGATAAGATTGAAAGAGTGAGGAGAAGGCTTACGTTTAATGAGGGAGAAATTT
GGTATATGATTGAGATGGGGGCTAGTTTTTGTCATGTGAGAAGGAGCAGGCCGGATGTCAGAGGGGTGCC
TTGGGTAACCTCTGGGACTCAGAAGTGAAAGGGGGCTATTCCTAGTTTTATTGCTATAGCCATTATGATT
ATTAATGATGAGTATTGATTGGTGGTATTGGTTATGGTTCATTGTCCGGAGAGTATATTGTTGAAGAGGA
TAGCTATTAGAAGGATTATGGATGCGGTTGCTTGCGTGAGGAAATACTTGATGGCAGCTTCTGTGGAACG
AGGGTTTATTTTTTTGGTTAGAACTGGAATAAAAGCTAGCATGTTTATTTCTAGGCCTACTCAGGTAAAA
AATCAGTGCGAGCTTAGCGCTGTGATGAGTGTGCCTGCAAAGATGGTAGAGTAGATGACGGGTTGGGCCA
GGGGATTAATTAGTACGGGAAGGATATAACCAACATTTTCGGGGTATGGGCCCGATAGCTTATTTAGCTG
ACCTTACTTtaggatggggtgtgataggtggcacggagaattttggattctcagggatgggttcgattct
catagtcctagAAATAAGGGGATTTAAACTCCTATTATTTACTCTATCAAAGTAACTCTTTTATCAGACA
TATTTCTTAGGTTTGAGGGGGAATGCTGGAGATTGTAATGGGTATGGAGACATATCATATAAGTAATGCT
AGGGTGAGTGGTAGGAAGTTTTTTCttttttttttttttttttttttttttttttttttttttttttgag
ac
ADD REPLY
0
Entering edit mode

Hi qudrat,

Can you also share the GTF file entries? - those are key.

I just tested it on my computer and the -U switch works fine with gffread, i.e., It does not include single-exon transcripts in my GTF file:

cat mergedAll.gtf 
BacterialGenome.1   Cufflinks   exon    2   500 .   +   .   gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.10.1"; tss_id "TSS1";
BacterialGenome.1   Cufflinks   exon    2   500 .   -   .   gene_id "XLOC_000628"; transcript_id "TCONS_00000630"; exon_number "1"; oId "CUFF.1.1"; tss_id "TSS630";
ADD REPLY
0
Entering edit mode

Hi Kavin,

chr1    StringTie   transcript  533488  630489  0   -   0   transcript_id "TCONS_00012328"   gene_id "XLOC_002003"   oId "STRG.11.1"     tss_id "TSS6802"
ADD REPLY
0
Entering edit mode

So, that's the problem. There's no information that says that that is a single exon. That line relates to a 'transcript', which may or may not have a single exon - we're not to know how many exons are contained within it.

You may want to check the -C, -J, and -E command-line switches.

ADD REPLY
0
Entering edit mode

Hi Kevin, Actually this is not a single exon transcript, it has three exon but when I use gffread to get fasta sequence it gives the sequence of only third exon. There are few more such transcripts.

ADD REPLY
1
Entering edit mode

I see, can you paste the entire original transcript from the GTF, i.e., the transcript that has the 3 exons?

ADD REPLY
0
Entering edit mode

Hi Kevin,

chr1 StringTie transcript 533488 630489 0 - 0 transcript_id "TCONS_00012328" gene_id "XLOC_002003" oId "STRG.11.1" tss_id "TSS6802"

chr1 StringTie exon 533488 533516 0 - 0 transcript_id "TCONS_00012328" gene_id "XLOC_002003" exon_number "1"

chr1 StringTie exon 591734 591751 0 - 0 transcript_id "TCONS_00012328" gene_id "XLOC_002003" exon_number "2"

chr1 StringTie exon 629345 630489 0 - 0 transcript_id "TCONS_00012328" gene_id "XLOC_002003" exon_number "3"

ADD REPLY
1
Entering edit mode

Are you sure that your file is formatted correctly? I am using gffread as part of cufflinks 2.2.1.

I have different outputs from my commands, but note that I had to edit the GTF/GFF file entries that you pasted above because they are not in the correct GTF/GFF format.

cat test.gtf
chr1    StringTie   transcript  533488  630489  0   -   0   transcript_id "TCONS_00012328"; gene_id "XLOC_002003"; oId "STRG.11.1"; tss_id "TSS6802";
chr1    StringTie   exon    533488  533516  0   -   0   transcript_id "TCONS_00012328"; gene_id "XLOC_002003"; exon_number "1";
chr1    StringTie   exon    591734  591751  0   -   0   transcript_id "TCONS_00012328"; gene_id "XLOC_002003"; exon_number "2";
chr1    StringTie   exon    629345  630489  0   -   0   transcript_id "TCONS_00012328"; gene_id "XLOC_002003"; exon_number "3";

.

gffread test.gtf -o-
# gffread test.gtf -o-
##gff-version 3
chr1    StringTie   transcript  533488  630489  .   -   .   ID=TCONS_00012328;geneID=XLOC_002003
chr1    StringTie   exon    533488  533516  .   -   .   Parent=TCONS_00012328
chr1    StringTie   exon    591734  591751  .   -   .   Parent=TCONS_00012328
chr1    StringTie   exon    629345  630489  .   -   .   Parent=TCONS_00012328

.

gffread test.gtf -T -U -o-
chr1    StringTie   exon    533488  533516  .   -   .   transcript_id "TCONS_00012328"; gene_id "XLOC_002003";
chr1    StringTie   exon    591734  591751  .   -   .   transcript_id "TCONS_00012328"; gene_id "XLOC_002003";
chr1    StringTie   exon    629345  630489  .   -   .   transcript_id "TCONS_00012328"; gene_id "XLOC_002003";

Note that my -U switch does not remove these entries because they are not single exons. They form a 3-exon gene.

ADD REPLY
0
Entering edit mode

It is correctly formatted but I really do not understand why only few of such transcripts giving only sequence of one exon.

ADD REPLY
0
Entering edit mode

Is there any way to extract sequence of just one transcript, like above aformentioned transcript?

ADD REPLY
1
Entering edit mode

Yes, you can always just edit the GTF/GFF file and only keep the regions that you want, and then extract FASTA sequence over these.

It's perfectly fine to do things like that if you know exactly what you're doing, obviously. It helps to think flexibly like that because bioinformatics tools don't always function as we would expect.

ADD REPLY
1
Entering edit mode

Thank you very much Kevin!

ADD REPLY

Login before adding your answer.

Traffic: 2085 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6