Transcript fasta created using bedtools is incorrect
1
0
Entering edit mode
8.3 years ago
mosquitoes • 0

I am using bedops to convert a .gff file to .bed file and then bedtools getfasta to create a transcript fasta file, but I am not getting the expected result in the transcript fasta file. The mRNA sequence in the fasta also contains introns despite using the -split option.

My .gff file looks like this:

Pv_Sal1_chr08   PlasmoDB        gene    1540281 1541168 .       +       .       ID=PVX_119350;Name=PVX_119350;description=1-cys-glutaredoxin-like+protein-1%2C+putative;size=888;web_id=PVX_119350;locus_tag=PVX_119350;size=888;Alias=156094059,14578309,A5KB62,5472321,Pv119350,PVX_119350,148801941
Pv_Sal1_chr08   PlasmoDB        mRNA    1540281 1541168 .       +       .       ID=rna_PVX_119350-1;Name=PVX_119350-1;description=PVX_119350-1;size=888;Parent=PVX_119350;Ontology_term=GO:0045454,GO:0009055,GO:0015035,GO:0015038,GO:0006118;Dbxref=ApiDB_PlasmoDB:PVX_119350,EC:1.11.1.15,NCBI_gi:14578309,NCBI_gi:148801941,NCBI_gi:156094059,taxon:126793
Pv_Sal1_chr08   PlasmoDB        CDS     1540965 1541168 .       +       0       ID=cds_PVX_119350-3;Name=cds;description=.;size=204;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        CDS     1540670 1540768 .       +       0       ID=cds_PVX_119350-2;Name=cds;description=.;size=99;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        CDS     1540284 1540487 .       +       0       ID=cds_PVX_119350-1;Name=cds;description=.;size=207;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        exon    1540281 1540487 .       +       .       ID=exon_PVX_119350-1;Name=exon;description=exon;size=207;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        exon    1540670 1540768 .       +       .       ID=exon_PVX_119350-2;Name=exon;description=exon;size=99;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        exon    1540965 1541168 .       +       .       ID=exon_PVX_119350-3;Name=exon;description=exon;size=204;Parent=rna_PVX_119350-1

And the resulting .bed file looks like this:

    Pv_Sal1_chr08   1540280 1540487 exon_PVX_119350-1       .       +       PlasmoDB        exon    .       ID=exon_PVX_119350-1;Name=exon;description=exon;size=207;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540280 1541168 PVX_119350      .       +       PlasmoDB        gene    .       ID=PVX_119350;Name=PVX_119350;description=1-cys-glutaredoxin-like+protein-1%2C+putative;size=888;web_id=PVX_119350;locus_tag=PVX_119350;size=888;Alias=156094059,14578309,A5KB62,5472321,Pv119350,PVX_119350,148801941
Pv_Sal1_chr08   1540280 1541168 rna_PVX_119350-1        .       +       PlasmoDB        mRNA    .       ID=rna_PVX_119350-1;Name=PVX_119350-1;description=PVX_119350-1;size=888;Parent=PVX_119350;Ontology_term=GO:0045454,GO:0009055,GO:0015035,GO:0015038,GO:0006118;Dbxref=ApiDB_PlasmoDB:PVX_119350,EC:1.11.1.15,NCBI_gi:14578309,NCBI_gi:148801941,NCBI_gi:1560940
Pv_Sal1_chr08   1540283 1540487 cds_PVX_119350-1        .       +       PlasmoDB        CDS     0       ID=cds_PVX_119350-1;Name=cds;description=.;size=207;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540669 1540768 cds_PVX_119350-2        .       +       PlasmoDB        CDS     0       ID=cds_PVX_119350-2;Name=cds;description=.;size=99;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540669 1540768 exon_PVX_119350-2       .       +       PlasmoDB        exon    .       ID=exon_PVX_119350-2;Name=exon;description=exon;size=99;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540964 1541168 cds_PVX_119350-3        .       +       PlasmoDB        CDS     0       ID=cds_PVX_119350-3;Name=cds;description=.;size=204;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540964 1541168 exon_PVX_119350-3       .       +       PlasmoDB        exon    .       ID=exon_PVX_119350-3;Name=exon;description=exon;size=204;Parent=rna_PVX_119350-1

If I run the code:

~/bin/bedtools2/bin/bedtools getfasta -split -name -s -fi PlasmoDB-25_PvivaxSal1_Genome.fasta -bed PlasmoDB-25_PvivaxSal1.bed

My output file is unexpected. For instance both the gene and mRNA sequence are the same despite this gene containing two introns.

>PVX_119350::Pv_Sal1_chr08:1540280-1541168(+)
ACAATGATAGTAAGAAACAGTTGTAAGATCCTCCTTCGCATGAATCGAAGCATAGTGAGG
AGAAACCTAACGTTCTTTTCTTGCAATCAGGTTTCAAAAGCGAACTTCAGCAGTTCCCTT
CAAGATAAGCAAAATGGAGGACCCGAGAAGAAGATAAGCAACGGCTCGGACGACTTTAAA
GACTTTGAAAAGTCGGACGTGTACCAGGTTAACCCAGGGAGGGAAATGAGAGGGTGCGAA
TGAGATGGCACCTATGAGATGGTGCAATTTTCATTTTTAAAAATGATGCATCGTTTCCTA
TGTAAGCGCATTTTATTGGTGCCCAGCAGTTTGCCTCTATGCTAAATGACTCATGTCTGC
TTTTAACCCTTTTTGCTCCCCCTCCGTAGACTCTGAAGGGAAAAATTAAGGAAATCCTTG
AGAAGGAGAAAATTGTGCTGTTTATGAAGGGCACCCCGGAAAGTCCCCTGTGCGGGTTTA
GCGCAAAGGTGAGTTTCATGTATCTGCCGTGGGGTGGTTTTGTCCAACTGACCCCAGTTT
TTCCCACGTTTGAGAAACTTACATTATTGTTTTTTGGACTAATATGCACCCAGGAGAGGC
ACATTTTAGTATGCCCATTTTGTGCAGTGGCTAGCTTTATTATTTTTTTTTTATTTTTTT
TTTTTTCCCCTTTCCCCTGCGCAGGTGGTCCACATACTGAATAGTATGAACGTGAAGGAC
TACGTATACATCGACGTGATGAAAAACAACAACTTGCGCGAGGCGATAAAGATTTACAGC
AACTGGCCCTACATCCCCCACTTATACGTAAATAACAACTTCATCGGTGGCTACGACATC
ATCTCAGATTTGTACACCAGTGGAGAGCTACAGGGGCTGGTCAAGTAA
>rna_PVX_119350-1::Pv_Sal1_chr08:1540280-1541168(+)
ACAATGATAGTAAGAAACAGTTGTAAGATCCTCCTTCGCATGAATCGAAGCATAGTGAGG
AGAAACCTAACGTTCTTTTCTTGCAATCAGGTTTCAAAAGCGAACTTCAGCAGTTCCCTT
CAAGATAAGCAAAATGGAGGACCCGAGAAGAAGATAAGCAACGGCTCGGACGACTTTAAA
GACTTTGAAAAGTCGGACGTGTACCAGGTTAACCCAGGGAGGGAAATGAGAGGGTGCGAA
TGAGATGGCACCTATGAGATGGTGCAATTTTCATTTTTAAAAATGATGCATCGTTTCCTA
TGTAAGCGCATTTTATTGGTGCCCAGCAGTTTGCCTCTATGCTAAATGACTCATGTCTGC
TTTTAACCCTTTTTGCTCCCCCTCCGTAGACTCTGAAGGGAAAAATTAAGGAAATCCTTG
AGAAGGAGAAAATTGTGCTGTTTATGAAGGGCACCCCGGAAAGTCCCCTGTGCGGGTTTA
GCGCAAAGGTGAGTTTCATGTATCTGCCGTGGGGTGGTTTTGTCCAACTGACCCCAGTTT
TTCCCACGTTTGAGAAACTTACATTATTGTTTTTTGGACTAATATGCACCCAGGAGAGGC
ACATTTTAGTATGCCCATTTTGTGCAGTGGCTAGCTTTATTATTTTTTTTTTATTTTTTT
TTTTTTCCCCTTTCCCCTGCGCAGGTGGTCCACATACTGAATAGTATGAACGTGAAGGAC
TACGTATACATCGACGTGATGAAAAACAACAACTTGCGCGAGGCGATAAAGATTTACAGC
AACTGGCCCTACATCCCCCACTTATACGTAAATAACAACTTCATCGGTGGCTACGACATC
ATCTCAGATTTGTACACCAGTGGAGAGCTACAGGGGCTGGTCAAGTAA
bedtools bedops getfasta gffutils bed12 • 2.3k views
ADD COMMENT
0
Entering edit mode

I've tried gffutils to create a bed12 file from the gff.

I'm trying to convert a .gff file to a .bed12 file format using gffutils. I'm missing something about the syntax for the command though:

import gffutils
from sys import argv

script, database = argv

db = gffutils.FeatureDB(database)

db.bed12(block_featuretype=['exon'], thick_featuretype=['CDS'], thin_featuretype=None)

where database is correctly created gffutils.db. I get the error " TypeError: bed12() takes at least 2 arguments (2 given)"

ADD REPLY
0
Entering edit mode
8.3 years ago

The BED file that bedops is producing is only sort of a BED file (it's apparently BED6 with extra information). For example, it doesn't contain exon information in mRNA entries, so it's impossible for bedtools to do what you want. Your best bet is to use R with rtracklayer and GenomicFeatures. Those can be combined to do what you want.

ADD COMMENT
0
Entering edit mode

Thanks Devon, but I would like to solve this problem in python or bash.

ADD REPLY

Login before adding your answer.

Traffic: 2012 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6