I need to find the 3' UTR of each gene to predict miRNA targets, but I only have a GFF file without any UTR annotations. Is there a tool I can use to identify 3' UTR regions?
(A genome assembly and RNA-seq data are available.)
Here is an example from my current GFF file (the CDS and exon features start and end at the same positions as the gene and mRNA, so there is no UTR):
Bany_Scaf21 B_anynana_v2 gene 6013946 6020193 . - . ID=BANY.1.2.t00009.path1;Name=BANY.1.2.t00009
Bany_Scaf21 B_anynana_v2 mRNA 6013946 6020193 . - . ID=BANY.1.2.t00009.mrna1;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.path1;coverage=99.1;identity=100.0;matches=229;mismatches=0;indels=0;unknowns=0
Bany_Scaf21 B_anynana_v2 exon 6020095 6020193 100 - . ID=BANY.1.2.t00009.mrna1.exon1;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.mrna1;Target=BANY.1.2.t00009 3 101 +
Bany_Scaf21 B_anynana_v2 exon 6016799 6016862 100 - . ID=BANY.1.2.t00009.mrna1.exon2;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.mrna1;Target=BANY.1.2.t00009 102 165 +
Bany_Scaf21 B_anynana_v2 exon 6014273 6014317 100 - . ID=BANY.1.2.t00009.mrna1.exon3;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.mrna1;Target=BANY.1.2.t00009 166 210 +
Bany_Scaf21 B_anynana_v2 exon 6013946 6013966 100 - . ID=BANY.1.2.t00009.mrna1.exon4;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.mrna1;Target=BANY.1.2.t00009 211 231 +
Bany_Scaf21 B_anynana_v2 CDS 6020095 6020192 100 - 0 ID=BANY.1.2.t00009.mrna1.cds1;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.mrna1;Target=BANY.1.2.t00009 4 101 +
Bany_Scaf21 B_anynana_v2 CDS 6016799 6016862 100 - 2 ID=BANY.1.2.t00009.mrna1.cds2;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.mrna1;Target=BANY.1.2.t00009 102 165 +
Bany_Scaf21 B_anynana_v2 CDS 6014273 6014317 100 - 0 ID=BANY.1.2.t00009.mrna1.cds3;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.mrna1;Target=BANY.1.2.t00009 166 210 +
Bany_Scaf21 B_anynana_v2 CDS 6013946 6013966 100 - 0 ID=BANY.1.2.t00009.mrna1.cds4;Name=BANY.1.2.t00009;Parent=BANY.1.2.t00009.mrna1;Target=BANY.1.2.t00009 211 231 +
Which genome/organism is this?
Hi Genomax, it's Bicyclus anynana (a butterfly), a non-model organism. I don't think I can download an annotation with UTRs for it from a database.
Hi,
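It's more of a vague idea: you could try to find the polyadenylation signal (PAS) downstream of each CDS and define the region between the end of the CDS and that signal as the 3' UTR. The signal is usually an A-rich hexamer (canonically AAUAAA). To limit the search space, you could take the 3' UTR length distribution of a related species and only search within a plausible distance of the stop codon.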
Cheers,
Michael
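To make the PAS idea above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a tested pipeline: the function names, the hexamer list (AATAAA plus the common ATTAAA variant), the `max_len` search window and the `tail` padding past the hexamer are all hypothetical choices. It takes the first hexamer hit downstream of the CDS, which will produce false positives because such hexamers also occur by chance.

```python
# Hypothetical sketch: guess a 3' UTR by scanning downstream of the CDS
# for a polyadenylation-signal hexamer. Names and defaults are assumptions.

# Canonical PAS and one common variant, written in the DNA alphabet.
PAS_HEXAMERS = ("AATAAA", "ATTAAA")

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def downstream_of_cds(scaffold, cds_start, cds_end, strand, max_len):
    """Return up to max_len bases following the CDS, in transcript orientation.

    cds_start and cds_end are the lowest and highest CDS coordinates on the
    scaffold, 1-based and inclusive as in GFF. max_len is the search window,
    e.g. taken from the 3' UTR length distribution of a related species.
    """
    if strand == "+":
        return scaffold[cds_end:cds_end + max_len].upper()
    # Minus strand: the 3' UTR lies at coordinates below the CDS start.
    lo = max(0, cds_start - 1 - max_len)
    return reverse_complement(scaffold[lo:cds_start - 1]).upper()


def putative_utr_length(downstream, tail=20):
    """Return a putative 3' UTR length, or None if no hexamer is found.

    The UTR is taken to run from the first base after the CDS to `tail`
    bases past the first PAS hexamer (tail is an assumed allowance for the
    cleavage site lying somewhat downstream of the signal).
    """
    hits = [downstream.find(h) for h in PAS_HEXAMERS]
    hits = [h for h in hits if h >= 0]
    if not hits:
        return None
    return min(min(hits) + 6 + tail, len(downstream))
```

For the example transcript in the question, which is on the minus strand with its CDS spanning 6013946 to 6020192, the call would be `downstream_of_cds(scaffold_seq, 6013946, 6020192, "-", max_len)`, so the search covers coordinates below 6013946. Since RNA-seq data are available, read coverage past the stop codon would be a useful check on any UTR end predicted this way.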