infer 3' and 5' UTR from gff3 file
8.3 years ago
Chris ▴ 30

Hi, I have a GFF3 file with info: CDS, exon, intron and parent genes.

I am interested in defining the 5' and 3' UTR regions. I can use bedtools subtract to get the UTR regions, but I am not sure I am on the right track, since I cannot take the orientation of the genes into consideration and therefore cannot tell 5' from 3' UTRs.

If you have any alternative solution, I would highly appreciate it.

Thank you all for your help!

So far I have used: awk '/exon/ {print $0}' all_maker_genes.gff > all_genes_exons.gff and awk '/CDS/ {print $0}' all_maker_genes.gff > all_genes_CDS.gff followed by

bedtools subtract -a all_genes_exons.gff -b all_genes_CDS.gff > UTR_regions.gff

LperrChr01 ensembl chromosome 1 32922458 . . . ID=LperrChr01;Name=chromosome:v1.2-2013Aug:1:1:32922458:1
LperrChr01 ensembl gene 5300 10998 . + . ID=Lperr01g00010;Name=Lperr01g00010;biotype=protein_coding
LperrChr01 ensembl mRNA 5300 10998 . + . ID=Lperr01g00010.1;Parent=Lperr01g00010;Name=Lperr01g00010.1;biotype=protein_coding
LperrChr01 ensembl intron 5519 5662 . + . Parent=Lperr01g00010.1;Name=intron.1
LperrChr01 ensembl intron 5773 5857 . + . Parent=Lperr01g00010.1;Name=intron.2
LperrChr01 ensembl intron 5916 6451 . + . Parent=Lperr01g00010.1;Name=intron.3
LperrChr01 ensembl intron 6716 6784 . + . Parent=Lperr01g00010.1;Name=intron.4
LperrChr01 ensembl intron 6923 6998 . + . Parent=Lperr01g00010.1;Name=intron.5
LperrChr01 ensembl intron 7051 7126 . + . Parent=Lperr01g00010.1;Name=intron.6
LperrChr01 ensembl intron 7207 8676 . + . Parent=Lperr01g00010.1;Name=intron.7
LperrChr01 ensembl intron 8737 9649 . + . Parent=Lperr01g00010.1;Name=intron.8
LperrChr01 ensembl intron 9731 9896 . + . Parent=Lperr01g00010.1;Name=intron.9
LperrChr01 ensembl intron 10465 10782 . + . Parent=Lperr01g00010.1;Name=intron.10
LperrChr01 ensembl exon 5300 5518 . + . Parent=Lperr01g00010.1;Name=exon.11
LperrChr01 ensembl exon 5663 5772 . + . Parent=Lperr01g00010.1;Name=exon.12
LperrChr01 ensembl exon 5858 5915 . + . Parent=Lperr01g00010.1;Name=exon.13
LperrChr01 ensembl exon 6452 6715 . + . Parent=Lperr01g00010.1;Name=exon.14
LperrChr01 ensembl exon 6785 6922 . + . Parent=Lperr01g00010.1;Name=exon.15
LperrChr01 ensembl exon 6999 7050 . + . Parent=Lperr01g00010.1;Name=exon.16
LperrChr01 ensembl exon 7127 7206 . + . Parent=Lperr01g00010.1;Name=exon.17
LperrChr01 ensembl exon 8677 8736 . + . Parent=Lperr01g00010.1;Name=exon.18
LperrChr01 ensembl exon 9650 9730 . + . Parent=Lperr01g00010.1;Name=exon.19
LperrChr01 ensembl exon 9897 10464 . + . Parent=Lperr01g00010.1;Name=exon.20
LperrChr01 ensembl exon 10783 10998 . + . Parent=Lperr01g00010.1;Name=exon.21
LperrChr01 ensembl CDS 5300 5518 . + 0 Parent=Lperr01g00010.1;Name=CDS.22
LperrChr01 ensembl CDS 5663 5772 . + 0 Parent=Lperr01g00010.1;Name=CDS.23
LperrChr01 ensembl CDS 5858 5915 . + 2 Parent=Lperr01g00010.1;Name=CDS.24
LperrChr01 ensembl CDS 6452 6715 . + 0 Parent=Lperr01g00010.1;Name=CDS.25
LperrChr01 ensembl CDS 6785 6922 . + 0 Parent=Lperr01g00010.1;Name=CDS.26
LperrChr01 ensembl CDS 6999 7050 . + 0 Parent=Lperr01g00010.1;Name=CDS.27
LperrChr01 ensembl CDS 7127 7206 . + 1 Parent=Lperr01g00010.1;Name=CDS.28
LperrChr01 ensembl CDS 8677 8736 . + 0 Parent=Lperr01g00010.1;Name=CDS.29
LperrChr01 ensembl CDS 9650 9730 . + 0 Parent=Lperr01g00010.1;Name=CDS.30
LperrChr01 ensembl CDS 9897 9989 . + 0 Parent=Lperr01g00010.1;Name=CDS.31
LperrChr01 ensembl gene 11136 21882 . + . ID=Lperr01g00020;Name=Lperr01g00020;biotype=protein_coding
LperrChr01 ensembl mRNA 11149 21882 . + . ID=Lperr01g00020.1;Parent=Lperr01g00020;Name=Lperr01g00020.1;biotype=protein_coding
LperrChr01 ensembl intron 11455 11523 . + . Parent=Lperr01g00020.1;Name=intron.33
LperrChr01 ensembl intron 11744 12385 . + . Parent=Lperr01g00020.1;Name=intron.34
LperrChr01 ensembl intron 12485 13421 . + . Parent=Lperr01g00020.1;Name=intron.35
LperrChr01 ensembl intron 13526 15162 . + . Parent=Lperr01g00020.1;Name=intron.36
LperrChr01 ensembl intron 15972 16065 . + . Parent=Lperr01g00020.1;Name=intron.37
LperrChr01 ensembl intron 16189 16276 . + . Parent=Lperr01g00020.1;Name=intron.38
LperrChr01 ensembl intron 16366 16453 . + . Parent=Lperr01g00020.1;Name=intron.39
LperrChr01 ensembl intron 16655 17061 . + . Parent=Lperr01g00020.1;Name=intron.40
LperrChr01 ensembl intron 17416 17675 . + . Parent=Lperr01g00020.1;Name=intron.41
LperrChr01 ensembl intron 17760 18102 . + . Parent=Lperr01g00020.1;Name=intron.42
LperrChr01 ensembl intron 18189 18829 . + . Parent=Lperr01g00020.1;Name=intron.43
LperrChr01 ensembl intron 20276 20361 . + . Parent=Lperr01g00020.1;Name=intron.44
LperrChr01 ensembl intron 20441 21005 . + . Parent=Lperr01g00020.1;Name=intron.45
LperrChr01 ensembl intron 21209 21301 . + . Parent=Lperr01g00020.1;Name=intron.46
LperrChr01 ensembl exon 11149 11454 . + . Parent=Lperr01g00020.1;Name=exon.47
LperrChr01 ensembl exon 11524 11743 . + . Parent=Lperr01g00020.1;Name=exon.48
LperrChr01 ensembl exon 12386 12484 . + . Parent=Lperr01g00020.1;Name=exon.49
LperrChr01 ensembl exon 13422 13525 . + . Parent=Lperr01g00020.1;Name=exon.50
LperrChr01 ensembl exon 15163 15971 . + . Parent=Lperr01g00020.1;Name=exon.51
LperrChr01 ensembl exon 16066 16188 . + . Parent=Lperr01g00020.1;Name=exon.52
LperrChr01 ensembl exon 16277 16365 . + . Parent=Lperr01g00020.1;Name=exon.53
LperrChr01 ensembl exon 16454 16654 . + . Parent=Lperr01g00020.1;Name=exon.54
LperrChr01 ensembl exon 17062 17415 . + . Parent=Lperr01g00020.1;Name=exon.55
LperrChr01 ensembl exon 17676 17759 . + . Parent=Lperr01g00020.1;Name=exon.56
LperrChr01 ensembl exon 18103 18188 . + . Parent=Lperr01g00020.1;Name=exon.57
LperrChr01 ensembl exon 18830 20275 . + . Parent=Lperr01g00020.1;Name=exon.58
LperrChr01 ensembl exon 20362 20440 . + . Parent=Lperr01g00020.1;Name=exon.59
LperrChr01 ensembl exon 21006 21208 . + . Parent=Lperr01g00020.1;Name=exon.60
LperrChr01 ensembl exon 21302 21882 . + . Parent=Lperr01g00020.1;Name=exon.61
LperrChr01 ensembl CDS 11582 11743 . + . Parent=Lperr01g00020.1;Name=CDS.63
LperrChr01 ensembl CDS 12386 12484 . + . Parent=Lperr01g00020.1;Name=CDS.64
LperrChr01 ensembl CDS 13422 13525 . + 2 Parent=Lperr01g00020.1;Name=CDS.65
LperrChr01 ensembl CDS 15163 15971 . + 1 Parent=Lperr01g00020.1;Name=CDS.66
LperrChr01 ensembl CDS 16066 16188 . + 0 Parent=Lperr01g00020.1;Name=CDS.67
LperrChr01 ensembl CDS 16277 16365 . + 0 Parent=Lperr01g00020.1;Name=CDS.68
LperrChr01 ensembl CDS 16454 16654 . + 2 Parent=Lperr01g00020.1;Name=CDS.69
LperrChr01 ensembl CDS 17062 17415 . + 2 Parent=Lperr01g00020.1;Name=CDS.70
LperrChr01 ensembl CDS 17676 17759 . + 2 Parent=Lperr01g00020.1;Name=CDS.71
LperrChr01 ensembl CDS 18103 18188 . + 2 Parent=Lperr01g00020.1;Name=CDS.72
LperrChr01 ensembl CDS 18830 20275 . + 1 Parent=Lperr01g00020.1;Name=CDS.73
LperrChr01 ensembl CDS 20362 20440 . + 1 Parent=Lperr01g00020.1;Name=CDS.74
LperrChr01 ensembl CDS 21006 21208 . + 2 Parent=Lperr01g00020.1;Name=CDS.75
LperrChr01 ensembl CDS 21302 21395 . + 1 Parent=Lperr01g00020.1;Name=CDS.76
LperrChr01 ensembl gene 33568 35397 . + . ID=Lperr01g00050;Name=Lperr01g00050;biotype=protein_coding
LperrChr01 ensembl mRNA 33568 35397 . + . ID=Lperr01g00050.1;Parent=Lperr01g00050;Name=Lperr01g00050.1;biotype=protein_coding
LperrChr01 ensembl intron 33720 33785 . + . Parent=Lperr01g00050.1;Name=intron.196
LperrChr01 ensembl intron 34058 34419 . + . Parent=Lperr01g00050.1;Name=intron.197
LperrChr01 ensembl intron 34624 34703 . + . Parent=Lperr01g00050.1;Name=intron.198
LperrChr01 ensembl exon 33568 33719 . + . Parent=Lperr01g00050.1;Name=exon.199
LperrChr01 ensembl exon 33786 34057 . + . Parent=Lperr01g00050.1;Name=exon.200
LperrChr01 ensembl exon 34420 34623 . + . Parent=Lperr01g00050.1;Name=exon.201
LperrChr01 ensembl exon 34704 35397 . + . Parent=Lperr01g00050.1;Name=exon.202
LperrChr01 ensembl CDS 33648 33719 . + . Parent=Lperr01g00050.1;Name=CDS.203
LperrChr01 ensembl CDS 33786 34057 . + . Parent=Lperr01g00050.1;Name=CDS.204
LperrChr01 ensembl CDS 34420 34623 . + 1 Parent=Lperr01g00050.1;Name=CDS.205
LperrChr01 ensembl CDS 34704 34758 . + 1 Parent=Lperr01g00050.1;Name=CDS.206
LperrChr01 ensembl gene 48331 51794 . - . ID=Lperr01g00080;Name=Lperr01g00080;biotype=protein_coding
LperrChr01 ensembl mRNA 48331 51794 . - . ID=Lperr01g00080.1;Parent=Lperr01g00080;Name=Lperr01g00080.1;biotype=protein_coding
LperrChr01 ensembl intron 51236 51459 . - . Parent=Lperr01g00080.1;Name=intron.472
LperrChr01 ensembl intron 50965 51051 . - . Parent=Lperr01g00080.1;Name=intron.473
LperrChr01 ensembl intron 50612 50713 . - . Parent=Lperr01g00080.1;Name=intron.474
LperrChr01 ensembl intron 50280 50496 . - . Parent=Lperr01g00080.1;Name=intron.475
LperrChr01 ensembl intron 49888 50186 . - . Parent=Lperr01g00080.1;Name=intron.476
LperrChr01 ensembl intron 49651 49743 . - . Parent=Lperr01g00080.1;Name=intron.477
LperrChr01 ensembl intron 48499 49554 . - . Parent=Lperr01g00080.1;Name=intron.478
LperrChr01 ensembl exon 51460 51794 . - . Parent=Lperr01g00080.1;Name=exon.479
LperrChr01 ensembl exon 51052 51235 . - . Parent=Lperr01g00080.1;Name=exon.480
LperrChr01 ensembl exon 50714 50964 . - . Parent=Lperr01g00080.1;Name=exon.481
LperrChr01 ensembl exon 50497 50611 . - . Parent=Lperr01g00080.1;Name=exon.482
LperrChr01 ensembl exon 50187 50279 . - . Parent=Lperr01g00080.1;Name=exon.483
LperrChr01 ensembl exon 49744 49887 . - . Parent=Lperr01g00080.1;Name=exon.484
LperrChr01 ensembl exon 49555 49650 . - . Parent=Lperr01g00080.1;Name=exon.485
LperrChr01 ensembl exon 48331 48498 . - . Parent=Lperr01g00080.1;Name=exon.486
LperrChr01 ensembl CDS 51460 51794 . - 0 Parent=Lperr01g00080.1;Name=CDS.487
LperrChr01 ensembl CDS 51052 51235 . - 2 Parent=Lperr01g00080.1;Name=CDS.488
LperrChr01 ensembl CDS 50714 50964 . - 0 Parent=Lperr01g00080.1;Name=CDS.489
LperrChr01 ensembl CDS 50497 50611 . - 2 Parent=Lperr01g00080.1;Name=CDS.490
LperrChr01 ensembl CDS 50187 50279 . - 0 Parent=Lperr01g00080.1;Name=CDS.491
LperrChr01 ensembl CDS 49744 49887 . - 0 Parent=Lperr01g00080.1;Name=CDS.492
LperrChr01 ensembl CDS 49555 49650 . - 0 Parent=Lperr01g00080.1;Name=CDS.493
LperrChr01 ensembl CDS 48331 48498 . - 0 Parent=Lperr01g00080.1;Name=CDS.494
LperrChr01 ensembl gene 97802 98541 . - . ID=Lperr01g00160;Name=Lperr01g00160;biotype=protein_coding
LperrChr01 ensembl mRNA 97802 98541 . - . ID=Lperr01g00160.1;Parent=Lperr01g00160;Name=Lperr01g00160.1;biotype=protein_coding
LperrChr01 ensembl intron 98144 98308 . - . Parent=Lperr01g00160.1;Name=intron.750
LperrChr01 ensembl intron 97884 97984 . - . Parent=Lperr01g00160.1;Name=intron.751
LperrChr01 ensembl exon 98309 98541 . - . Parent=Lperr01g00160.1;Name=exon.752
LperrChr01 ensembl exon 97985 98143 . - . Parent=Lperr01g00160.1;Name=exon.753
LperrChr01 ensembl exon 97802 97883 . - . Parent=Lperr01g00160.1;Name=exon.754
LperrChr01 ensembl CDS 98309 98541 . - 0 Parent=Lperr01g00160.1;Name=CDS.755
LperrChr01 ensembl CDS 97985 98143 . - 2 Parent=Lperr01g00160.1;Name=CDS.756
LperrChr01 ensembl CDS 97802 97883 . - 2 Parent=Lperr01g00160.1;Name=CDS.757

next-gen genome utr gff3 • 4.4k views

This works only in the rare cases where the GFF3 is perfect. I typically use the gff3ToGenePred utility (available here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ ) to convert GFF3 to genePred format, and then another utility, genePredToBed (also available there), to convert to BED. After that it is much easier to extract specific regions. But success has been rare and depends on how well formatted the GFF3 is.


Hi, I have tried this approach. Although I am setting -maxParseErrors=-1 -maxConvertErrors=-1, which is what the manual says to use to avoid errors, I still get errors and my job never finishes...

Any help would be great!

Thanks


Sorry that I am avoiding parsing the GFF3 file. I presume that the genome is Leersia perrieri and the GFF3 file is from Ensembl. You could try Ensembl Plants -> BioMart -> Leersia perrieri, then select Attributes -> Sequences from the left-hand menu -> select 5' UTR or 3' UTR -> in the header information select 5' UTR start and end. The resulting FASTA header will contain the coordinates of the UTRs, and they can be parsed out.


I have already tried that, but it does not really work since not many UTR regions are annotated. On top of that, I have several more species that I need to do this for.


In plants that are not well studied, that is very common. Not many UTRs are known, even if you have a BED file. The transcription start and end would be the same as the CDS start and end, so no UTRs are annotated.
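
As a quick check of that situation, here is a minimal Python sketch that compares the exon span of a transcript with its CDS span. The function names `span` and `has_annotated_utr` are made up for illustration; the coordinates are copied from the GFF3 excerpt in the question.

```python
def span(intervals):
    """Return (leftmost start, rightmost end) of a list of (start, end) pairs."""
    return min(s for s, _ in intervals), max(e for _, e in intervals)


def has_annotated_utr(exons, cds):
    """True if the exons extend beyond the CDS on at least one side."""
    return span(exons) != span(cds)


# Lperr01g00160.1 from the question: exons and CDS are identical.
exons_160 = [(98309, 98541), (97985, 98143), (97802, 97883)]
cds_160 = [(98309, 98541), (97985, 98143), (97802, 97883)]

# Lperr01g00050.1 from the question: exons extend past the CDS on both sides.
exons_050 = [(33568, 33719), (33786, 34057), (34420, 34623), (34704, 35397)]
cds_050 = [(33648, 33719), (33786, 34057), (34420, 34623), (34704, 34758)]

print(has_annotated_utr(exons_160, cds_160))  # False
print(has_annotated_utr(exons_050, cds_050))  # True
```

Counting how many transcripts return False gives a rough idea of how much UTR annotation a given file actually carries before trying to extract it.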

8.3 years ago

The canon-gff3 tool from the AEGeAn Toolkit will infer UTRs (and start/stop codons) from gene structures such as those you describe.
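
To illustrate the kind of inference involved, here is a minimal Python sketch. It is not canon-gff3's implementation, and the function names `read_features` and `infer_utrs` are made up for this example. It assumes each exon/CDS feature has a single Parent, the strand is either + or -, and coordinates are 1-based inclusive as in GFF3. Exon parts that lie outside the CDS span are UTR; the strand decides which side is 5' and which is 3'.

```python
from collections import defaultdict


def read_features(gff_lines):
    """Group exon and CDS intervals by parent transcript ID."""
    transcripts = defaultdict(lambda: {"strand": None, "exon": [], "CDS": []})
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Real GFF3 is tab-delimited; fall back to whitespace for pasted text.
        cols = line.rstrip("\n").split("\t") if "\t" in line else line.split()
        if len(cols) < 9 or cols[2] not in ("exon", "CDS"):
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        rec = transcripts[parent]
        rec["strand"] = cols[6]
        rec[cols[2]].append((int(cols[3]), int(cols[4])))
    return transcripts


def infer_utrs(exons, cds, strand):
    """Return (five_prime, three_prime) lists of (start, end) intervals."""
    if not cds:
        return [], []  # no CDS, so nothing to call a UTR here
    cds_start = min(s for s, _ in cds)
    cds_end = max(e for _, e in cds)
    left, right = [], []
    for start, end in sorted(exons):
        if start < cds_start:  # exon part before the first coding base
            left.append((start, min(end, cds_start - 1)))
        if end > cds_end:  # exon part after the last coding base
            right.append((max(start, cds_end + 1), end))
    # On the minus strand the genomic right-hand side is the 5' end.
    return (left, right) if strand == "+" else (right, left)


# Exon and CDS lines of Lperr01g00050.1, copied from the question.
sample = """\
LperrChr01 ensembl exon 33568 33719 . + . Parent=Lperr01g00050.1;Name=exon.199
LperrChr01 ensembl exon 33786 34057 . + . Parent=Lperr01g00050.1;Name=exon.200
LperrChr01 ensembl exon 34420 34623 . + . Parent=Lperr01g00050.1;Name=exon.201
LperrChr01 ensembl exon 34704 35397 . + . Parent=Lperr01g00050.1;Name=exon.202
LperrChr01 ensembl CDS 33648 33719 . + . Parent=Lperr01g00050.1;Name=CDS.203
LperrChr01 ensembl CDS 33786 34057 . + . Parent=Lperr01g00050.1;Name=CDS.204
LperrChr01 ensembl CDS 34420 34623 . + 1 Parent=Lperr01g00050.1;Name=CDS.205
LperrChr01 ensembl CDS 34704 34758 . + 1 Parent=Lperr01g00050.1;Name=CDS.206
""".splitlines()

for tid, rec in read_features(sample).items():
    five, three = infer_utrs(rec["exon"], rec["CDS"], rec["strand"])
    print(tid, "5'UTR:", five, "3'UTR:", three)
# Lperr01g00050.1 5'UTR: [(33568, 33647)] 3'UTR: [(34759, 35397)]
```

Working per transcript, rather than subtracting all CDS from all exons at once, keeps overlapping genes and alternative transcripts from interfering with each other.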
