Question

python. Extracting start and end positions per gene from gff

0

Entering edit mode

10.4 years ago

Karyo ▴ 10

Hi, I have a madeup GFF file with 4 distinguished gene features.

They are:

gene id with both start and end postion information
gene id without start_codon information
gene id without end_codon information
gene id with neither information

scaffold_1 JGI exon 9377 9561 . + . fungene_21462
scaffold_1 JGI CDS 9456 9561 . + 0 fungene_21462
scaffold_1 JGI start_codon 9456 9458 . + 0 fungene_21462
scaffold_1 JGI exon 9719 9942 . + . fungene_21462
scaffold_1 JGI CDS 9719 9942 . + 2 fungene_21462
scaffold_1 JGI exon 9997 10406 . + . fungene_21462
scaffold_1 JGI CDS 9997 10406 . + 0 fungene_21462
scaffold_1 JGI exon 10440 11103 . + . fungene_21462
scaffold_1 JGI CDS 10440 10845 . + 1 fungene_21462
scaffold_1 JGI stop_codon 10843 10845 . + 0 fungene_21462

scaffold_2 JGI exon 19377 19561 . + . fungene_20001
scaffold_2 JGI CDS 19456 19561 . + 0 fungene_20001
scaffold_2 JGI exon 19719 19942 . + . fungene_20001
scaffold_2 JGI CDS 19719 19942 . + 2 fungene_20001
scaffold_2 JGI exon 19997 110406 . + . fungene_20001
scaffold_2 JGI CDS 19997 20406 . + 0 fungene_20001
scaffold_2 JGI exon 20440 21103 . + . fungene_20001
scaffold_2 JGI CDS 20440 20845 . + 1 fungene_20001
scaffold_2 JGI stop_codon 20843 20845 . + 0 fungene_20001

scaffold_3 JGI exon 29377 29561 . + . fungene_3002
scaffold_3 JGI CDS 29456 29561 . + 0 fungene_3002
scaffold_3 JGI start_codon 29456 29458 . + 0 fungene_3002
scaffold_3 JGI exon 29719 29942 . + . fungene_3002
scaffold_3 JGI CDS 29719 29942 . + 2 fungene_3002
scaffold_3 JGI exon 29997 30406 . + . fungene_3002
scaffold_3 JGI CDS 29997 30406 . + 0 fungene_3002
scaffold_3 JGI exon 30440 31103 . + . fungene_3002
scaffold_3 JGI CDS 30440 30845 . + 1 fungene_3002

scaffold_4 JGI exon 49377 49561 . + . fungene_33333
scaffold_4 JGI CDS 49456 49561 . + 0 fungene_33333
scaffold_4 JGI exon 49719 49942 . + . fungene_33333
scaffold_4 JGI CDS 49719 49942 . + 2 fungene_33333
scaffold_4 JGI exon 49997 50406 . + . fungene_33333
scaffold_4 JGI CDS 49997 50406 . + 0 fungene_33333
scaffold_4 JGI exon 50440 51103 . + . fungene_33333
scaffold_4 JGI CDS 50440 50845 . + 1 fungene_33333

But, I want to make the file like each gene id with start_codon start position and end_codon end position in a line when the information is available. However, when information lack, I want the very first or last exon position information to take the place like the format below.

fungene_21462  start_pos= 9456    end_pos= 10845
fungene_20001  start_pos= 19377  end_pos= 20845
fungene_3002   start_pos= 29456  end_pos= 31103
fungene_33333  start_pos= 49377  end_pos= 51103

Simply, can anybody show me a python code to extract start and end position information per gene, please?

genome python gff • 4.5k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Karyo ▴ 10

1

Entering edit mode

What have you tried? While one of us could write a little script to just extract the information, it's usually preferable for you to at least make a first attempt to do so yourself.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

I have written down a very long script reading various data, which makes me hard to cut the script from the entire code. So I have made a simplified version of question, just extracting position information per gene. Any tools and cod, I will try to apply on my code. Thank you for your interests.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by Karyo ▴ 10

1

Entering edit mode

The general idea is to just memoize a given gene, its strand and its 5' and 3' extremes. Once a new gene is encountered, the start and stop position of the previous gene is then the memoized extremes, where the strand determines which is the start and which the stop. Assuming you already know how to parse the GFF format and iterate through the records, this should be relatively straight forward.

BTW, it's simplest to just use CDS features for all of this, since that should be present for protein-coding genes.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

The problem with CDS is it does not include the stop condon triplets, but truncates the stop codon. I am looking for a simple parser which finds out a gene start and end (until the last digit of a stop codon).

ADD REPLY • link 10.4 years ago by Karyo ▴ 10

0

Entering edit mode

In the examples you posted, the stop codon is contained within the CDS. Even if it weren't then you'd just need to increment or decrement a coordinate by 3.

And anyway, if you just want to substitute the exon bounds instead then just do that and also keep track of whether the start/stop codons were seen. This is just a simple bit of looping.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k