Entering edit mode
4.7 years ago
kashiff007
★
1.9k
Hi,
I want to estimate the average length of each exon and intron (first 3 and last 3) from the gff/gtf file, something shown below:
exon1 500
exon2 400
exon3 300
exon-3 450
exon-2 300
exon-1 299
intron1 500
intron2 400
intron3 300
intron-3 450
intron-2 300
intron-1 299
Here exon1 is the first exon from each gene while exon-1 is the last exon from each gene. Similarly, intron1 is the first exon from each gene while intron-1 is the last exon from each gene.
I tried to search tools for this but haven't found any.
I tried to write a python code but unable to make it workable.
#!/usr/bin/python
import pandas as pd
import numpy as np
df = open("TAIR10_GFF3_genes.gff","r")
df1 = []
val = []
for line in df:
line = line.rstrip("\n")
line = line.split("\t")
if str(line[2])=="exon":
val.append(line[8])
val_unique = set(val)
print(val_unique, len(val_unique),len(val))
The input is the standard gff file:
Chr1 TAIR10 chromosome 1 30427671 . . . ID=Chr1;Name=Chr1
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1 TAIR10 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1 TAIR10 protein 3760 5630 . + . ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1 TAIR10 exon 3631 3913 . + . Parent=AT1G01010.1
Chr1 TAIR10 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 3996 4276 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3996 4276 . + 2 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 4486 4605 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 4486 4605 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 4706 5095 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 4706 5095 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 5174 5326 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 5174 5326 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 5439 5899 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 5439 5630 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 three_prime_UTR 5631 5899 . + . Parent=AT1G01010.1
Chr1 TAIR10 gene 5928 8737 . - . ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1 TAIR10 mRNA 5928 8737 . - . ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1 TAIR10 protein 6915 8666 . - . ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1 TAIR10 five_prime_UTR 8667 8737 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 8571 8666 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 exon 8571 8737 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 8417 8464 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 exon 8417 8464 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 8236 8325 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 exon 8236 8325 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7942 7987 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 exon 7942 7987 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7762 7835 . - 2 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 exon 7762 7835 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7564 7649 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 exon 7564 7649 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7384 7450 . - 1 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 exon 7384 7450 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 7157 7232 . - 0 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 exon 7157 7232 . - . Parent=AT1G01020.1
Chr1 TAIR10 CDS 6915 7069 . - 2 Parent=AT1G01020.1,AT1G01020.1-Protein;
Chr1 TAIR10 three_prime_UTR 6437 6914 . - . Parent=AT1G01020.1
Chr1 TAIR10 exon 6437 7069 . - . Parent=AT1G01020.1
Chr1 TAIR10 three_prime_UTR 5928 6263 . - . Parent=AT1G01020.1
Chr1 TAIR10 exon 5928 6263 . - . Parent=AT1G01020.1
There are quite a few issues with your code, the main one being that you are not separating your analysis by mRNA, which is necessary if you want to get the average length of the 1st, 2nd, 3rd exons etc, across all mRNAs. In general, it is better to use existing libraries for parsing specific formats, rather than "inventing the wheel" yourself. In this case, I suggest you take a look at gffutils. In general, what you'll need to do is iterate over all mRNA features, get all exons for each mRNA and aggregate lengths according to exon index (1st, 2nd, etc). At the end you can calculate the average length per index.