Hello,
I am trying to extract information from a GFF file using the gffutils package (https://daler.github.io/gffutils/) in Python. After creating and loading the local database from the GFF file, I want to extract the CDS entries (start and stop positions) for every gene. However, I noticed that some genes have multiple CDS entries, where two or more of them share the same start and/or stop position.
Example: gene X has two CDS entries. CDS_1: start 75221, stop 76890 (transcript ID ABC1.1); CDS_2: start 75221, stop 76908 (transcript ID ABC1.2).
CDS_2 covers 18 bp more than CDS_1, so my instinct is to delete CDS_1 and consider only CDS_2 for the mRNA. Am I correct in assuming so? Also, how do you delete such potentially obsolete or duplicate entries? I read through the package documentation but could not find a solution.
Thanks in advance!
@Juke-34 many thanks!
I guess this should help. Just out of curiosity: I see this package is written in Perl. Do you have any suggestions for Python? I could always trim the short isoforms from the file first (thanks for correcting me), then load it in Python and perform the next tasks, but it would be great if I could do the whole thing in Python too!
It is available as a conda package, so you don't need to care that it is written in Perl ;)
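For a Python-only route, here is a minimal sketch of the "keep the longest isoform" selection discussed above. It is an illustration under stated assumptions, not the method used by any particular tool: the function name `longest_isoform` and the `(transcript_id, start, end)` tuple layout are hypothetical, coordinates are taken to be 1-based and inclusive as in GFF, and "longest" is taken to mean the greatest summed CDS length per transcript.

```python
from collections import defaultdict


def longest_isoform(cds_records):
    """Return (transcript_id, total_cds_length) for the longest isoform.

    cds_records: iterable of (transcript_id, start, end) tuples using
    1-based inclusive coordinates. This tuple layout is an assumption
    made for the sketch. With gffutils, such tuples could be built from
    db.children(gene, featuretype="CDS") using each feature's .start and
    .end, with the transcript taken from its Parent attribute (this
    depends on how your file is annotated).
    """
    totals = defaultdict(int)
    for transcript_id, start, end in cds_records:
        # A multi-exon transcript has several CDS segments; sum them.
        totals[transcript_id] += end - start + 1
    # On equal lengths, the lexicographically largest ID wins, which
    # simply keeps the result deterministic.
    best = max(totals, key=lambda t: (totals[t], t))
    return best, totals[best]


# The two CDS entries of gene X from the question.
gene_x = [("ABC1.1", 75221, 76890), ("ABC1.2", 75221, 76908)]
best_id, best_length = longest_isoform(gene_x)
print(best_id, best_length)  # ABC1.2 1688
```

The shorter isoform does not have to be deleted from the database for this to work; skipping it at analysis time leaves the original annotation intact.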