Hey guys,
I've written this Python code:
import argparse

import pandas as pd

parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("-h", "--help", action="help", default=argparse.SUPPRESS, help="Get partial gff given a pattern on Names field")
parser.add_argument("-g", help="-g: gff file", required=True)
parser.add_argument("-l", help="-l: list of patterns to search on Names gff field", required=True)
parser.add_argument("-o", help="-o: output file", required=True)
args = parser.parse_args()

# make a list with the IDs (one pattern per line)
terms = []
with open(args.l) as f:
    for l in f:
        terms.append(l.rstrip("\n"))

# open gff file, skipping the first 35 lines
m_names = ('seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'Names')
df4 = pd.read_csv(args.g, names=m_names, sep="\t", skiprows=35)

# save partial gff file: keep rows whose Names field matches any pattern
df5 = df4[df4['Names'].str.contains('|'.join(terms), na=False)]
df5.to_csv(args.o, index=False, header=False, sep="\t")
I basically filter a large gff based on a list of patterns found in the last gff field, which I call 'Names'.
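A side note on the filter itself: str.contains treats the joined string as a regular expression by default, so the dots in the IDs match any character. A minimal sketch of escaping the patterns first, assuming the IDs are meant to be matched literally (the near-miss ID in the second row is made up for illustration):

```python
import re

import pandas as pd

terms = ['XP_037652843.1', 'XP_037652864.1']

# Stand-in for the 'Names' column; the second entry is a hypothetical near-miss ID
names = pd.Series([
    'ID=cds-XP_037652843.1;Name=XP_037652843.1',
    'ID=cds-XP_037652843x1;Name=XP_037652843x1',
    None,
])

# Unescaped: "." is a regex wildcard, so the near-miss also matches
raw_hits = names.str.contains('|'.join(terms), na=False)

# Escaped: every pattern is matched as a literal string
escaped_hits = names.str.contains('|'.join(re.escape(t) for t in terms), na=False)

print(raw_hits.tolist())
print(escaped_hits.tolist())
```

This probably does not matter much for real accession numbers, but it avoids surprises if a pattern ever contains other regex characters.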
The funny thing is, if I have a list of IDs such as:
['XP_037652843.1', 'XP_037652864.1']
And my original gff is (note columns 4 and 5):
NC_051307.1 Gnomon CDS 111202 111993 . + 0 ID=cds-XP_037660565.1;Parent=rna-XM_037804637.1;Dbxref=GeneID:119510355,Genbank:XP_037660565.1;Name=XP_037660565.1;gbkey=CDS;gene=LOC119510355;product=vegetative cell wall protein gp1-like;protein_id=XP_037660565.1
Once I run the code, I get a ".0" appended to the values in the fourth and fifth columns, like this (note columns 4 and 5):
NC_051307.1 Gnomon CDS 111202.0 111993.0 . + 0 ID=cds-XP_037660565.1;Parent=rna-XM_037804637.1;Dbxref=GeneID:119510355,Genbank:XP_037660565.1;Name=XP_037660565.1;gbkey=CDS;gene=LOC119510355;product=vegetative cell wall protein gp1-like;protein_id=XP_037660565.1
So it seems that as soon as I run the code, Python converts my ints into floats? Does anyone know why, and what I should do to prevent it? Thank you so much! =)
Hi,
Based on this:
https://stackoverflow.com/questions/39666308/pd-read-csv-by-default-treats-integers-like-floats
pandas reads a column as float when it contains missing values, because NaN is itself a float. I guess what you could do to handle NaN and integer values at the same time in a column is:
df4 = pd.read_csv(args.g, names=m_names, sep="\t", skiprows=35, dtype={'start': 'Int32', 'end': 'Int32'})
I have not tried this one, but it seems to do the trick.
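To illustrate, here is a small sketch comparing the default behaviour with the nullable 'Int32' dtype. The two-row input is made up for the demo (the second row has empty start/end fields, standing in for whatever row in the real file lacks coordinates), so treat it as an assumption rather than a reproduction of the original gff:

```python
import io

import pandas as pd

m_names = ('seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'Names')

# Hypothetical input: the second row has no start/end values
gff_text = (
    "NC_051307.1\tGnomon\tCDS\t111202\t111993\t.\t+\t0\tName=XP_037660565.1\n"
    "NC_051307.1\tGnomon\tCDS\t\t\t.\t+\t.\tName=placeholder\n"
)

# Default parsing: the missing values force 'start' and 'end' to float64
default_df = pd.read_csv(io.StringIO(gff_text), names=m_names, sep="\t")

# Nullable integer dtype: missing values become <NA> and the column stays integer
nullable_df = pd.read_csv(io.StringIO(gff_text), names=m_names, sep="\t",
                          dtype={'start': 'Int32', 'end': 'Int32'})

# to_csv without a path returns the output as a string
default_out = default_df.to_csv(index=False, header=False, sep="\t")
nullable_out = nullable_df.to_csv(index=False, header=False, sep="\t")

print(default_df['start'].dtype, nullable_df['start'].dtype)
print(default_out)
print(nullable_out)
```

If the missing values come from comment lines inside the gff, passing comment="#" to read_csv might be another way to avoid them, but I have not checked that against your file.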
Best, Armin