I have a csv file like this:
subject,query
VBAnDirWR_KB672490,INVdsDNA_NC_029032
VBAnDirWR_KB672490,VIRdsDNA_NC_029032
VBAnDirWR_KB672490,VIRdsDNA_NC_029032
VBAnDirWR_KB672490,VIRdsDNA_NC_029032
VBAnDirWR_KB672490,VIRdsDNA_NC_029032
VBAnDirWR_KB672490,BACdsDNA_NC_029032
VBAnDirWR_KB672491,VIRdsDNA_NC_029032
VBAnDirWR_KB672491,VIRdsDNA_NC_029032
VBAnDirWR_KB672491,BACdsDNA_NC_029032
VBAnDirWR_KB672491,BACdsDNA_NC_029032
VBAnDirWR_KB672491,INVdsDNA_NC_029032
VBAnDirWR_KB672492,BACdsDNA_NC_029032
VBAnDirWR_KB672492,VIRdsDNA_NC_029032
VBAnDirWR_KB672492,VIRdsDNA_NC_029032
VBAnDirWR_KB672492,VIRdsDNA_NC_029032
VBAnDirWR_KB672492,VIRdsDNA_NC_029032
As we can see, are 5 queries per subject (results of -num_alignments 5, in BLAST), I need recovery only subjects IDs that have the 'VIR' tag in 3 or more queries, so I wrote (or I tried wrote xD), a python script to do this:
import pandas as pd
list_subjects = []
vir = 0
count = 0
# loading archive...
blast = pd.read_csv('blast_teste.csv')
# converting in a data frame...
blast_df = pd.DataFrame(blast)
# separing columns in subject and query
subject = blast_df.iloc[:,0]
query = blast_df.iloc[:,1]
# reading subject and query at the same time...
for i,c in zip(subject,query):
if count < 5: # to count viral tag each five elements.
if vir < 3 and 'VIR' in c: # if tag in query
vir += 1
count += 1
elif vir >= 3: # if each 5 queries, 3 are viral, retrieve the subject ID
list_subjects.append(i)
count += 1
else:
count += 1
elif count == 5: # when count reach in five, the count is reseted to start a new step of analysis
count == 0
vir == 0
print(list_subjects)
but this code returns: ['VBAnDirWR_KB672490'] and should be ['VBAnDirWR_KB672490','VBAnDirWR_KB672492'] (because those 2 subjects have more than 3 queries with 'VIR' tag.
can anybody help me?
Thanks guys, both suggestions are working!
Best,