Question

Filtering BLAST results

0

Entering edit mode

6.2 years ago

flogin ▴ 280

I have a csv file like this:

subject,query

VBAnDirWR_KB672490,INVdsDNA_NC_029032

VBAnDirWR_KB672490,VIRdsDNA_NC_029032

VBAnDirWR_KB672490,VIRdsDNA_NC_029032

VBAnDirWR_KB672490,VIRdsDNA_NC_029032

VBAnDirWR_KB672490,VIRdsDNA_NC_029032

VBAnDirWR_KB672490,BACdsDNA_NC_029032

VBAnDirWR_KB672491,VIRdsDNA_NC_029032

VBAnDirWR_KB672491,VIRdsDNA_NC_029032

VBAnDirWR_KB672491,BACdsDNA_NC_029032

VBAnDirWR_KB672491,BACdsDNA_NC_029032

VBAnDirWR_KB672491,INVdsDNA_NC_029032

VBAnDirWR_KB672492,BACdsDNA_NC_029032

VBAnDirWR_KB672492,VIRdsDNA_NC_029032

VBAnDirWR_KB672492,VIRdsDNA_NC_029032

VBAnDirWR_KB672492,VIRdsDNA_NC_029032

VBAnDirWR_KB672492,VIRdsDNA_NC_029032

As we can see, are 5 queries per subject (results of -num_alignments 5, in BLAST), I need recovery only subjects IDs that have the 'VIR' tag in 3 or more queries, so I wrote (or I tried wrote xD), a python script to do this:

import pandas as pd
list_subjects = []
vir = 0
count = 0

# loading archive...
blast = pd.read_csv('blast_teste.csv')
# converting in a data frame...
blast_df = pd.DataFrame(blast)
# separing columns in subject and query
subject = blast_df.iloc[:,0]
query = blast_df.iloc[:,1]
# reading subject and query at the same time...
for i,c in zip(subject,query): 
    if count < 5: # to count viral tag each five elements.
        if vir < 3 and 'VIR' in c: # if tag in query
            vir += 1 
            count += 1
        elif vir >= 3: # if each 5 queries, 3 are viral, retrieve the subject ID
            list_subjects.append(i)
            count += 1
        else:
            count += 1
    elif count == 5: # when count reach in five, the count is reseted to start a new step of analysis
        count == 0
        vir == 0
print(list_subjects)

but this code returns: ['VBAnDirWR_KB672490'] and should be ['VBAnDirWR_KB672490','VBAnDirWR_KB672492'] (because those 2 subjects have more than 3 queries with 'VIR' tag.

can anybody help me?

alignment blast python • 1.6k views

ADD COMMENT • link 6.2 years ago by flogin ▴ 280

0

Entering edit mode

Thanks guys, both suggestions are working!

Best,

ADD REPLY • link 6.2 years ago by flogin ▴ 280

score 4 · Accepted Answer · 2019-02-21

In Perl, using the same logic suggested by Carambakaracho .

#!/usr/bin/perl

use strict;
use warnings;

my %filter = ();
while (<>){
    my( $subject, $query ) = split ",";
    if( $query =~ /^VIR/ ){ $filter{$subject}++; }
}

foreach my $key ( keys %filter ){
    if( $filter{$key} > 2 ){ print "$key\n" };
}

score 3 · Accepted Answer · 2019-02-21

This looks much like some sort of constructed problem, does it? (Not to say "Homework") In any way, you're half way there - You need to reset the count variable, though it's quite useless. Simpler convert list_subjects to a dictionary, subject as key, 'VIR' count as value, increment each 'VIR' count per subject, return subjects with count >= 3