Question

Fastest way to subset dataframe in python

0

Entering edit mode

6.9 years ago

Bastien Hervé 5.9k

I have a blast result like this (3 million results) (sorry I don't know how to display it better than this...) :

(blast header : ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3

M02945:227:000000000-BHPRH:1:1101:17816:1743 adaptator_sequence 100.00 18 0 0 208 225 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:17816:1743 adaptator_sequence 100.00 8 0 0 209 216 10 3 0.037 16.4

M02945:227:000000000-BHPRH:1:1101:17816:1743 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:17816:1743 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:15392:1766 adaptator_sequence 100.00 18 0 0 255 272 2 19 5e-08 36.2

M02945:227:000000000-BHPRH:1:1101:15392:1766 adaptator_sequence 100.00 8 0 0 256 263 10 3 0.045 16.4

M02945:227:000000000-BHPRH:1:1101:15392:1766 vector_sequence 100.00 11 0 0 157 167 1 11 7e-04 22.3

M02945:227:000000000-BHPRH:1:1101:15392:1766 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

...

My aim is to subset blast result by "qseqid" to process some calculation in python.

I put the whole blast result in a dataframe, named "df_blast" :

f_blast= open("blast.csv", 'rt')
try:
    df_blast = pd.read_csv(f_blast, sep='\t', header=None, index_col=None, names=["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
finally:
    f_blast.close()

Over this dataframe I have a for loop that give me the incoming "qseqid", something like this

for index in ["M02945:227:000000000-BHPRH:1:1101:14031:1695", "M02945:227:000000000-BHPRH:1:1101:17816:1743", "M02945:227:000000000-BHPRH:1:1101:15392:1766"] :

First I tried in my for loop :

df_blast_line = df_blast.loc[[index]]

I got on my first step :

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3

This solution works, but it is too much time consuming, around 5h30 (using good indexation and filtering, 4h30, still too much)

I also try to drop rows after subsetting :

df_blast.drop([index], inplace=True)

But python is taking too much time to delete these rows...

Second, I tried just before my loop to subset my dataframe into a dictionnary :

dic_blast={}
for index, line in df_blast.iterrows():
    if line['qseqid'] not in dic_blast :
        dic_blast[line['qseqid']] = pd.DataFrame([line], columns=["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
    else:
        dic_blast[line['qseqid']] = dic_blast[line['qseqid']].append([line], ignore_index=True)

And then in the "for loop" :

df_blast_line = dic_blast_sequences[index]

This solution is way better, around 1h15 to build the dictionnary and few minutes for to "for loop". Winning 3h.

Does someone know a better way to go faster ? I'm very interested I have multiple blast results to process...

Thanks !

python blast dataframe • 10k views

ADD COMMENT • link updated 6.9 years ago by shoujun.gu ▴ 380 • written 6.9 years ago by Bastien Hervé 5.9k

1

Entering edit mode

To me, it's a bit unclear what your desired output is. A for loop is for sure not the way to go for a pandas DataFrame. Perhaps groupby() is what you are looking for?

Note that you can do pd.read_csv() directly from a file, there is no need to first open a handle. df_blast = pd.read_csv("blast.csv",...)

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

Entering edit mode

I have a blast result in tsv. I have a list of id in a separate file (corresponding to qseqid in the blast file). Here, are the first 5 lines of the blast result (same as above) :

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2

M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4

M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293

M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3

I want to know by example for each "qseqid", if the order smu_sequence-vector_sequence-adaptator_sequence on my query is respected.

In this case I have 9-156 for smu_sequence, 157-167 for vector_sequence and 223-240 for adaptator_sequence. So 9 <156 < 157 < 167 < 223 < 240, qseqid = M02945:227:000000000-BHPRH:1:1101:14031:1695 pass the test, I keep it. Then I go for another qseqid, same tests etc...

I'm not sure I could do all these tests in a groupby but i'm just not familiar with.

Thanks for the tips !

ADD REPLY • link 6.9 years ago by Bastien Hervé 5.9k

score 2 · Accepted Answer · 2018-01-24

2

Entering edit mode

6.9 years ago

shoujun.gu ▴ 380

I think what you need is some code like this:

   import  pandas as pd

    df=pd.read_csv('your blast file')
    labels=df['qseqid'].unique()

    for label in labels:
        df_temp=df.loc[df['qseqid']==label,:]
        # do whatever you want to do

ADD COMMENT • link 6.9 years ago by shoujun.gu ▴ 380

1

Entering edit mode

I would suggest setting the qseqid column as index for the DataFrame and just slice like that, for further simplification:

df=pd.read_csv('your blast file', index_col='qseqid')

for label in df.index.unique():
    df_temp=df.loc[label]

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

Entering edit mode

With the qseqid column as index, it takes 40 minutes to compute.

I found an other solution that takes 30 minutes

df_blast = pd.read_csv('blast.csv' index_col='qseqid')
df_blast.sort_index(inplace=True)
previous_index=""

for index, line in df_blast.iterrows():
    if previous_index == "" :
        previous_index = index
        df_temp = pd.DataFrame([line], index=[index])
    else:
        if previous_index == index:
            df_temp = df_temp.append([line])
        else:
            #Do stuff here
            previous_index = index
            df_temp = pd.DataFrame([line], index=[index])

Thanks to both of you !

ADD REPLY • link 6.9 years ago by Bastien Hervé 5.9k