I have a blast result like this (3 million results) (sorry I don't know how to display it better than this...) :
(blast header : ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2
M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4
M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3
M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293
M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3
M02945:227:000000000-BHPRH:1:1101:17816:1743 adaptator_sequence 100.00 18 0 0 208 225 2 19 4e-08 36.2
M02945:227:000000000-BHPRH:1:1101:17816:1743 adaptator_sequence 100.00 8 0 0 209 216 10 3 0.037 16.4
M02945:227:000000000-BHPRH:1:1101:17816:1743 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3
M02945:227:000000000-BHPRH:1:1101:17816:1743 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293
M02945:227:000000000-BHPRH:1:1101:15392:1766 adaptator_sequence 100.00 18 0 0 255 272 2 19 5e-08 36.2
M02945:227:000000000-BHPRH:1:1101:15392:1766 adaptator_sequence 100.00 8 0 0 256 263 10 3 0.045 16.4
M02945:227:000000000-BHPRH:1:1101:15392:1766 vector_sequence 100.00 11 0 0 157 167 1 11 7e-04 22.3
M02945:227:000000000-BHPRH:1:1101:15392:1766 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293
...
My aim is to subset blast result by "qseqid" to process some calculation in python.
I put the whole blast result in a dataframe, named "df_blast" :
f_blast= open("blast.csv", 'rt')
try:
df_blast = pd.read_csv(f_blast, sep='\t', header=None, index_col=None, names=["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
finally:
f_blast.close()
Over this dataframe I have a for loop that give me the incoming "qseqid", something like this
for index in ["M02945:227:000000000-BHPRH:1:1101:14031:1695", "M02945:227:000000000-BHPRH:1:1101:17816:1743", "M02945:227:000000000-BHPRH:1:1101:15392:1766"] :
First I tried in my for loop :
df_blast_line = df_blast.loc[[index]]
I got on my first step :
M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2
M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4
M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3
M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293
M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3
This solution works, but it is too much time consuming, around 5h30 (using good indexation and filtering, 4h30, still too much)
I also try to drop rows after subsetting :
df_blast.drop([index], inplace=True)
But python is taking too much time to delete these rows...
Second, I tried just before my loop to subset my dataframe into a dictionnary :
dic_blast={}
for index, line in df_blast.iterrows():
if line['qseqid'] not in dic_blast :
dic_blast[line['qseqid']] = pd.DataFrame([line], columns=["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"])
else:
dic_blast[line['qseqid']] = dic_blast[line['qseqid']].append([line], ignore_index=True)
And then in the "for loop" :
df_blast_line = dic_blast_sequences[index]
This solution is way better, around 1h15 to build the dictionnary and few minutes for to "for loop". Winning 3h.
Does someone know a better way to go faster ? I'm very interested I have multiple blast results to process...
Thanks !
To me, it's a bit unclear what your desired output is. A for loop is for sure not the way to go for a pandas DataFrame. Perhaps groupby() is what you are looking for?
Note that you can do
pd.read_csv()
directly from a file, there is no need to first open a handle. df_blast = pd.read_csv("blast.csv",...)I have a blast result in tsv. I have a list of id in a separate file (corresponding to qseqid in the blast file). Here, are the first 5 lines of the blast result (same as above) :
M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 18 0 0 223 240 2 19 4e-08 36.2
M02945:227:000000000-BHPRH:1:1101:14031:1695 adaptator_sequence 100.00 8 0 0 224 231 10 3 0.040 16.4
M02945:227:000000000-BHPRH:1:1101:14031:1695 vector_sequence 100.00 11 0 0 157 167 1 11 6e-04 22.3
M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 148 0 0 9 156 1 148 1e-85 293
M02945:227:000000000-BHPRH:1:1101:14031:1695 smu_sequence 100.00 10 0 0 67 76 10 1 0.003 20.3
I want to know by example for each "qseqid", if the order smu_sequence-vector_sequence-adaptator_sequence on my query is respected.
In this case I have 9-156 for smu_sequence, 157-167 for vector_sequence and 223-240 for adaptator_sequence. So 9 <156 < 157 < 167 < 223 < 240, qseqid = M02945:227:000000000-BHPRH:1:1101:14031:1695 pass the test, I keep it. Then I go for another qseqid, same tests etc...
I'm not sure I could do all these tests in a groupby but i'm just not familiar with.
Thanks for the tips !