Hi guys,
Currently I'm working on a lot of BLASTn analyses. I have an output in tabular format (argument -outfmt 6) with more than 170 thousand lines. This link: https://textuploader.com/11krp has an example with the pieces of information I want: query, subject, subject start, subject end and score (in that order).
As we can see, different queries can be matched to the same subject, at the same or at different positions.
In the next steps, I'm going to extract those regions of the subject using the start and end positions, but if I make that extraction with the data as it is, I'm going to recover a lot of redundant sequences.
In my point of view, there are 4 cases of redundant matches:
1 - Same region of the subject: same s_start and same s_end, with different scores;
e.g. lines 29, 33, 37 and 43
2 - Almost the same region of the subject (1): different s_start, equal s_end, with different scores;
e.g. lines 26 (s_start = 928719) and 18, 30, 34, 38 (s_start = 928718)
3 - Almost the same region of the subject (2): equal s_start, different s_end, with different scores;
e.g. lines 18, 30, 34, 38 (s_end = 929459) and 44 (s_end = 929456)
4 - Different lengths of the same region: s_start and s_end both different, but covering the same subject region, with different scores.
e.g. lines 17 (s_start = 922442, s_end = 923192) and 29, 33, 37, 43 (s_start = 922444, s_end = 923190)
So... I have a little experience with Python, and I wrote the following script:
import csv

# opening the input file
with open('blast_test.csv') as csv_file:
    subject_dict = {}  # dictionary to store the original information
    csv_reader = csv.reader(csv_file, delimiter=',')
    # creating a dictionary with the subjects' information,
    # reading the file line by line
    for row in csv_reader:
        print(row)
        # assigning each column to a variable; the subject name is
        # modified to include the match coordinates
        query, subject_old, s_start, s_end, score = row[0], row[1], row[2], row[3], row[4]
        subject_new = subject_old + '_' + s_start + '_' + s_end
        # inserting the subject into the dictionary (score included,
        # since it is needed for the output)
        subject_dict[subject_new] = [subject_old, query, s_start, s_end, score]

# testing the dictionary
for k, v in subject_dict.items():
    print(k, ':', v)

# TODO: making comparisons to remove the redundant matches

# creating the output
with open('blast_test_filtred.csv', mode='w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for subject_new, (subject_old, query, s_start, s_end, score) in subject_dict.items():
        writer.writerow([subject_old, s_start, s_end, score, query])
My logic is:
1 - Create a dictionary with all the information, changing the subject name (just to make the outputs easier for me to follow);
2 - Remove the redundant information using the criteria of the four cases cited above;
3 - Write this new information to an output file.
To remove this redundant information, I'm thinking of creating a threshold of 10 nucleotides upstream and downstream of each region boundary (start and end), then comparing the regions that share the same original subject name (subject_old) and selecting the region with the best score (so that all distinct regions are still recovered).
Can anyone explain to me how I can perform this step?
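To make it concrete, this is roughly the comparison I have in mind, as a rough sketch in plain Python (dedup_hits, TOLERANCE and the example values are just my own names, not anything final):

```python
import csv
from collections import defaultdict

TOLERANCE = 10  # nucleotides allowed up/downstream of each boundary


def dedup_hits(rows):
    """rows: (query, subject, s_start, s_end, score) tuples.
    Keep, per subject, only the best-scoring hit of each ~identical region."""
    # group hits by the original subject name
    hits_by_subject = defaultdict(list)
    for query, subject, s_start, s_end, score in rows:
        hits_by_subject[subject].append(
            (int(s_start), int(s_end), float(score), query))

    best_hits = []
    for subject, hits in hits_by_subject.items():
        # highest score first, so the first hit kept for a region is the best
        hits.sort(key=lambda h: h[2], reverse=True)
        kept = []
        for start, end, score, query in hits:
            # redundant if both boundaries fall within the tolerance
            # of a region that was already kept
            if not any(abs(start - ks) <= TOLERANCE and abs(end - ke) <= TOLERANCE
                       for ks, ke, _, _ in kept):
                kept.append((start, end, score, query))
        best_hits += [(subject, s, e, sc, q) for s, e, sc, q in kept]
    return best_hits
```

The rows could come straight from csv.reader over the BLAST table; is this a reasonable direction, or is there a better way?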
Thanks.
The easiest way is to use the pandas library in Python for such tabular data. Once the data is in a DataFrame format, it will be easier for you to group, filter and aggregate it. There is a range of pandas tutorials, blogs and worked examples available that will help you with this.

Thanks Sej, I'm gonna check this!
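For instance, the whole BLAST table can be loaded into a DataFrame like this (the column names below are the -outfmt 6 defaults, and the inline example row is made up; in practice you would point read_csv at your own file):

```python
import io

import pandas as pd

# default column order of BLAST -outfmt 6 (tab-separated, no header)
cols = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
        "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# in practice: df = pd.read_csv("my_blast_output.tsv", sep="\t", names=cols)
example = "q1\tchr1\t98.5\t750\t5\t1\t1\t750\t922444\t923190\t0.0\t1200\n"
df = pd.read_csv(io.StringIO(example), sep="\t", names=cols)
```

Passing names= is important because an -outfmt 6 file has no header row.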
So... I'm working with pandas now, with all the columns of the BLAST output.
But only the column names are returned...
It would be easier to use groupby on your DataFrame and then sort by the column of interest. Also see https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html and https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas for comparing column values in pandas.
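For example, something like this (illustrative data; sorting by bitscore before grouping, so that .first() keeps the best-scoring hit of each group):

```python
import pandas as pd

# a few made-up hits: two identical regions on chr1 with different scores
df = pd.DataFrame({
    "sseqid":   ["chr1", "chr1", "chr2"],
    "sstart":   [922444, 922444, 100],
    "send":     [923190, 923190, 900],
    "bitscore": [500.0, 480.0, 300.0],
})

# sort so the best hit of each group comes first...
df_sorted = df.sort_values("bitscore", ascending=False)
# ...then keep one row per (sseqid, sstart, send) group
best = df_sorted.groupby(["sseqid", "sstart", "send"], as_index=False).first()
```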
I think groupby may not apply to my problem because, following the guide from pydata.org, filtering works in two ways: discarding data that belongs to groups with only a few members, or filtering out data based on the group sum or mean. And I want to make iterative comparisons between several columns...
Using the method .drop_duplicates() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html), I managed to remove the duplicate matches based on sseqid, sstart and send:
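The kind of call I mean, on illustrative data (note that sorting by bitscore first makes drop_duplicates keep the best-scoring duplicate, which already partly addresses the second remaining step):

```python
import pandas as pd

# made-up hits: an exact duplicate region plus one distinct region
df = pd.DataFrame({
    "sseqid":   ["chr1", "chr1", "chr1"],
    "sstart":   [922444, 922444, 930000],
    "send":     [923190, 923190, 931000],
    "bitscore": [480.0, 500.0, 300.0],
})

# keep the highest-scoring row of each exact (sseqid, sstart, send) duplicate
dedup = (df.sort_values("bitscore", ascending=False)
           .drop_duplicates(subset=["sseqid", "sstart", "send"], keep="first"))
```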
But there are 2 more steps that I could not execute yet:
- make those comparisons within a range of nucleotides (-10 and +10 of each sstart and send)
- keep the row with the highest bitscore...
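One way to approximate the ±10 nt tolerance in pandas is to round the coordinates into coarse bins before dropping duplicates. This is only a sketch and it is approximate: two hits whose boundaries straddle a bin edge can be missed, so a pairwise interval comparison (like the plain-Python loop above) is more exact. Column names assume the -outfmt 6 defaults; the data is made up:

```python
import pandas as pd

# made-up hits: two near-identical regions (boundaries differ by 2 nt)
# and one distant region
df = pd.DataFrame({
    "sseqid":   ["chr1", "chr1", "chr1"],
    "sstart":   [922442, 922444, 950000],
    "send":     [923192, 923190, 951000],
    "bitscore": [480.0, 500.0, 300.0],
})

BIN = 21  # a ±10 nt tolerance -> 21-nt-wide coordinate bins (approximate)
df["start_bin"] = df["sstart"] // BIN
df["end_bin"] = df["send"] // BIN

# within each (sseqid, start_bin, end_bin) group, keep the best bitscore
best = (df.sort_values("bitscore", ascending=False)
          .drop_duplicates(subset=["sseqid", "start_bin", "end_bin"])
          .drop(columns=["start_bin", "end_bin"]))
```

If the binning turns out to be too lossy, grouping by sseqid and running the pairwise tolerance check inside each group would be the exact (but slower) alternative.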