Is it possible to isolate and merge overlapping hits from a BLAST search Aligned Fasta file?
1
0
Entering edit mode
3.2 years ago
Leah ▴ 10

Hello all,

Hope you're well. I started a PhD in Jan this year and frankly I am struggling. I'm coming into the second week of trying to figure out how to attempt this problem and it's honestly starting to get to me a bit - the current task I'm working on is not my main project and every day I don't make progress is a delay on my actual project. So! A bit worried to say the least (!)

Here is the issue: I ran some sequences using BLASTn. I downloaded the Aligned Fasta file for each. I have sorted reads by if they are a single read (ie the accession number only appears once) or multiple reads of the same entry (example below, edited for brevity):

SINGLE HIT

>XM_040233432|1:208-654_PREDICTED:_Oryx_dammah_lysozyme_C_kidney_isozyme_(LOC120857726)_mRNA
ATGAAGGTTCTCCTTATTCTGGGGCTTCTCCTCCTTTCAGTCGCTGTCCAAGGCAAGGTCTTTGAGAGATGTGAGCT...

MULTIPLE HIT

>XM_042246932|1:88-371_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182)
TGAGGGCTGTCCTTATTCTGGGGCTTCTCCTCCTTTCTGTCACAGTCCAGGGCAAGAAATTTCAGAGGTGTGAGCTTGCC..
>XM_042246932|1:388-668_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182)
GGTTGTGTTTGACCAAATGGGAAAGCAGTTATAACACAAAAACGACAAACTACAATCCTAACAGTGAAAGCACTGATTAT...

This is so I can use the seqkit to combine the multiple hit entries. Once I've done that, I'll be filtering based on query sequence length (using bioawk which I know how to do) and carrying on with the pipeline (alignment, analysis, ect.).

The problem is, some of these multiple hits are overlapping and I have no idea how to treat them! For example:

>XM_023558748|1:**49-362**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513)
ATGAAGGCTCTCCTTATTCTGGGCCTTCTCTTCCTTTCTGGCACTGCCCAGGGCAAGGTCTTTGAAAGGTGCGAGTTGGC..

>XM_023558748|1:**349-416**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513)
GGGTGGCATGGAGAAACCATTGTCAGAACCACGATGTCACTCAGTACATTCAAGGCTGTGGAGTCTAA

My two questions:

  1. Is there a way I could isolate these overlapping sequences from the nonoveralpping multiple hits? I just can't fathom how I can achieve that right now with my current skill set.

  2. Aside from using the EMBOSS merger command, is there another way for me to merge these overlapping files together? I did give it a go and it had wonderfully merged the above Elephant example correctly. But I am going to be dealing with thousands of hits, is it possible to write a python script to run merger through the overlapping files? I realise this is a tall order. I'd try to configure this script to use seqkit to merge the non-overlapping multiple hits too.

Thank you kindly for reading - I have trawled these forums the last few months and frankly I only think I've managed to get where I am thanks to your answers. Any advice, suggestions or critiques are gratefully received.

GenBank Fasta Linux Python BLAST • 1.8k views
ADD COMMENT
0
Entering edit mode

Not sure if I understand what are you trying to do. But, if you get your blastn results as a table (outfmt 6) it would be easier to analyze them. You will be able to classify/filter blastn hits by evalue, score, etc. and also, to identify querys matching multiple subjects.

ADD REPLY
0
Entering edit mode

Thank you for the response Buffo - I used the online BLAST instead of command line. I did suggest to my supervisor about going about it the command line way but we went with online. I am kicking myself slightly as it does seem to be more commonly used.

It does sound like it would help but not necessarily solve the issue I am having which is how to treat these overlapping hits. While I'm reticent to recollect my data and set myself back a few weeks, if it means making this issue easier to solve, I'll give it a go. Thank you for taking the time out to answer

ADD REPLY
0
Entering edit mode
3.2 years ago
Jiyao Wang ▴ 380

I am not sure whether this will help. When you get the result page, a link near the top left allows you to download the search result in a JSON blob to be parsed.

ADD COMMENT
0
Entering edit mode

Thank you kindly for your response Jiyao Wang - I've never come across JSON before but it looks pretty interesting. I am going to attempt to have a go another way but thank you for the suggestion and I appreciate you taking the time out to answer

ADD REPLY

Login before adding your answer.

Traffic: 2121 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6