Hello all,
Hope you're well. I started a PhD in Jan this year and frankly I am struggling. I'm coming into the second week of trying to figure out how to attempt this problem and it's honestly starting to get to me a bit - the current task I'm working on is not my main project and every day I don't make progress is a delay on my actual project. So! A bit worried to say the least (!)
Here is the issue: I ran some sequences using BLASTn. I downloaded the Aligned Fasta file for each. I have sorted reads by if they are a single read (ie the accession number only appears once) or multiple reads of the same entry (example below, edited for brevity):
SINGLE HIT
>XM_040233432|1:208-654_PREDICTED:_Oryx_dammah_lysozyme_C_kidney_isozyme_(LOC120857726)_mRNA
ATGAAGGTTCTCCTTATTCTGGGGCTTCTCCTCCTTTCAGTCGCTGTCCAAGGCAAGGTCTTTGAGAGATGTGAGCT...
MULTIPLE HIT
>XM_042246932|1:88-371_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182)
TGAGGGCTGTCCTTATTCTGGGGCTTCTCCTCCTTTCTGTCACAGTCCAGGGCAAGAAATTTCAGAGGTGTGAGCTTGCC..
>XM_042246932|1:388-668_PREDICTED:_Ovis_aries_lysozyme_C-3-like_(LOC106990182)
GGTTGTGTTTGACCAAATGGGAAAGCAGTTATAACACAAAAACGACAAACTACAATCCTAACAGTGAAAGCACTGATTAT...
This is so I can use the seqkit to combine the multiple hit entries. Once I've done that, I'll be filtering based on query sequence length (using bioawk which I know how to do) and carrying on with the pipeline (alignment, analysis, ect.).
The problem is, some of these multiple hits are overlapping and I have no idea how to treat them! For example:
>XM_023558748|1:**49-362**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513)
ATGAAGGCTCTCCTTATTCTGGGCCTTCTCTTCCTTTCTGGCACTGCCCAGGGCAAGGTCTTTGAAAGGTGCGAGTTGGC..
>XM_023558748|1:**349-416**_PREDICTED:_Loxodonta_africana_lysozyme_C_kidney_isozyme_(LOC100666513)
GGGTGGCATGGAGAAACCATTGTCAGAACCACGATGTCACTCAGTACATTCAAGGCTGTGGAGTCTAA
My two questions:
Is there a way I could isolate these overlapping sequences from the nonoveralpping multiple hits? I just can't fathom how I can achieve that right now with my current skill set.
Aside from using the EMBOSS merger command, is there another way for me to merge these overlapping files together? I did give it a go and it had wonderfully merged the above Elephant example correctly. But I am going to be dealing with thousands of hits, is it possible to write a python script to run merger through the overlapping files? I realise this is a tall order. I'd try to configure this script to use seqkit to merge the non-overlapping multiple hits too.
Thank you kindly for reading - I have trawled these forums the last few months and frankly I only think I've managed to get where I am thanks to your answers. Any advice, suggestions or critiques are gratefully received.
Not sure if I understand what are you trying to do. But, if you get your blastn results as a table (outfmt 6) it would be easier to analyze them. You will be able to classify/filter blastn hits by evalue, score, etc. and also, to identify querys matching multiple subjects.
Thank you for the response Buffo - I used the online BLAST instead of command line. I did suggest to my supervisor about going about it the command line way but we went with online. I am kicking myself slightly as it does seem to be more commonly used.
It does sound like it would help but not necessarily solve the issue I am having which is how to treat these overlapping hits. While I'm reticent to recollect my data and set myself back a few weeks, if it means making this issue easier to solve, I'll give it a go. Thank you for taking the time out to answer