Hello,
I was given nanopore sequencing reads (as a .fasta file) from a chimeric virus that was made from 2 different but similar viruses, and I am tasked with finding out which of the reads align with virus 1 and which with virus 2.
I trimmed and cleaned the data and also annotated the reads using a file that contains both the unique and the common genes of virus 1 and virus 2, which I also converted to a fasta file.
To find the percentage, I tried distance scoring with a script I wrote myself, but it takes too long, so I started looking for a tool that does this for me. Does anybody have any tools or ideas on what I can do to finally get the results I want?
Thank you very much for your help. I really do appreciate it :)
First of all, thank you very much for your recommendation, Dr. Andrews. That is a very cool tool. But I feel like I should've been more specific.
I just have an additional question. I've aligned against both of the gene sets using minimap2, following the normal alignment procedure.
I have to take these 2 files and compare them with each other. For example, out of the 300k reads that I have, maybe the first read matches 100% with both Virus1 and Virus2. I want to ignore this read and increment an ignored_counter by 1.
But if one score is higher than the other, for example, out of 500 nucleotides, 400 matched virus 1 and 490 matched virus 2, I want to write the gene name that I annotated, the percentage or score of the match, and the sequence to a text file. So in the 2nd example it would be:
Virus2 490/500 (or any other score like distancing score) ATCC...GCAAC
or something along these lines. Is it possible to do this analysis with this build, or are there any other tools available for this, or do I have to write a script that does all of this?
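The comparison described above could be scripted roughly as follows. This is only a minimal sketch, and it assumes things the post does not state: that each minimap2 run was written in PAF format (minimap2's default output), that the number of matching bases is taken from PAF column 10, and that any tie between the two viruses is ignored, not just 100% ties. The function names `best_hits` and `compare_hits` and the `seqs` dictionary are hypothetical.

```python
def best_hits(paf_lines):
    """Keep the best alignment per read from PAF lines.

    Returns {read_name: (target_name, matching_bases, read_length)}.
    PAF columns used: 1 = read name, 2 = read length,
    6 = target name, 10 = number of matching bases.
    """
    best = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip blank or malformed lines
            continue
        read, read_len = fields[0], int(fields[1])
        target, matches = fields[5], int(fields[9])
        if read not in best or matches > best[read][1]:
            best[read] = (target, matches, read_len)
    return best


def compare_hits(hits1, hits2, seqs=None):
    """Compare per-read best hits against virus 1 and virus 2.

    Returns (rows, ignored). Each row is
    (virus_label, gene_name, "matches/read_length", sequence).
    Reads with equal match counts are counted in `ignored` (assumption).
    A read aligned to only one virus scores 0 against the other.
    """
    seqs = seqs or {}
    rows, ignored = [], 0
    for read in sorted(set(hits1) | set(hits2)):
        gene1, m1, len1 = hits1.get(read, (None, 0, 0))
        gene2, m2, len2 = hits2.get(read, (None, 0, 0))
        if m1 == m2:
            ignored += 1
            continue
        if m1 > m2:
            label, gene, matches, length = "Virus1", gene1, m1, len1
        else:
            label, gene, matches, length = "Virus2", gene2, m2, len2
        rows.append((label, gene, f"{matches}/{length}", seqs.get(read, "")))
    return rows, ignored


# Possible usage (file names are placeholders):
# with open("virus1.paf") as f1, open("virus2.paf") as f2:
#     rows, ignored = compare_hits(best_hits(f1), best_hits(f2))
# for row in rows:
#     print("\t".join(row))
```

Because each PAF file is read once and only one record per read is kept, this should stay manageable for a few hundred thousand reads. The read sequences are not in the PAF file, so they would have to be loaded from the fasta file separately and passed in as `seqs`.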
If you need to check on individual base results then perhaps using blast may be an option. That should also give you control over gap penalties etc., and parsing the output may be easier with -outfmt 6. Or magicblast, if you have lots of short input queries.
bbsplit will give you counts/percentages of ambiguous reads, which you can then handle however you'd like (see the -ambiguous and -ambiguous2 parameters for bbsplit, or -ambig for seal). You can toss those reads, assign them to one or both references, whatever. It'll spit out stats regardless of what you choose to do with the reads.
It sounds like it may be worth splitting the reads by virus first, then dealing with the gene annotations after the fact if you want counts or whatever for them.
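For the blast route, the tabular output is straightforward to parse. The sketch below assumes the default 12 columns of -outfmt 6 (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and keeps the highest-bitscore hit per read. The function name `best_blast_hits` is hypothetical, and picking the winner by bitscore is an assumption, not something the thread prescribes.

```python
def best_blast_hits(lines):
    """Return {qseqid: hit} with the highest-bitscore hit per query.

    Expects BLAST tabular lines with the default 12 outfmt 6 columns.
    """
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip blank or malformed lines
            continue
        hit = {
            "sseqid": fields[1],          # subject (reference/gene) name
            "pident": float(fields[2]),   # percent identity
            "length": int(fields[3]),     # alignment length
            "bitscore": float(fields[11]),
        }
        query = fields[0]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best
```

If both viruses are in the same blast database, the `sseqid` of the best hit tells you which reference a read preferred; reads whose top hits to both references have the same score would still need to be handled separately.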