Could anyone suggest me an easiest way to perform pairwise sequence alignment wherein my query sequence is the sequencing reads obtained from NGS and the subject sequence is a constant part expected to be found on the reads. So, i need to have a score and length of alignment as parameters to screen and filter. I would like to execute Smith-Waterman algorithm. I tried using python but it is getting complicated for me to find EMBOSS suite (which I finally downloaded) but then cannot execute it from python (tried WaterCommandline). And I am not even sure that this will give me a score and alignment length if I successfully manage to execute it. I am looking up in R now to see if it is possible and reiterate all the reads against a subject sequence. Then, I would like to splice the query sequence and store it in a file in a fasta format following the sequence alignment. Any ideas on how to conveniently do this will be appreciated.
It is unclear what you are exactly trying to do? Calling SNP? Instead of doing pairwise alignments one option could be to do NGS alignments as usual and then pull the reads out in the region you are interested in followed by converting them to fasta format and then do a multiple sequence alignment (MSA).
Firstly, I am searching for the sequence (subject) that I know is going to be present in the reads. I am interested in the reads that possess the subject sequence. I thought the best way was to do pairwise sequence alignment but I do not find a good platform for for performing these alignments. I also tried bio.pairwise2 in python but it does not align well. I hope i explained it well. What do you mean by NGS alignments? Are there any programs to do so?
Then, I remove the subject sequence part from the reads. Additionally, I would like to align the remaining part of reads to each other so that the redundant reads can be counted as '1' and then align the sorted reads to the genome. I plan to use BLAT/Bowtie for genome alignment
What I meant was alignment with a regular NGS data aligner like bbmap, bwa, bowtie2 etc. These are always going to be very efficient aligning millions of reads rather than blast other similarity search techniques. Once you have the alignment to "genome" (which could be your reduced region of interest though you may get some misalignments) you can extract reads that aligned to the region of interest.
Before you do that it sounds to me like you are expecting duplication/redundancy in this dataset. Actually
clumpify.sh
from BBMap suite may be the best way to handle that case. Take a look at this thread for the various use cases forclumpify
: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files