This problem was presented to me by a biologist: I have a .fastq file with about 1 million short DNA reads. The DNA templates that I sent for sequencing were designed something like this: ... ...GGTATNNNNNNNNNATGT... ... where the Ns are a randomized sequence of 9 nucleotide bases (A, T, G, or C).
I now need to align the reads, either de novo or to a reference genome (we have the reference genome), while ignoring the 9 randomized bases.
How should I go about this? What tools can I use? Are there any existing DNA/RNA alignment tools out there that can do this for me?
Thank you!
Do these randomized nucleotides have a meaning? Are they something like a UMI (unique molecular identifier)? Do you still need them downstream?
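If the randomized bases do need to be kept for later, one possible approach is to mask them before alignment and carry them along in the read name. The following is only a minimal sketch: it assumes the flanks are exactly GGTAT and ATGT as written in the question, that the FASTQ uses the plain four-lines-per-record layout, and that the flanks contain no sequencing errors. The names `mask_random_region` and `mask_fastq` are hypothetical helpers, not part of any existing tool.

```python
import re

# Flanks and length copied from the design in the question; the real
# construct may differ, so treat these values as placeholders.
LEFT_FLANK = "GGTAT"
RIGHT_FLANK = "ATGT"
RANDOM_LEN = 9

# Match exactly RANDOM_LEN bases that sit between the two flanks.
PATTERN = re.compile(
    f"(?<={LEFT_FLANK})[ACGTN]{{{RANDOM_LEN}}}(?={RIGHT_FLANK})"
)


def mask_random_region(seq):
    """Return (masked_seq, barcode); barcode is None if the flanks are absent."""
    match = PATTERN.search(seq)
    if match is None:
        return seq, None
    barcode = match.group(0)
    # Replace the randomized bases with N so read length and quality
    # string stay in sync.
    masked = seq[:match.start()] + "N" * RANDOM_LEN + seq[match.end():]
    return masked, barcode


def mask_fastq(in_path, out_path):
    """Write a copy of the FASTQ with the randomized region masked.

    Assumes an uncompressed, strictly 4-line-per-record FASTQ file.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline().rstrip("\n")
            if not header:
                break
            seq = fin.readline().rstrip("\n")
            plus = fin.readline().rstrip("\n")
            qual = fin.readline().rstrip("\n")
            masked, barcode = mask_random_region(seq)
            if barcode is not None:
                # Keep the randomized bases in the read name for later use.
                name, _, rest = header.partition(" ")
                header = f"{name}_{barcode}" + (f" {rest}" if rest else "")
            fout.write(f"{header}\n{masked}\n{plus}\n{qual}\n")
```

The masked file could then be given to whichever short-read aligner is already in use. How an aligner scores N positions varies between tools, so it is worth checking that before relying on this. Reads where the flanks are not found are written unchanged here; whether to keep or discard them is a decision for the experiment.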