Hi,
I am trying to search for a set of sequences (around 400), each 23 bp long, in different fastq files, allowing at most 1-2 mismatches. I am not sure whether treating the fastq as a genome (transcriptome) would be a good approach. I tried fastq -> fasta -> building a blast database -> running blastn, but it did not run because my query contains more than one sequence.
Part of my query.file as an example:
ATTTTTCTGAAAAACCCCCTACGA
AACAGGAAGTCAAAAAAAGCCAA
AGGATTTTTTTTTTTCTGGGGACA
The output I am aiming for is, for each sequence in my query.file, whether it has a 100% match (or a match with 1-2 mismatches) in the fastq file, and if possible where in the fastq file.
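For reference, the search described above can be written as a brute-force scan. This is only a minimal sketch, not what any of the aligners does internally: it assumes substitutions only (no indels), a plain 4-line-per-record fastq, and the function names (hamming_within, read_fastq_sequences, find_hits) are made up for illustration. It will be slow on large fastq files, but it shows exactly what "a match with at most 2 mismatches, and where" means.

```python
def hamming_within(a, b, max_mismatches):
    """Return the mismatch count of two equal-length strings,
    or None as soon as the count exceeds max_mismatches."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return None
    return mismatches


def read_fastq_sequences(path):
    """Yield (read_name, sequence) from a 4-line-per-record fastq file."""
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            yield header.strip()[1:], sequence


def find_hits(queries, reads, max_mismatches=2):
    """Return (query, read_name, offset, mismatches) for every window
    of a read that matches a query within the mismatch limit."""
    hits = []
    for name, sequence in reads:
        for query in queries:
            k = len(query)
            for offset in range(len(sequence) - k + 1):
                window = sequence[offset:offset + k]
                mm = hamming_within(query, window, max_mismatches)
                if mm is not None:
                    hits.append((query, name, offset, mm))
    return hits
```

Usage would be something like find_hits(list_of_queries, read_fastq_sequences("reads.fastq")), where "reads.fastq" is a placeholder file name. Offsets are 0-based positions within the read.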
I would appreciate your suggestions! Thank you!
You could use bowtie instead of blast. Make a fasta from the fastq, build a bowtie index from it, then align the queries against it. Bowtie has an option that controls how many mismatches are allowed in the seed (-n). As the default seed length (28 bp) is longer than your 23 bp queries, each query lies entirely within the seed, so setting the max seed mismatches to 1 or 2 should be sufficient for your goal.
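The "make a fasta from the fastq" step can be done with many existing tools. If you prefer to see what it involves, here is a minimal sketch in Python. It assumes a plain, uncompressed fastq with exactly 4 lines per record, and the function names are made up for illustration.

```python
def fastq_to_fasta_lines(fastq_lines):
    """Yield fasta lines from an iterable of fastq lines (4 lines per record)."""
    lines = iter(fastq_lines)
    for header in lines:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = next(lines, "").rstrip("\n")
        next(lines, "")  # skip the '+' separator line
        next(lines, "")  # skip the quality line
        yield ">" + header[1:]
        yield sequence


def fastq_to_fasta(fastq_path, fasta_path):
    """Convert a fastq file to a fasta file, dropping the qualities."""
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        for line in fastq_to_fasta_lines(src):
            dst.write(line + "\n")
```

Because each record is consumed as a block of four lines, a quality line that happens to start with "@" is not mistaken for a header.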
Thank you for your answer. I would like to try it, but I only have these reads in plain text format, so I cannot turn them into fastq. I think Bowtie requires reads in fastq format.
No, several formats are accepted:
Thank you! I eventually used BBDuk, but I will give bowtie a try soon with these options (-r).
I was not aware that there is such a function in BB. This BB stuff is really a jack-of-all-trades.
Hi,
Maybe you can try to align your 23 bp sequences with bwa aln against your fastq files as the reference, after converting them to fasta?
Best
Thank you for your suggestion. Would this work if my reads are in text format?
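If a tool turns out to need fasta rather than one sequence per line, the conversion is small. A minimal sketch, assuming the text file holds one sequence per line; the function name and the "query_N" record names are made up for illustration:

```python
def text_to_fasta_lines(sequence_lines, prefix="query"):
    """Yield fasta lines from an iterable holding one sequence per line."""
    count = 0
    for line in sequence_lines:
        sequence = line.strip().upper()
        if not sequence:
            continue  # skip empty lines
        count += 1
        yield f">{prefix}_{count}"  # hypothetical naming scheme
        yield sequence
```

Writing the yielded lines to a file, one per line, gives a fasta with one record per query.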