Entering edit mode
4.5 years ago
Seq225
▴
110
I have read file (~25Nt; fasta or fastq formate). I want to extract the reads that map against a long gene sequences (2Kb, fasta formate). I want to use grep. I can use bowtie or other tools. But, I want to use grep. Anybody can help me with the command, please? I want want to allow one or two mismatches per read.
Thanks!!
What have you tried so far then? Use reads to search since your reference is longer.
I actually have not tried anything yet. My target fasta is a long sequence ( I actually have several sequences of different legths, I want to search them separately). The read file has thousands of reads (or tens of thousands) Thanks!
Then you should consider using a proper aligner.
bbmap.sh
can use both fasta and fastq reads to do the search.Thanks. I will try bbmap.sh.
If your target sequence is a Fasta file, you will have new-lines every 80-100 nt, that will break your grep regex, also allowing mismatches will be not good in grep (where do you put them?)
You can use the fasta aligner to properly search without indexing
Thanks. If there is no line break in the fasta sequence? Suppose it is a long line. I can also split it into ~80 Nt and then want to find the reads that match.
You can linearize the fasta easily : Linearize fasta files
even with the sequence in a single line, you will have the problem of the mismatches, grep is great when you want an exact hit, but if you want to allow 1-2 mismatches that becomes complex, for example, to search ACGT with up to2 mismatches, you need to search:
Thanks. I think then my option would be to do only perfect matches.