Mapping fasta/fastq reads against a long fasta sequence using grep

0

4.5 years ago

Seq225 ▴ 110

I have a read file (~25 nt reads; FASTA or FASTQ format). I want to extract the reads that map against long gene sequences (2 kb, FASTA format). I know I could use bowtie or other tools, but I want to use grep. Can anybody help me with the command, please? I want to allow one or two mismatches per read.

Thanks!!

sequence R alignment assembly genome • 2.2k views

ADD COMMENT • link 4.5 years ago by Seq225 ▴ 110

1

What have you tried so far, then? Use the reads as the search patterns, since your reference is the longer sequence.

ADD REPLY • link 4.5 years ago by GenoMax 147k

0

I actually have not tried anything yet. My target FASTA is a long sequence (I actually have several sequences of different lengths, and I want to search them separately). The read file has thousands (or tens of thousands) of reads. Thanks!

ADD REPLY • link 4.5 years ago by Seq225 ▴ 110

3

Then you should consider using a proper aligner. bbmap.sh can use both fasta and fastq reads to do the search.

ADD REPLY • link 4.5 years ago by GenoMax 147k

0

Thanks. I will try bbmap.sh.

ADD REPLY • link 4.5 years ago by Seq225 ▴ 110

1

If your target sequence is a FASTA file, it will have newlines every 80-100 nt, and those will break your grep regex. Allowing mismatches is also awkward in grep (where do you put them?).

You can use the fasta aligner to search properly without indexing.

ADD REPLY • link 4.5 years ago by JC 13k

0

Thanks. What if there is no line break in the FASTA sequence? Suppose it is one long line. I could also split it into ~80 nt pieces and then look for the reads that match.

ADD REPLY • link 4.5 years ago by Seq225 ▴ 110

1

You can linearize the FASTA easily: Linearize fasta files
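As a rough illustration of what linearizing means here, the following is a minimal Python sketch, not the method from the linked post. The function name `linearize_fasta` is hypothetical; it assumes plain FASTA input where header lines start with `>` and each record's sequence may be wrapped over several lines.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines into one string.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the last record.
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = [">gene1", "ACGTAC", "GTTTGA", ">gene2", "GGCC"]
    for name, seq in linearize_fasta(example):
        print(f">{name}\n{seq}")
```

With each sequence on a single line, a read can no longer be split across a line break, which is the problem raised earlier in the thread.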

ADD REPLY • link 4.5 years ago by GenoMax 147k

1

Even with the sequence on a single line, you will still have the problem of mismatches. grep is great when you want an exact hit, but allowing 1-2 mismatches gets complex. For example, to search for ACGT with up to 2 mismatches, you need to search for all of these patterns:

ACGT
.CGT
A.GT
AC.T
ACG.
..GT
.C.T
.CG.
A..T
A.G.
AC..
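To show how quickly this grows, here is a small Python sketch under stated assumptions; the function names `mismatch_patterns` and `hamming_hits` are made up for illustration. The first function enumerates the wildcard patterns for a sequence (substitutions only, no indels): for ACGT with up to 2 mismatches it yields 11 patterns, and for a 25 nt read it would yield 1 + 25 + 300 = 326. The second function skips the patterns and simply counts mismatches in every window of a linearized reference. It searches the forward strand only; reverse-complement hits are not handled.

```python
from itertools import combinations


def mismatch_patterns(seq, max_mismatches):
    """Return regex patterns for seq with 0..max_mismatches positions replaced by '.'."""
    patterns = []
    for k in range(max_mismatches + 1):
        for positions in combinations(range(len(seq)), k):
            chars = list(seq)
            for p in positions:
                chars[p] = "."  # '.' matches any single character in a regex
            patterns.append("".join(chars))
    return patterns


def hamming_hits(read, reference, max_mismatches):
    """Return 0-based start offsets where read matches reference
    with at most max_mismatches substitutions (forward strand only)."""
    hits = []
    n = len(read)
    for start in range(len(reference) - n + 1):
        window = reference[start:start + n]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(len(mismatch_patterns("ACGT", 2)))    # number of wildcard patterns
    print(hamming_hits("ACGT", "TTACGATT", 1))  # offsets with at most 1 mismatch
```

The pattern list could be written to a file and passed to `grep -f`, but for thousands of reads that means hundreds of patterns per read, which is why the direct mismatch count, or a proper aligner as suggested earlier in the thread, is usually the more practical route.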