7.6 years ago
stacy734
Hi everyone,
I have a very large file of genomic sequences and need to bin them by the presence or absence of a specific 20-mer (everything with the 20-mer in one file, everything without it in another). I tried using grep, but the sequences are multi-line.
Any suggestions will be appreciated.
Stacy
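Your data is in FASTQ format, I assume? If so, you can ask grep for context lines around each match, so that the header line before the sequence and the two lines after it (the "+" line and the qualities) are printed as well:
grep -A2 -B1 your_20_mer_sequence reads.fq
Note that GNU grep prints a "--" separator line between non-adjacent groups of matches, which you would need to strip from the output.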
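If the file turns out to be multi-line FASTA rather than FASTQ, context flags will not help, because the 20-mer can span a line break. Below is a minimal sketch of one way around that, assuming a plain-text FASTA input: awk joins the wrapped lines of each record and then does a substring search. The function name split_fasta_by_kmer and its argument order are made up for illustration, and it only checks the forward strand (no reverse complement).

```shell
# Hypothetical helper (not part of any toolkit): split a multi-line FASTA file
# by the presence of a k-mer. Forward strand only; sequences are written unwrapped.
# Usage: split_fasta_by_kmer input.fa KMER matched.fa unmatched.fa
split_fasta_by_kmer() {
    # Create/empty both outputs so each file exists even if no record lands in it.
    : > "$3"
    : > "$4"
    awk -v kmer="$2" -v hit="$3" -v miss="$4" '
        function emit() {
            if (header == "") return
            # index() is a plain substring search, so no regex escaping is needed.
            out = (index(toupper(seq), toupper(kmer)) > 0) ? hit : miss
            print header > out
            print seq > out
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts
        { seq = seq $0 }                               # join wrapped sequence lines
        END { emit() }                                 # do not forget the last record
    ' "$1"
}
```

Example call, with the same placeholder as in the other commands: split_fasta_by_kmer genome.fa your_20_mer_sequence matched.fa unmatched.fa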
Try bbduk.sh from the BBMap suite. If the sequences are in FASTA format, it should still work.
bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq literal=your_20_mer_sequence k=10
If the reads are paired-end, then:
bbduk.sh -Xmx1g in1=r1.fq.gz in2=r2.fq.gz out1=unmatched1.fq.gz out2=unmatched2.fq.gz outm1=matched1.fq.gz outm2=matched2.fq.gz literal=your_20_mer_sequence k=10
Note: with k=10, BBDuk treats a read as matched if it shares any 10-mer with the literal, so reads containing only part of the 20-mer also end up in the matched file. To require the full 20-mer, use k=20. By default BBDuk also matches the reverse complement (rcomp=t).
Thanks very much!
Science marches on...
Stacy