Binning Illumina reads by subread?
7.6 years ago
stacy734 ▴ 40

Hi everyone,

I have a very large file of genomic sequences and need to bin them by the presence or absence of a specific 20-mer (everything with the 20-mer in one file, everything without it in another). I tried using grep, but the sequences are multi-line, so a match can be split across a line break.

Any suggestions will be appreciated.

Stacy

next-gen illumina binning fasta • 1.6k views

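I assume your data is in FASTQ? If so, grep -B1 -A2 prints the header line before each matching sequence line and the two lines after it (the "+" line and the qualities), so every hit comes out as a complete four-line record. Note that GNU grep also prints "--" between non-adjacent groups of context lines, which has to be removed afterwards.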
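
grep only gives you the matching reads, so here is a minimal sketch that writes both bins in one pass. It assumes strict four-line FASTQ records with no tabs in the headers, and it only checks the k-mer as written (not its reverse complement). The function name bin_fastq_by_kmer and the file names are made up for illustration.

```shell
# Hypothetical helper, not part of any toolkit.
# Usage: bin_fastq_by_kmer KMER reads.fq with_kmer.fq without_kmer.fq
bin_fastq_by_kmer() {
    # Create/empty both outputs so each exists even if it receives no reads
    : > "$3"
    : > "$4"
    # paste joins every 4 input lines into one tab-separated line (one read per line)
    paste - - - - < "$2" |
    awk -F '\t' -v OFS='\n' -v kmer="$1" -v hit="$3" -v miss="$4" '{
        # index() is a plain substring search on field 2, the sequence line only
        out = (index($2, kmer) > 0) ? hit : miss
        # OFS is a newline, so this restores the original 4-line record
        print $1, $2, $3, $4 > out
    }'
}
```

Called as bin_fastq_by_kmer ACGTACGTACGTACGTACGT reads.fq with_kmer.fq without_kmer.fq, it should leave every read in exactly one of the two output files. Because only the sequence field is tested, a header or quality string can never trigger a match.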


Try bbduk.sh from the BBMap suite. If the sequences are in FASTA format, it should still work.

bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq literal=your_20_mer_sequence k=20

If the reads are paired-end:

bbduk.sh -Xmx1g in1=r1.fq.gz in2=r2.fq.gz out1=unmatched1.fq.gz out2=unmatched2.fq.gz outm1=matched1.fq.gz outm2=matched2.fq.gz literal=your_20_mer_sequence k=20


Thanks very much!

Science marches on...

Stacy

7.6 years ago
h.mon 35k

BBDuk will do what you want (and possibly more):

bbduk.sh k=20 in=genomic.fasta out=without_kmer.fasta outm=with_kmer.fasta literal=ATCGATCGATCGATCGATCG

or

bbduk.sh k=20 in=genomic.fasta out=without_kmer.fasta outm=with_kmer.fasta ref=kmer.fasta

To allow for one mismatch:

bbduk.sh k=20 hdist=1 in=genomic.fasta out=without_kmer.fasta outm=with_kmer.fasta ref=kmer.fasta
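
If BBDuk is not available, the multi-line problem from the question can also be handled with plain awk by joining each record's sequence lines before searching. This is only a rough sketch under a few assumptions: exact matching only (no mismatches), forward strand only (no reverse-complement check), and each sequence is written back on a single line. The function name bin_fasta_by_kmer is hypothetical.

```shell
# Hypothetical helper, not part of BBMap or any other toolkit.
# Usage: bin_fasta_by_kmer KMER genomic.fasta with_kmer.fasta without_kmer.fasta
bin_fasta_by_kmer() {
    # Create/empty both outputs so each exists even if it receives no records
    : > "$3"
    : > "$4"
    awk -v kmer="$1" -v hit="$3" -v miss="$4" '
        # Write the record collected so far to the matching or non-matching file
        function emit_record() {
            if (header == "") return
            # index() is a plain substring search; toupper() makes it
            # case-insensitive so soft-masked (lowercase) sequence still matches
            out = (index(toupper(seq), toupper(kmer)) > 0) ? hit : miss
            print header > out
            print seq > out
        }
        # A new header closes the previous record and starts the next one
        /^>/ { emit_record(); header = $0; seq = ""; next }
        # Sequence lines are concatenated, so k-mers split by a line break are found
        { seq = seq $0 }
        END { emit_record() }
    ' "$2"
}
```

For example, bin_fasta_by_kmer ATCGATCGATCGATCGATCG genomic.fasta with_kmer.fasta without_kmer.fasta. Each record is held in memory as one string, so this is fine for reads or contigs but may be slow for whole chromosomes.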