Question

Extracting reads with homopolymer errors

8.8 years ago

aparnakrish89 • 0

From the mapped reads of a BAM file, how can I extract the subset of reads that contain homopolymer k-mers (e.g., reads that contain TTT, TTTTTT, TTTTTTTT, or TTTTTTTTTTTTT)?

homopolymer errors • 2.0k views

updated 8.8 years ago by venu 7.1k • written 8.8 years ago by aparnakrish89 • 0

Would make more sense to extract those from the fastq, no?

8.8 years ago by WouterDeCoster 48k

score 0 · Answer 1 · 2016-11-04

8.8 years ago

venu 7.1k

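If you only want reads in which a nucleotide is repeated at least a given number of times, you can do the following:

samtools view -F 4 file.bam | awk -F'\t' '$10 ~ /TTTT/ {print ">"$1"\n"$10}' > foo.fa

Here, -F 4 skips unmapped reads, and a read is kept when its sequence (column 10) contains at least 4 consecutive Ts. Matching only column 10 keeps read names, quality strings, and tags from triggering false hits. Add or remove Ts in the pattern to change the minimum run length.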

As @genomax suggested, I updated the command to write the reads to a FASTA file. If converting to FASTQ is mandatory, I would suggest that the OP do some work, like finding this.
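For anyone who does need FASTQ, here is a minimal sketch of one way to do it, not a definitive implementation. It reads SAM text on stdin, tests only the SEQ column, and prints the read name, SEQ, and QUAL (column 11) as a four-line FASTQ record. The function name homopolymer_fastq and its two arguments (base, minimum run length) are hypothetical, made up for this illustration.

```shell
# Sketch: read SAM records on stdin and write FASTQ for reads whose sequence
# contains at least MIN_RUN consecutive copies of BASE.
# homopolymer_fastq is a hypothetical helper name, not a samtools command.
homopolymer_fastq() {
    base="${1:-T}"      # nucleotide to look for (default T)
    min_run="${2:-4}"   # minimum run length (default 4)
    awk -F'\t' -v base="$base" -v n="$min_run" '
        BEGIN {
            # Build the literal run, e.g. "TTTT"; any longer run contains it.
            for (i = 0; i < n; i++) run = run base
        }
        # Skip header lines, test only SEQ (column 10),
        # then emit name, SEQ, "+", and QUAL (column 11).
        !/^@/ && index($10, run) > 0 {
            print "@" $1 "\n" $10 "\n+\n" $11
        }
    '
}

# Usage (assumes samtools is installed; -F 4 drops unmapped reads):
# samtools view -F 4 file.bam | homopolymer_fastq T 4 > foo.fq
```

One caveat: a BAM file stores reads that mapped to the reverse strand reverse-complemented, so a T run in the original read shows up as an A run in column 10. If original orientation matters, it may be simpler to convert with samtools fastq first and filter the resulting FASTQ.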

8.8 years ago by venu 7.1k

You may want to extend this to generate reads in FASTQ format, which is what the OP likely wants.

8.8 years ago by GenoMax 153k