From the mapped reads of a BAM file, how can I extract the subset of reads that contain k-mers (e.g., reads that contain TTT, TTTTTT, TTTTTTTT, TTTTTTTTTTTTT)?
If you only want reads in which a nucleotide is repeated at least a given number of times, you can match on the sequence column (column 10), so that read names, quality strings and tags are not searched:
samtools view file.bam | awk -F'\t' '$10 ~ /T{4,}/ {print ">"$1"\n"$10}' > foo.fa
Here, the nucleotide T must occur at least 4 times consecutively. You can change the number (and the nucleotide) as required.
As @genomax suggested, I have updated the command to write a FASTA file of the matching reads. If converting to FASTQ is mandatory, I would suggest that the OP do some of the work, like finding this.
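For FASTQ output, here is a minimal sketch rather than a definitive solution. It assumes standard SAM text on stdin, with the read name in column 1, the sequence in column 10 and the qualities in column 11. The function name kmer_reads_to_fastq and its two arguments (nucleotide, minimum run length) are made up for illustration.

```shell
# Hypothetical helper: reads SAM records on stdin and writes FASTQ for reads
# whose SEQ (column 10) contains at least <min> consecutive copies of <nt>.
kmer_reads_to_fastq() {
    awk -F'\t' -v nt="$1" -v min="$2" '
        BEGIN { for (i = 0; i < min; i++) run = run nt }  # build e.g. "TTTT"
        /^@/  { next }                                    # skip SAM header lines, if any
        index($10, run) { print "@" $1 "\n" $10 "\n+\n" $11 }
    '
}

# Usage (assumes samtools is installed and file.bam exists):
# samtools view file.bam | kmer_reads_to_fastq T 4 > foo.fq
```

Keep in mind that SAM/BAM stores reads aligned to the reverse strand as their reverse complement, so a run of T in the original read shows up as a run of A in those records. Secondary or supplementary alignments can also make a read name appear more than once.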
Would make more sense to extract those from the fastq, no?