I have a bam file that I want to filter by the TLEN (9th) column, then output the reads that pass filter in fastq format. I have tried combinations of awk and samtools bam2fq without much luck.
Anybody?
I have a bam file that I want to filter by the TLEN (9th) column, then output the reads that pass filter in fastq format. I have tried combinations of awk and samtools bam2fq without much luck.
Anybody?
You could combine the suggestions so far into a script like so (the BAM file should be name sorted):
samtools view -h name_sorted.bam | awk -f filter.awk | samtools view -Sb - | bedtools bamtofastq -i - -fq read1.fq samtools view -h name_sorted.bam | awk -f filter.awk | samtools view -Sb - | bedtools bamtofastq -i - -fq2 read2.fq
The filter.awk file is
# SIZE is the minimum template lenght BEGIN { FS="\t"; SIZE=100; S2=SIZE*SIZE } # Write the headers /^@/ { print $0; next } # Print only long entries { if ($9*$9 > S2) print $0}
This is pretty neat. The approach I was using would have involved multiple steps. it was something like first separate _1
and _2
reads
samtools view input.bam | awk '{if(and($2,0x40)){print > "file_1.sam"} else {print > "file_2.sam"};}'
then separating forward and reverse strand reads, then reverse complementing them, then filtering reads based on TLEN and merging everything back into one.
Filtering on Template length for paired-end reads (Your bam file should be name sorted)
samtools view sample.bam | \
awk '
BEGIN {FS='\t'}
{ if ($9 > 100 || $9 < 500) print $0}
' | \
grep -v ^@ | \
awk 'NR%2==1 {print "@"$1"\n"$10"\n+\n"$11}' > ./sample_1.fastq
samtools view sample.bam | \
awk '
BEGIN {FS='\t'}
{ if ($9 > 100 || $9 < 500) print $0}
' | \
grep -v ^@ | \
awk 'NR%2==0 {print "@"$1"\n"$10"\n+\n"$11}' > ./sample_2.fastq
Filtering on Template length for single-end reads
samtools view sample.bam | \
awk '
BEGIN {FS='\t'}
{ if ($9 > 100 || $9 < 500) print $0}
' | \
grep -v ^@ | \
awk '{print "@"$1"\n"$10"\n+\n"$11}' > ./sample.fastq
Reference here
14134125465346445 I have updated my answer to accommodate a range-based cutoff on read length.
14134125465346445 Yeah you are right. I got confused between read length & template length. Maybe you should try out Ashutosh Pandey's answer.
14134125465346445 I have updated my answer in order to address the correct problem.
You can try something like this. In order to print both reads for a read-pair you will have to use 'OR' operator as shown. I have used a threshold of 100 (highlighted in line 2) for TLEN.
samtools view input.bam | awk 'BEGIN {FS="\t"} {if ($9 > 100 || $9 < -100) print "@" $1 "\n" $10 "\n+\n" $11}' > input.fq
^_____________________^
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
What cut-off do you want to set for TLEN?