Question

bam file filter by TLEN and output as fastq

1

Entering edit mode

10.8 years ago

14134125465346445 ★ 3.6k

I have a bam file that I want to filter by the TLEN (9th) column, then output the reads that pass filter in fastq format. I have tried combinations of awk and samtools bam2fq without much luck.

Anybody?

bam fastq samtools • 7.4k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

What cut-off do you want to set for TLEN?

ADD REPLY • link 10.8 years ago by komal.rathi ★ 4.1k

Ram · Answer 1 · 2014-10-01

4

Entering edit mode

10.8 years ago

Istvan Albert 102k

You could combine the suggestions so far into a script like so (the BAM file should be name sorted):

samtools view -h name_sorted.bam  | awk -f filter.awk | samtools view -Sb - | bedtools bamtofastq -i - -fq read1.fq
samtools view -h name_sorted.bam  | awk -f filter.awk | samtools view -Sb - | bedtools bamtofastq -i - -fq2 read2.fq

The filter.awk file is

# SIZE is the minimum template lenght
BEGIN { FS="\t"; SIZE=100; S2=SIZE*SIZE }

# Write the headers 
/^@/ { print $0; next }

# Print only long entries
{ if ($9*$9 > S2) print $0}

ADD COMMENT • link 10.8 years ago by Istvan Albert 102k

0

Entering edit mode

This is pretty neat. The approach I was using would have involved multiple steps. it was something like first separate _1 and _2 reads

samtools view input.bam | awk '{if(and($2,0x40)){print > "file_1.sam"} else {print > "file_2.sam"};}'

then separating forward and reverse strand reads, then reverse complementing them, then filtering reads based on TLEN and merging everything back into one.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

Ram · Answer 2 · 2014-09-30

1

Entering edit mode

10.8 years ago

komal.rathi ★ 4.1k

Filtering on Template length for paired-end reads (Your bam file should be name sorted)

samtools view sample.bam | \
  awk '
    BEGIN {FS='\t'}
    { if ($9 > 100 || $9 < 500) print $0}
  ' | \
  grep -v ^@ | \
  awk 'NR%2==1 {print "@"$1"\n"$10"\n+\n"$11}' > ./sample_1.fastq

samtools view sample.bam | \
  awk '
    BEGIN {FS='\t'}
    { if ($9 > 100 || $9 < 500) print $0}
  ' | \
  grep -v ^@ | \
  awk 'NR%2==0 {print "@"$1"\n"$10"\n+\n"$11}' > ./sample_2.fastq

Filtering on Template length for single-end reads

samtools view sample.bam | \
  awk '
    BEGIN {FS='\t'}
    { if ($9 > 100 || $9 < 500) print $0}
  ' | \
  grep -v ^@ | \
  awk '{print "@"$1"\n"$10"\n+\n"$11}' > ./sample.fastq

Reference here

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Can the TLEN_CUTOFF be a range? For example, between 100-500?

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

I don't think you can do that. I'll check if there is a way around this.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

14134125465346445 I have updated my answer to accommodate a range-based cutoff on read length.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

I would like to filter by TLEN, which if my understanding is correct, is the Template length, not actually the read length, is that right?

ADD REPLY • link 10.8 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

14134125465346445 Yeah you are right. I got confused between read length & template length. Maybe you should try out Ashutosh Pandey's answer.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

14134125465346445 I have updated my answer in order to address the correct problem.

ADD REPLY • link 10.8 years ago by komal.rathi ★ 4.1k

Ram · Answer 3 · 2014-09-30

0

Entering edit mode

10.8 years ago

Ashutosh Pandey 12k

You can try something like this. In order to print both reads for a read-pair you will have to use 'OR' operator as shown. I have used a threshold of 100 (highlighted in line 2) for TLEN.

samtools view input.bam | awk 'BEGIN {FS="\t"} {if ($9 > 100 || $9 < -100) print "@" $1 "\n" $10 "\n+\n" $11}' > input.fq
                                                   ^_____________________^

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k

0

Entering edit mode

this won't be identical to the original fastq file though since the SEQ field in a SAM file is displayed reverse complemented when the read aligns on the reverse strand (everything is on the forward strand).

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Istvan Albert 102k

0

Entering edit mode

Yup I totally forgot about that. I will have to either use FLAG info or polarity of TLEN (if concordant reads to be considered only) to check for the reads aligned on the reverse strand and print reverse complement seq for them. I will update my answer later.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ashutosh Pandey 12k