Question

Filtration Of Reads With Length Lower Than 30 From Bam

6

10.7 years ago

filipzembol ▴ 180

Dear all, I have one question: how can I filter out reads shorter than 30 bp from a BAM file? Any read shorter than 30 bp should be removed from the BAM file. I think I could use:

samtools view -h /home/filip/Desktop/rozdeleny\ bed_009_QCfailed/Ionfiltrovany1.bam | perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 30 or /^@/' | samtools view -bS - > bar.bam

Thank you

read length bam • 22k views

ADD COMMENT • link updated 6 months ago by Hagen ▴ 10 • written 10.7 years ago by filipzembol ▴ 180

1

Why look at the CIGAR string ($F[5]) when you can just get the length of SEQ ($F[9])? Is it really the LENGTH of the READ you're looking for?

ADD REPLY • link 10.7 years ago by Pierre Lindenbaum 164k

0

Oh, I think that is my mistake. I think the 10th column holds the whole read (ACTCG...). Could I use this syntax on the 6th column to filter out the reads shorter than 30 bp?

ADD REPLY • link 10.7 years ago by filipzembol ▴ 180

0

Yes, but again, why do you want to use the 6th column (CIGAR) instead of the 10th (the ATGC sequence)? I would understand if you only wanted the number of reference bases that the read covers, excluding padding.

ADD REPLY • link 10.7 years ago by Pierre Lindenbaum 164k
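To make the distinction in this exchange concrete, here is a minimal sketch (not anyone's posted solution) that compares the two quantities on a single made-up SAM record. The record, its read name and its coordinates are hypothetical. It assumes a POSIX awk with `match()`. Summing the M, D, N, = and X operations of the CIGAR gives the number of reference bases covered, while `length($10)` gives the read length stored in SEQ.

```shell
# Hypothetical SAM record: 20 bases aligned, a 15-base deletion, 10 more bases aligned.
# SEQ holds 30 bases, but the alignment spans 45 reference bases.
record=$(printf 'r1\t0\tchr1\t100\t60\t20M15D10M\t*\t0\t0\tACGTACGTACACGTACGTACACGTACGTAC\t*')

lengths=$(printf '%s\n' "$record" | awk -F'\t' '{
    cigar = $6; ref = 0
    # Walk the CIGAR one operation at a time, e.g. "20M", then "15D", then "10M".
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        n  = substr(cigar, 1, RLENGTH - 1) + 0
        op = substr(cigar, RLENGTH, 1)
        if (op ~ /[MDN=X]/) ref += n    # operations that consume reference bases
        cigar = substr(cigar, RLENGTH + 1)
    }
    # First number: read length from SEQ. Second number: reference bases covered.
    print length($10), ref
}')
echo "$lengths"
```

With a length cutoff of 30, this record would be judged differently depending on which of the two numbers is used, which is why the choice of column matters.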

0

No, I would like to filter out the reads that are shorter than 30 bp. The final BAM file should contain only the reads longer than 30 bp. Or have I misunderstood your question? I don't know if my script is OK. I am a beginner bioinformatician...

ADD REPLY • link 10.7 years ago by filipzembol ▴ 180

Ram · Answer 1 · 2014-02-11

12

10.7 years ago

Ashutosh Pandey 12k

samtools view -h /home/filip/Desktop/rozdeleny\ bed_009_QCfailed/Ionfiltrovany1.bam | awk 'length($10) > 30 || $1 ~ /^@/' | samtools view -bS - > bar.bam

will give you a bar.bam file containing only the reads longer than 30 bp. Modified on 07/22 after dariober pointed out a bug.

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.7 years ago by Ashutosh Pandey 12k

4

Shouldn't it be ... | awk 'length($10) > 30 || $1 ~ /^@/' | ... ?

ADD REPLY • link 10.3 years ago by dariober 15k

0

Yup it was my bad. I should have double checked or tested it before submitting it. Thanks for pointing it out. I will modify it.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Ashutosh Pandey 12k

0

How would this be adjusted to give a BAM file with a range say 1-80bp for the insert size?

ADD REPLY • link 5.9 years ago by rbronste ▴ 420

1

samtools view -h /home/filip/Desktop/rozdeleny\ bed_009_QCfailed/Ionfiltrovany1.bam | awk 'length($10) > 0 && length($10) < 80 || $1 ~ /^@/' | samtools view -bS - > bar.bam

ADD REPLY • link 4.8 years ago by oia8 ▴ 10

0

What is the "$1 ~ /^@/'" part for? Thanks!

ADD REPLY • link 2.4 years ago by I0110 ▴ 160

1

$1: https://www.gnu.org/software/gawk/manual/gawk.html#Fields

/^@/: https://www.gnu.org/software/gawk/manual/gawk.html#Regexp-Usage

~ : https://www.gnu.org/software/gawk/manual/gawk.html#index-_007e-_0028tilde_0029_002c-_007e-operator

ADD REPLY • link 2.4 years ago by Pierre Lindenbaum 164k
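In short, `$1 ~ /^@/` is true when the first field starts with `@`, which is how SAM header lines begin. A small sketch on made-up SAM text (all names and values are hypothetical) shows what happens with and without that test. It runs awk only, so samtools is not needed to try it.

```shell
# Hypothetical SAM text: two header lines, one 5-base read, one 35-base read.
sam=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\nshort\t0\tchr1\t1\t60\t5M\t*\t0\t0\tACGTA\t*\nlong\t0\tchr1\t10\t60\t35M\t*\t0\t0\tACGTACGTACGTACGTACGTACGTACGTACGTACG\t*\n')

# Without the header test: these header lines have no 10th field, so they are dropped.
no_header=$(printf '%s\n' "$sam" | awk -F'\t' 'length($10) > 30')

# With the header test: any line whose first field starts with "@" always passes.
with_header=$(printf '%s\n' "$sam" | awk -F'\t' 'length($10) > 30 || $1 ~ /^@/')

echo "--- without header test ---"; printf '%s\n' "$no_header"
echo "--- with header test ---";    printf '%s\n' "$with_header"
```

Keeping the header matters here because the filtered text is piped back into `samtools view -bS`, which generally needs the header lines to write the BAM file.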

score 8 · Answer 2 · 2022-04-27

8

2.5 years ago

Pierre Lindenbaum 164k

2022 (8 years later)

samtools view -e 'length(seq)>30'  -O BAM -o out.bam in.bam

ADD COMMENT • link 2.5 years ago by Pierre Lindenbaum 164k

1

thanks! This saved me quite some time, I was about to write code to calculate the aligned length from the CIGAR. But this can be done with

samtools view -e '(qlen-sclen)>30' -O BAM -o out.bam in.bam

This takes soft clipping into account, giving you the aligned length rather than the total length of the query. I do find the description of the filter fields in samtools a bit misleading, though, because:

qlen int Alignment length: no. query bases

to me this suggests that qlen is already the aligned length, but it is apparently not. In the description it then says:

"sclen" and "hclen" are the number of soft and hard-clipped bases respectively. The formula "qlen-sclen" gives the number of sequence bases used in the alignment, distinguishing between global alignment and local alignment length.

Initially I wondered why it is not (qlen-hclen-sclen), but this quote:

When soft-clipping ( S ), these unaligned bases are still stored in the SAM file's SEQ field. With hard-clipping ( H ), however, they are not.

from here explains why ...

ADD REPLY • link 6 months ago by Hagen ▴ 10
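The soft versus hard clipping point can be checked by hand on a couple of CIGAR strings. The sketch below is an awk re-implementation of my reading of the quoted documentation, not samtools' own code, and the helper name `clip_lengths` is hypothetical. It counts query bases consumed (M, I, S, = and X), soft-clipped bases and hard-clipped bases, then prints `qlen sclen hclen qlen-sclen`.

```shell
# Hypothetical helper: takes one CIGAR string and prints "qlen sclen hclen aligned".
clip_lengths() {
    printf '%s\n' "$1" | awk '{
        cigar = $0; qlen = 0; sclen = 0; hclen = 0
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            n  = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op ~ /[MIS=X]/) qlen += n   # operations that consume query (SEQ) bases
            if (op == "S") sclen += n       # soft-clipped bases are still present in SEQ
            if (op == "H") hclen += n       # hard-clipped bases are absent from SEQ
            cigar = substr(cigar, RLENGTH + 1)
        }
        print qlen, sclen, hclen, qlen - sclen
    }'
}

soft=$(clip_lengths 5S30M5S)   # same alignment, ends soft-clipped
hard=$(clip_lengths 5H30M5H)   # same alignment, ends hard-clipped
echo "soft: $soft"
echo "hard: $hard"
```

In both cases the last number, the aligned length, comes out the same. Hard-clipped bases never enter the query count in the first place, so subtracting them as well would undercount.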

0

Just wondering (answer is most likely yes because it's samtools), but does this take full care of CIGAR trimming etc?

By the way, for further read of filtering expressions in samtools: http://www.htslib.org/doc/samtools.html#FILTER_EXPRESSIONS

Users should also be aware that this is a relatively recent addition. Afaik it's version 1.14 that -e was added so be sure your version is recent.

ADD REPLY • link 2.5 years ago by ATpoint 85k

1

just wondering (answer is most likely yes because it's samtools), but does this take full care of CIGAR trimming etc?

no, I think there are some other fields:

qlen    int Alignment length: no. query bases
rlen    int Alignment length: no. reference bases

ADD REPLY • link 2.5 years ago by Pierre Lindenbaum 164k

0

But note also that the docs say:

"qlen" and "rlen" are measured using the CIGAR string to count the number of query (sequence) and reference bases consumed. Note "qlen" may not exactly match the length of the "seq" field if the sequence is "*".

Not that it really matters, -e filtering should have been introduced in v1.12 - but I found out about it only now thanks to this post!

ADD REPLY • link 2.5 years ago by dariober 15k