Filtration Of Reads With Length Lower Than 30 From Bam
2
6
Entering edit mode
10.8 years ago
filipzembol ▴ 180

Dear all, I have one question how could I filtrate the reads from bam file, which have length of read lower than 30 bp. If it is lower than 30 bp, this rows will be deleted from bam file. I think I could use :

samtools view -h /home/filip/Desktop/rozdeleny\ bed_009_QCfailed/Ionfiltrovany1.bam | perl -lane '$l = 10; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 30 or /^@/' | samtools view -bS - > bar.bam

Tahnk you

read length bam • 22k views
ADD COMMENT
1
Entering edit mode

why looking at the cigar string $F[5] when you can just get the length of the SEQ ( $F[9] ) ? Is it really the LENGTH of the READ you're looking for ?

ADD REPLY
0
Entering edit mode

Oh I think it is my mistake. I think at 10th column is the all read (ACTCG...) and Could I use this syntax for the 9th column to filtrate the reads with length lower than 30 ?

ADD REPLY
0
Entering edit mode

yes, but again, why do you want to use the 9th column=CIGAR instead of the 10th=ATGC sequence ? I would understand if you only wanted the number of reference bases that the read covers, excluding padding.

ADD REPLY
0
Entering edit mode

No I would like to filter the reads which have the read length lower than 30 bp. In final bam file will be only the reads, which have length higher than 30bp. Or I badly understand of your question? I don't know if my script is ok. I am starting bioinformatician...

ADD REPLY
12
Entering edit mode
10.8 years ago
samtools view -h /home/filip/Desktop/rozdeleny\ bed_009_QCfailed/Ionfiltrovany1.bam | awk 'length($10) > 30 || $1 ~ /^@/' | samtools view -bS - > bar.bam

will give you bar.bam file with reads with length greater than 30 bp. Modified on 07/22 after dariober pointed out a bug.

ADD COMMENT
4
Entering edit mode

Shouldn't it be ... | awk 'length($10) > 30 || $1 ~ /^@/' | ... ?

ADD REPLY
0
Entering edit mode

Yup it was my bad. I should have double checked or tested it before submitting it. Thanks for pointing it out. I will modify it.

ADD REPLY
0
Entering edit mode

How would this be adjusted to give a BAM file with a range say 1-80bp for the insert size?

ADD REPLY
1
Entering edit mode
samtools view -h /home/filip/Desktop/rozdeleny\ bed_009_QCfailed/Ionfiltrovany1.bam | awk 'length($10) > 0 && length($10) < 80 || $1 ~ /^@/' | samtools view -bS - > bar.bam
ADD REPLY
8
Entering edit mode
2.6 years ago

2022 (8 years later)

samtools view -e 'length(seq)>30'  -O BAM -o out.bam in.bam
ADD COMMENT
1
Entering edit mode

thanks! This saved me quite some time, I was about to write code to calculate the aligned length from the CIGAR. But this can be done with

samtools view -e '(qlen-sclen)>30' -O BAM -o out.bam in.bam

This would include clipping, which gives you aligned length in comparison to total length of the query. Although I find the filtering description in samtools a bit misleading, because:

qlen int Alignment length: no. query bases

to me this suggests that qlen is already the aligned length, but it is apparently not. In the description it then says:

"sclen" and "hclen" are the number of soft and hard-clipped bases respectively. The formula "qlen-sclen" gives the number of sequence bases used in the alignment, distinguishing between global alignment and local alignment length.

Initially I thought why it is not (qlen-hclen-sclen) , but this quote;

When soft-clipping ( S ), these unaligned bases are still stored in the SAM file's SEQ field. With hard-clipping ( H ), however, they are not.

from here explains why ...

ADD REPLY
0
Entering edit mode

Just wondering (answer is most likely yes because it's samtools), but does this take full care of CIGAR trimming etc?

By the way, for further read of filtering expressions in samtools: http://www.htslib.org/doc/samtools.html#FILTER_EXPRESSIONS

Users should also be aware that this is a releatively recent addition. Afaik it's version 1.14 that -e was added so be sure your version is recent.

ADD REPLY
1
Entering edit mode

just wondering (answer is most likely yes because it's samtools), but does this take full care of CIGAR trimming etc?

no, I think there are some other fields:

qlen    int Alignment length: no. query bases
rlen    int Alignment length: no. reference bases
ADD REPLY
0
Entering edit mode

But note also that the docs say:

qlen" and "rlen" are measured using the CIGAR string to count the number of query (sequence) and reference bases consumed. Note "qlen" may not exactly match the length of the "seq" field if the sequence is "*".


Not that it really matters, -e filtering should have been introduced in v1.12 - but I found out about it only now thanks to this post!

ADD REPLY

Login before adding your answer.

Traffic: 2937 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6