Filtering nth length from mapped SAM/BAM file.
2
1
Entering edit mode
9.4 years ago
unique379 ▴ 120

Hi guys,

I am trying to filter reads less than 18nt from my mapped BAM file and then want to get back into SAM format but there is problem while converting back. To do so I did...

samtools view mapped.bam | awk '$6 >= 18' > filt.bam
samtools view -h -o out.sam filt.bam

[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[main_samview] fail to read the header from "filt.bam".

???

I guess when I filter my reads, it becomes text instead BAM, right?

Help me please.

Note: My original SAM had header information.

bash RNA-Seq miRNAseq shell • 8.3k views
ADD COMMENT
1
Entering edit mode

read length or alignment lenth ?

ADD REPLY
0
Entering edit mode

As I am already using mapped.bam so its obvious. its aligned reads.

ADD REPLY
1
Entering edit mode

Some aligners using soft-clipping, which let the read length untouched but the alignment length can be shorter (see e.g. here)

ADD REPLY
1
Entering edit mode
9.4 years ago

Using my tool samjs: https://github.com/lindenb/jvarkit/wiki/SamJS

java -jar distsamjs.jar -e '!record.readUnmappedFlag  && record.cigar.referenceLength  >= 18 ' input.bam | samtools view -Sb -o out.bam -
ADD COMMENT
0
Entering edit mode

Dear Pierre,

I have tried your tool to filter mappings with a minimal mapped length. I see that your filtering is based on the mapped length measured on the reference. I have encountered mapping with CIGAR looking like this 106S8M1D21M14S. Based on your command the mapping does not get filtered out when I set the value to >=30. This is correct since you measure by the mapping length calculated from the reference. The mapped read length measured on the reference is 30 (8M+1D+21M). But, if we would count the mapped length from the read itself we would get just 29 (8M+21M). I could imagine an extreme case with special mappings (which would be reported for whatever reason) with a lot of insertions/deletions "closed" between several mapped regions. I assume your filtering would keep such mappings because the mapped length measured by the reference would be still OK. However, I would not consider such mappings as "correct" which I would like to keep. FInaly for my question - Is there an option in your tool to filter by mapped length calculated from the read itself and not from the reference?

ADD REPLY
1
Entering edit mode
record.getReadLength()>=18
ADD REPLY
0
Entering edit mode
9.4 years ago
michael.ante ★ 3.9k

It seems, you try to use the cigar field to get the read length. Just have a look at your data:

samtools view mapped.bam | cut -f 6 | head

In case you have a splice-aware mapper, you'll get something like :

46M
94M
10M100N84M
61M
94M
94M

In this case, you have to use a cigar parser, as mentioned e.g. here.

If you have no junctions and no soft-clipping you can go for

samtools view mapped.bam | awk 'length($10) >= 18' > filt.sam

Denote, that you need the header :

samtools view -H mapped.bam > header.txt
cat header.txt filt.sam | samtools view -bSh -o out.bam -

Cheers,
Michael

ADD COMMENT
0
Entering edit mode

Thank Michael for your elegant answer, but do tell me, filtered read would have no affect in header? As you have concatenate the previous header from original SAM into filt.bam?

ADD REPLY
1
Entering edit mode

With a simple samtools view command, you'll skip the header. You can force to print the header with samtools view -h ... . Then, you'll need to change the filtering to something like

samtools view -h mapped.bam | awk '/^@/ || length($10) >= 18' > filt.sam
ADD REPLY

Login before adding your answer.

Traffic: 1729 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6