Samtools Filter Reads Cigar Field
11.3 years ago
dfernan ▴ 770

Hi,

I have a bam file that I'd like to filter. I want to filter out all the reads that were aligned to intronic regions - i.e., CIGAR field containing an N.

Anyone familiar with a way to filter out reads with CIGAR field containing an N?

Note: I could convert all the BAM files to SAM and then write my own custom script, but I was wondering whether it's possible with samtools or Picard tools directly. I couldn't find any direct instructions.

Note2: The bam was generated by aligning mRNA-Seq to the genome.

Please let me know.

samtools cigar • 16k views
samtools view -h input.bam | awk '$6 !~ /N/ || $1 ~ /^@/' | samtools view -bS - > not-n-output.bam
samtools view -h input.bam | awk '$6 ~ /N/ || $1 ~ /^@/' | samtools view -bS - > yes-n-output.bam

I was also getting unmapped reads in the output, so I added a condition to make sure $6 is a real CIGAR string, i.e. the read is mapped but unspliced:

samtools view -h $BAM | awk '($6 !~ /N/ && $6 != "*") || $1 ~ /^@/' | samtools view -bS - > $BAM.unspliced.bam

@brentp's script will also work, with a slight modification to filter out unmapped reads (flag bit 0x4):

samtools view -h -F 4 $BAM | awk '$6 !~ /N/ || $1 ~ /^@/' | samtools view -bS - > $BAM.unspliced.bam
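
To check the awk conditions before running them on a real BAM, here is a minimal sketch that applies the same tests to a small hand-written SAM file. The records, read names and temporary file names are invented for illustration, and it needs only sh and awk rather than samtools. It keeps header lines in both outputs, as the commands above do.

```shell
#!/bin/sh
# Toy SAM file (records invented for illustration); fields are tab-separated.
tmpdir=$(mktemp -d)
sam="$tmpdir/toy.sam"
{
  printf '@HD\tVN:1.6\tSO:unsorted\n'
  printf '@SQ\tSN:chr1\tLN:10000\n'
  printf 'read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*\n'         # mapped, unspliced
  printf 'read2\t0\tchr1\t200\t60\t20M300N30M\t*\t0\t0\t*\t*\n'  # mapped, spliced
  printf 'read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n'                 # unmapped, CIGAR is "*"
  printf 'read4\t16\tchr1\t900\t60\t10S40M\t*\t0\t0\t*\t*\n'     # mapped, soft-clipped
} > "$sam"

# Keep header lines, plus mapped reads whose CIGAR has no N operation.
awk -F'\t' '/^@/ || ($6 != "*" && $6 !~ /N/)' "$sam" > "$tmpdir/unspliced.sam"

# Keep header lines, plus reads whose CIGAR contains an N operation.
awk -F'\t' '/^@/ || $6 ~ /N/' "$sam" > "$tmpdir/spliced.sam"

# Count alignment records (non-header lines) in each output.
unspliced_reads=$(grep -vc '^@' "$tmpdir/unspliced.sam")
spliced_reads=$(grep -vc '^@' "$tmpdir/spliced.sam")
echo "unspliced: $unspliced_reads, spliced: $spliced_reads"
```

With these four toy reads, read1 and read4 end up in the unspliced file, read2 in the spliced file, and the unmapped read3 in neither.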

@brentp, thanks, looks like what I was looking for. I thought I could do it using samtools directly (no conversion to sam) but this is good enough.


Do you actually want to filter out reads aligning to introns or reads that are spliced? Filtering out reads with N in the cigar string will filter spliced reads, which will typically not map to an intron but, rather, span over them.


@dpryan, thanks. Yes, I'd like to filter out reads that are spliced, i.e., reads that span an intron.

11.3 years ago
Rm 8.3k

Does the BAM represent an mRNA-to-genome alignment? If so, an N in the CIGAR string can be used to identify potential introns; otherwise, the interpretation of N is not defined.
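
As a rough illustration of how an N operation maps to a candidate intron, the sketch below walks a CIGAR string in awk and prints the reference interval skipped by each N. The input record is made up, and the output layout (chromosome, start, end, 1-based inclusive) is just a choice for this example. It relies only on the SAM rule that M, D, N, = and X consume reference bases while I, S, H and P do not.

```shell
#!/bin/sh
# Hypothetical record: read starting at 1-based position 200 with CIGAR 20M300N30M.
introns=$(
  printf 'read2\t0\tchr1\t200\t60\t20M300N30M\t*\t0\t0\t*\t*\n' |
  awk -F'\t' -v OFS='\t' '
  $6 != "*" {
      pos = $4      # 1-based leftmost reference position of the alignment
      cigar = $6
      # Consume the CIGAR one operation at a time, e.g. "20M", then "300N".
      while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
          n  = substr(cigar, 1, RLENGTH - 1) + 0
          op = substr(cigar, RLENGTH, 1)
          if (op == "N")
              print $3, pos, pos + n - 1   # skipped region, 1-based inclusive
          if (op ~ /[MDN=X]/)              # operations that consume reference
              pos += n
          cigar = substr(cigar, RLENGTH + 1)
      }
  }'
)
echo "$introns"
```

For this record the read covers 200-219, skips 220-519, and resumes at 520, so the single line printed is chr1, 220, 519. On a real file the printf line would be replaced by something like samtools view input.bam, with the caveat from above that the skipped region is only a potential intron.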


@Rm, yes, that's correct. The BAM represents an mRNA-to-genome alignment.


Use the Bio::DB::Sam Perl module (from Lincoln D. Stein's Bio-SamTools distribution) to parse the CIGAR strings.
