Question

meaning of XM:i:N in Bowtie output


8.2 years ago

Don ▴ 10

Hi, I am new to NGS and have been experimenting with ChIP-seq analysis using Bowtie on Galaxy. My question is: what does XM:i:N mean in the Bowtie output?

According to the Bowtie manual: "For a read with no reported alignments, <N> is 0 if the read had no alignments. If [-m] was specified and the read's alignments were suppressed because the [-m] ceiling was exceeded, <N> equals the [-m] ceiling + 1, to indicate that there were at least that many valid alignments (but all were suppressed). In [-M] mode, if the alignment was randomly selected because the [-M] ceiling was exceeded, <N> equals the [-M] ceiling + 1, to indicate that there were at least that many valid alignments (of which one was reported at random)."

I've pasted a few lines from the Bowtie output below. I noticed that when there is a unique match, the XM:i field is not shown. When the field is XM:i:1 or XM:i:0, there is no match. Shouldn't XM:i:1 mean there is a match?

Here are the Bowtie parameters:

ID:Bowtie VN:0.12.7 CL:"bowtie -q -p 6 -S -n 2 -e 70 -l 28 --maxbts 125 -k 1 -m 1 --phred33-quals /galaxy/data/hg19/hg19full/bowtie_index/hg19full /galaxy-repl/main/files/019/148/dataset_19148014.dat"

QNAME   FLAG    RNAME   POS MAPQ    CIGAR   MRNM    MPOS    ISIZE   SEQ QUAL    OPT
SRR4237722.115 HS38_18889:1:1101:10918:1975 length=50   4   *   0   0   *   *   0   0   GGGTCNCACTCTGTCACCCAGGCTGGAGTGCAGTGGCCGTGATCTCGGCT  BCCDF!2AAFFHHIJJIJIJFHIIDGIGFCEGGHIIJGIDHIIJIEEHHH  XM:i:1
SRR4237722.108 HS38_18889:1:1101:10120:1993 length=50   4   *   0   0   *   *   0   0   ATTCCATTCCACTCCATTCCATTCCAATCCTTTCCGCTCGGGTTGATTCC  CCCFFEFFGHHHGJIJJIIIIIJJJJJJIJJJIJJJJJJJII?FHIIJJJ  XM:i:0
SRR4237722.168  0   chr13   75065209    255 50M *   0   0   CACTGGAGCCTCTGATATCTTTTTTCTTTTTCCTGCTGACACAGGACTGC  CCCFFFDEHHHHHIIJJJJJJJJJJIJIGGJIJJIIJJJJJIJJJIJIJJ  XA:i:0 MD:Z:50 NM:i:0
SRR4237722.94   0   chr3    121413904   255 50M *   0   0   TGATGNAACCGATTCTGAACATGTAGGTCTTGTGCTCATACTCAGAGAGT  @@?DD!2=CFHHHFHIIF>GHIIHCGI*@FGHFGIIIIHIIHGIIIGIID  XA:i:1 MD:Z:5G44 NM:i:1

ChIP-Seq • 8.0k views

ADD COMMENT • link updated 8.2 years ago by Sej Modha 5.3k • written 8.2 years ago by Don ▴ 10


I would really appreciate any help.

ADD REPLY • link 8.2 years ago by Don ▴ 10

score 1 · Answer 1 · 2017-06-13

You can check details about these tags here:

SAM and BAM filtering one-liners

@author: David Fredman, david.fredmanAAAAAA@gmail.com (sans poly-A tail)
@dependencies: http://sourceforge.net/projects/bamtools/ and http://samtools.sourceforge.net/

Please extend with additional/faster/better solutions via a pull request!

BWA mapping (using piping for minimal disk I/O)

bwa aln -t 8 targetGenome.fa reads.fastq | bwa samse targetGenome.fa - reads.fastq\
| samtools view -bt targetGenome.fa - | samtools sort - reads.bwa.targetGenome

samtools index reads.bwa.targetGenome.bam

Count number of records (unmapped reads + each aligned location per mapped read) in a bam file:

samtools view -c filename.bam

Count with flagstat for additional information:

samtools flagstat filename.bam

Count the number of alignments (reads mapping to multiple locations counted multiple times)

samtools view -F 0x04 -c filename.bam

Count number of mapped reads (not mapped locations) for left and right mate in read pairs

samtools view -F 0x04 filename.bam | cut -f1 | sort | uniq | wc -l #either mate
samtools view -f 0x40 -F 0x4 filename.bam | cut -f1 | sort | uniq | wc -l #left mate
samtools view -f 0x80 -F 0x4 filename.bam | cut -f1 | sort | uniq  | wc -l #right mate

Remove unmapped reads, keep the mapped reads:

samtools view -F 0x04 -b in.bam > out.aligned.bam

Count UNmapped reads:

samtools view -f4 -c in.bam

Require minimum mapping quality (to retain reliably mapped reads):

samtools view -q 30 -b in.bam > aligned_reads.q30.bam
samtools view -q 30 -c in.bam #to count alignments with MAPQ >= 30

Require match to be on the sense strand of the reference (samtools flag)

samtools view -F 16

Require match to be on antisense strand (samtools flag)

samtools view -f 16
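The -f/-F filters above test individual bits of the SAM FLAG field (column 2). As a reading aid, here is a minimal sketch that decodes a FLAG value into the bits these one-liners use. The function name decode_flag and the short bit labels are made up for illustration; the bit values themselves follow the SAM specification.

```shell
# Decode a SAM FLAG integer into named bits.
# decode_flag is a hypothetical helper, not part of samtools.
decode_flag() {
  flag=$1
  out=""
  for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
              16:reverse_strand 32:mate_reverse_strand \
              64:first_in_pair 128:second_in_pair; do
    bit=${pair%%:*}    # numeric bit value before the colon
    name=${pair#*:}    # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
  done
  # Print "none" when no listed bit is set (e.g. FLAG 0: mapped, forward strand)
  echo "${out:-none}"
}
```

For example, decode_flag 4 reports the unmapped bit that -f4 selects and -F 0x04 excludes, and decode_flag 16 reports the reverse-strand bit used in the two strand filters.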

Require at least N matches at the start of the read:

export N=6 # exported so the Perl one-liner can read it via %ENV
samtools view in.bam \
| perl -lane 'next unless $F[5] =~ /^(\d+)M/; print if $1 >= $ENV{N};'

Filter by number of mismatches in BWA-generated output, using BWA-specific tags:

Tag    Meaning
NM     Edit distance
MD     Mismatching positions/bases
AS     Alignment score
BC     Barcode sequence
X0     Number of best hits
X1     Number of suboptimal hits found by BWA
XN     Number of ambiguous bases in the reference
XM     Number of mismatches in the alignment
XO     Number of gap opens
XG     Number of gap extensions
XT     Type: Unique/Repeat/N/Mate-sw
XA     Alternative hits; format: (chr,pos,CIGAR,NM;)*
XS     Suboptimal alignment score
XF     Support from forward/reverse alignment
XE     Number of supporting seeds

To keep only reads that map without any mismatches:

bamtools filter -tag XM:0 -in reads.bam -out reads.noMismatch.bam
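Note that the XM row in the table above is BWA's definition. For Bowtie 1 output, such as the SAM lines in the question, the manual excerpt quoted there gives XM:i a different meaning on reads with no reported alignment, so an XM:0 filter should not be read as "no mismatches" in that case. To check how the values are distributed among unmapped reads, a minimal sketch follows. It assumes tab-delimited SAM text on standard input, and xm_tally is a hypothetical helper name.

```shell
# Tally XM:i values among unmapped records (FLAG bit 0x4 set).
# xm_tally is a hypothetical helper, not part of samtools or Bowtie.
xm_tally() {
  awk -F'\t' '
    /^@/ { next }                        # skip SAM header lines
    int($2 / 4) % 2 == 1 {               # FLAG bit 0x4 set: unmapped
      xm = "absent"
      for (i = 12; i <= NF; i++)         # optional tags start at column 12
        if ($i ~ /^XM:i:/) xm = substr($i, 6)
      count[xm]++
    }
    END { for (v in count) print v "\t" count[v] }
  ' | sort
}
```

Typical use would be samtools view in.bam | xm_tally, which prints one line per XM value followed by the number of unmapped reads carrying it.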

Retain only uniquely mapping reads (reads with a single unambiguous mapping location):

If BWA was used, it is possible to use the BWA XT tag value U for unique (analogously, R is for repeat). I did not find a simple way to do this with samtools or bamtools, so grep to the rescue:

samtools view reads.bam | grep 'XT:A:U' | samtools view -bS -T referenceSequence.fa - > reads.uniqueMap.bam
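The grep above matches the string anywhere on the line and drops the header, which is why the reference has to be supplied again with -T. A possible variant, sketched here under the assumption of tab-delimited SAM text on standard input, passes header lines through and compares the optional tag fields exactly. keep_unique is a hypothetical helper name.

```shell
# Keep SAM header lines plus records carrying the BWA tag XT:A:U.
# keep_unique is a hypothetical helper, not part of samtools or BWA.
keep_unique() {
  awk -F'\t' '
    /^@/ { print; next }                 # pass header lines through unchanged
    {
      for (i = 12; i <= NF; i++)         # optional tags start at column 12
        if ($i == "XT:A:U") { print; next }
    }
  '
}
```

It could then be used as samtools view -h reads.bam | keep_unique | samtools view -b - > reads.uniqueMap.bam, where -h keeps the header in the stream.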

However, "uniquely mapping" is not the cleanest concept: in most scenarios a given read could be placed elsewhere, although possibly with a lower-scoring alignment. You could instead filter on mapping quality (MAPQ) to retain the "reliably mapped" reads. Different mappers have different scoring models. As a rule of thumb, minimum MAPQ values of 5 or 10 will work well. If you used bowtie/bowtie2, try:

samtools view -b -q 10 foo.bam > foo.filtered.bam
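Before settling on a threshold, it can help to look at which MAPQ values the mapper actually produced. In the Bowtie 0.12.7 lines pasted in the question, for instance, both aligned reads carry MAPQ 255, so a -q 10 cutoff would keep them both. A minimal sketch that tabulates MAPQ over mapped records, assuming tab-delimited SAM text on standard input (mapq_hist is a hypothetical helper name):

```shell
# Count mapped records per MAPQ value (SAM column 5).
# mapq_hist is a hypothetical helper, not part of samtools.
mapq_hist() {
  awk -F'\t' '
    /^@/ { next }                        # skip SAM header lines
    int($2 / 4) % 2 == 0 { count[$5]++ } # FLAG bit 0x4 clear: mapped
    END { for (q in count) print q "\t" count[q] }
  ' | sort -n
}
```

Running samtools view foo.bam | mapq_hist prints each observed MAPQ value and its record count, sorted numerically.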
