Question

Bam File Output Format .. Don'T Understand!

2

Entering edit mode

12.5 years ago

madkitty ▴ 690

Here are2 lines of our alignment. I need to eliminate redundancy in this file so I deleted records that had the same start and end position. Now someone told me that we shouldn't do that without considering read pair 1 and 2 (positive and negative strand)

Here are two reads as example ..

HWI-ST478_0132:1:4:12387:52715#TGACCA    163    chrM    1002    60    100M    =    1059    157 ACGT[..]TCT    HHHHEG[..]CDF    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100    \N 
HWI-ST478_0132:2:5:5986:32346#TGACCA    163    chrM    1002    60    100M    =    1092    190    ACGT[..]TCT    HHHHEG[..]CD    XT:A:U    NM:i:0    SM:i:37    AM:i:37    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:100    \N

I went there http://samtools.sourceforge.net/SAM1.pdf and tried to understand the meaning of each column but it doesn't fit with our data. Can anyone explain to me how do you know which one is read 1and read 2 (a read pair...) Looking at it I just see a bunch of numbers .

Thanks :)

bam samtools bwa • 11k views

ADD COMMENT • link updated 12.5 years ago by Arun 2.4k • written 12.5 years ago by madkitty ▴ 690

score 13 · Answer 1 · 2012-05-29

You can remove duplicate reads with picard tools' MarkDuplicates (link from Jorge) as follows:

java -jar MarkDuplicates.jar INPUT=my_file_input.bam OUTPUT=my_file_output.bam METRICS_FILE=my_metrics.txt REMOVE_DUPLICATES=TRUE TMP_DIR=my_tmp_dir

That being said, it depends on what is your downstream analysis as to whether its essential to remove duplicates or not. I'd suggest to do so, if you're calling SNPs. If you're doing a gene expression analysis or anything that deals with count data, then probably not a good idea. However, there might be conflicting options in this regard, I believe.

Since you also asked for explanation regarding the different columns in your read:

1. QNAME: HWI-ST478_0132:1:4:12387:52715#TGACCA - your read ID.

In case of paired end reads, you'll have them twice, assuming both reads of the pair are mapped.

2. FLAG: 163 - read paired, read mapped in proper pair, its mate maps to reverse strand, this read is second in pair.

Probably hard to grasp for biologists at the beginning. To start with therefore, use this link to interpret them. Type in your flag number and it explains what they mean.

3. RNAME: chrM - chromosome ID
4. POS: 1002 - position on the genome where the first base of this read matches (refer to the documentation reg. 1-base and 0-based coordinates)
5. MAPQ: 60 - Mapping quality
6. CIGAR: 100M - CIGAR string

The CIGAR string might require some explanation. 100M indicates match/mismatch. N indicates spliced. I is insertion into the reference and D is deletion from the reference. Your cigar of 100M indicates that the read starts at 1002 and is mapped continuously for the next 100 bases. Note that still there might be mismatches. Suppose your cigar was 20M 1D 20M 60N 30M 1I 20M 100N 9M => read length = sum of all M/I/S operations. There is no S here. So, read length = 20+20+30+1+20+9=100. Suppose your read started at 1002, then, let's see how the read is for this cigar string.

The first 20 bases are continuously mapped to your genome. So, 1002 to 1002+20-1 => 1002-1021. Now there is a 1D, if you remove the base at 1022 of your reference the next 20 bases (20M) have a nice matching sequence. So, we'll have to add the 1D as well to obtain the genomic coordinates. Going once again, starting from 1002, we have a 20M, then 1D and then a 20M again. So, 1002+20+1+20-1 = 1042. So, the read matches nicely up until this point. Then the read is spliced for 60 bases. The next 60 bases are not in your read (presumably because its transcriptome data and this is an intron location). Continuing this way, the place where the reads are mapped to the chromosome are as follows:

1002 - 1042 => first 40 bases of read mapped to chromosome with 1 base in between which is not in your read but in the genome
1043 - 1102 => spliced region
1103 - 1152 => next 50 bases match to genome. You don't add the 1I because its in the read and not in the genome. 
1153 - 1252 => spliced region
1253 - 1261 => matching region again.

Coming back to the rest of the columns:

7. RNEXT: = read is paired ( * denotes single end reads)
8. PNEXT: 1059 - starting position of the next read in pair. Since your first read extends from 1002-1101, and the second read in pair starts at 1059, your reads overlap here.
9. TLEN: 157 - the distance from the left of the left-most mapped read and the right of the right most mapped read. I'd guess your other read in pair also has the same cigar string 100M, then we know it starts at 1059 and therefore it should end at 1059+100-1 = 1158. So, 1158-1002+1=157. 
10. SEQ: ACGT[..]TCT - sequence
11. QUAL: HHHHEG[..]CDF - quality
12. XT:A:U  NM:i:0  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:100  - Optional Flags,

The first 11 columns are mandatory in SAM format. From then all, all columns are optional and different softwares use different available and necessary options (some for their internal use in their pipeline downstream). XT:A:U is specific to BWA I suppose = Unique read. This read maps to this position alone. NM = number of mismatches = 0 for this read. And so on...

hope this helps!

score 3 · Answer 2 · 2012-05-29

eliminating redundancy from a bam file is definitely not a straightforward thing to do. if you are thinking about removing duplicates, which sounds somehow similar to what you are mentioning (same start and end, although not only that metrics are the ones to consider), samtools offers you the rmdup option, although the current reference algorithm for duplicate removal is Picard tools' MarkDuplicates, which in fact takes into account pairing information. I would definitely give it a try if you haven't come across it yet.