You can remove duplicate reads with picard tools' MarkDuplicates (link from Jorge) as follows:
java -jar MarkDuplicates.jar INPUT=my_file_input.bam OUTPUT=my_file_output.bam METRICS_FILE=my_metrics.txt REMOVE_DUPLICATES=TRUE TMP_DIR=my_tmp_dir
That being said, it depends on what is your downstream analysis as to whether its essential to remove duplicates or not. I'd suggest to do so, if you're calling SNPs. If you're doing a gene expression analysis or anything that deals with count data, then probably not a good idea. However, there might be conflicting options in this regard, I believe.
Since you also asked for explanation regarding the different columns in your read:
1. QNAME: HWI-ST478_0132:1:4:12387:52715#TGACCA - your read ID.
In case of paired end reads, you'll have them twice, assuming both reads of the pair are mapped.
2. FLAG: 163 - read paired, read mapped in proper pair, its mate maps to reverse strand, this read is second in pair.
Probably hard to grasp for biologists at the beginning. To start with therefore, use this link to interpret them. Type in your flag number and it explains what they mean.
3. RNAME: chrM - chromosome ID
4. POS: 1002 - position on the genome where the first base of this read matches (refer to the documentation reg. 1-base and 0-based coordinates)
5. MAPQ: 60 - Mapping quality
6. CIGAR: 100M - CIGAR string
The CIGAR string might require some explanation. 100M indicates match/mismatch. N indicates spliced. I is insertion into the reference
and D is deletion from the reference
. Your cigar of 100M
indicates that the read starts at 1002 and is mapped continuously for the next 100 bases. Note that still there might be mismatches. Suppose your cigar was 20M 1D 20M 60N 30M 1I 20M 100N 9M
=> read length = sum of all M/I/S operations. There is no S here. So, read length = 20+20+30+1+20+9=100.
Suppose your read started at 1002, then, let's see how the read is for this cigar string.
The first 20 bases are continuously mapped to your genome. So, 1002
to 1002+20-1
=> 1002-1021. Now there is a 1D
, if you remove the base at 1022 of your reference the next 20 bases (20M) have a nice matching sequence. So, we'll have to add the 1D as well to obtain the genomic coordinates. Going once again, starting from 1002, we have a 20M, then 1D and then a 20M again. So, 1002+20+1+20-1 = 1042. So, the read matches nicely up until this point. Then the read is spliced for 60 bases. The next 60 bases are not in your read (presumably because its transcriptome data and this is an intron location). Continuing this way, the place where the reads are mapped to the chromosome are as follows:
1002 - 1042 => first 40 bases of read mapped to chromosome with 1 base in between which is not in your read but in the genome
1043 - 1102 => spliced region
1103 - 1152 => next 50 bases match to genome. You don't add the 1I because its in the read and not in the genome.
1153 - 1252 => spliced region
1253 - 1261 => matching region again.
Coming back to the rest of the columns:
7. RNEXT: = read is paired ( * denotes single end reads)
8. PNEXT: 1059 - starting position of the next read in pair. Since your first read extends from 1002-1101, and the second read in pair starts at 1059, your reads overlap here.
9. TLEN: 157 - the distance from the left of the left-most mapped read and the right of the right most mapped read. I'd guess your other read in pair also has the same cigar string 100M, then we know it starts at 1059 and therefore it should end at 1059+100-1 = 1158. So, 1158-1002+1=157.
10. SEQ: ACGT[..]TCT - sequence
11. QUAL: HHHHEG[..]CDF - quality
12. XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 - Optional Flags,
The first 11 columns are mandatory in SAM format. From then all, all columns are optional and different softwares use different available and necessary options (some for their internal use in their pipeline downstream). XT:A:U is specific to BWA I suppose = Unique read. This read maps to this position alone. NM = number of mismatches = 0 for this read. And so on...
hope this helps!
You're great ! Thanks!