Working with 1000 Genome Project bam files (NA12878, pilot2), I'd need to know how the reads have been mapped and in particular if there are "unique mapping" for all these reads or if reads have multiple possible mapping.
Working with 1000 Genome Project bam files (NA12878, pilot2), I'd need to know how the reads have been mapped and in particular if there are "unique mapping" for all these reads or if reads have multiple possible mapping.
Are there any @PG lines in your BAM file header?
samtools view -H file.bam | grep "^@PG"
This displays only the @PG lines of the header; if any.
I have downloaded NA12878 once from the Broad and it seems that the the answer is 'bwa 0.5.7' GATK realigner seems to have been used as well.
And if it was indeed mapped with BWA you can check for the 'X0' and 'X1' tag in the alignment (if present, number of best hits ans suboptimal hits respectively)
I think NA12878 was sequenced using Illumina and mapped by bwa with following commands->
bwa index -a bwtsw $ref (where $ref is the reference fasta file)
bwa aln -q 15 $ref $fq > $sai
bwa samse $ref $sai $fq > $bam
You can find the information in section B Alignment Process, here.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
No @PG in header. I had verified this point already. Thanks anyway.
Regarding the X0, X1, this is a very good idea, but unfortunately it does not seem to be present (I assume, according to SAM format specifications that "X0" or "X1" might be present): samtools view NA12878.chrom1.SLX.maq.SRP000032.2009_07.bam | grep X0