This is my scenario: I am doing whole genome amplification from human cells, and then doing NGS with an Illumina platform. I get bam files from the machine. I am trying to find out how many times the mitochondrial DNA is represented in each sample. I know that mitochondrial DNA gets amplified during my whole genome amplification (I confirmed with the manufacturer of the kit). I've gotten to the point where I downloaded IGV and can map the reads to the human mitochondrial genome. I can see the mapped reads. But I cannot figure out how to get a specific value, i.e. a count of the specific number of mapped reads in each sample. The idea is to be able to tell if one sample had more mitochondrial DNA than another sample.
I am working on a Windows 7 64-bit computer. (But could have access to a mac as well)
Thanks so much any help is greatly appreciated!
Hope you are mapping reads to other parts of the genome as well? Meaning: whole (human?) genome. Because if all what you got are BAM files with reads mapped to just 16kb MT, the number of reads mapped will not tell you much about number of MT per cell. For that you will need a number of reads mapped to all chromosomes and to MT.
Also: Which mapper?
Yes, the Illumina-sponsored software I am using gives me an absolute number of reads mapped to autosomes. So I was going to divide the number of reads to mitochondrial DNA by the number I get for autosomal mapped reads. The Mapper is called Bluefuse. I can map all my reads to the human genome, but I don't have the option to map to the mitochondrial genome. That would have been easiest.
Because that would be pointless (unless your question is: did I got mitochondrial DNA in my sample?). The input FASTQ files differ in the total number of reads and in their quality. The two numbers:
MT_mapped_in_A
vsMT_mapped_in_B
(assuming that taking 16kb out of 3bilons bases will not screw up the mappings). So you can have A tissue rich with mitochondria but with low DNA yield and really crappy sequencing vs B tissue having lower number of MT per cell, but on which the sun was shining the day(s) it was prepared and sequenced. So you gotMT_mapped_in_B
>MT_mapped_in_A
. Not so great, no?I am trying to understand the point you are making. Would it be ok to do the comparison if I use BAM files only? For example SampleA_Reads in MT/SampleA_Reads in Human Genome VERSUS SampleB_Reads in MT/SampleB_Reads in Human Genome. If the first is a larger number than the second, I can assume there is more mtDNA in the first sample, correct?
Yes, you are right, this will give you the idea about the numbers of MT/cell in samples. If you want to get exact (well..) numbers, you will have to take into account the size of nuclear genome (2x3000Mb or rather 2x what is the non-MT contigs length in your BAM header SQ lines) vs MT (16kb). It is a bit more complicated (look for Genomic alignment and mappability), but better lets solve one issue at the time.
MT sequence may have different names depending which genome version you are using. Are you able to load any of your BAM files (preferably the smallest) in IGV on your Windows, and check how chromosomes/MT sequences are named? Meaning i.e: chr1 vs 1, and chrMT vs MT.
Probably booting your PC using LiveCD with Ubuntu/Lubuntu Linux, installing samtools using Synaptic (or apt-get) and processing your BAM files i.e. located on some USB drive will be hopefully the fastest and easiest route. Just check if BAMs you got out of Bluefuse have been sorted and indexed (files with extension .bai). If not, give us the total number of BAMs and their average size. Because using some social skills and getting access to some Linux server/workstation makes more sense than fighting with terabyte of results on i.e 5 years old laptop for days on end.
In any case, google/check once on Linux with samtools installed:
(it will give you the text output for all contigs/chromosomes).
You will need a bit of command-line scripting for processing say more than 10 BAMs and few installed on Linux utilities (grep, awk) to grab the numbers from the samtools output.
I checked, and from Bluefuse I three files for each sample: .fastq, .bam and .bam.bai
Form your comment above, I gather that using the .bam.bai would be best
You need both. Usage:
samtools idxstats <in.bam>
Samtools will look for in.bam.bai See: http://www.htslib.org/doc/samtools.html