I am currently performing an analysis on pancan TCGA wxs data (>10000 normal-tumor pairs) . For my analysis I need to have the total coverage of each BAM file, so I can perform a depth normalization for on tumor vs matched normal sample.
Does anyone know where can I get total read number without downloading ~10000 samples BAM files?
My initial idea was to find the .bam.bai files and use them to find the total number of reads, but I couldn't find those files either.
If you already granted access to TCGA raw files, create a manifest for those samples, then remove bams and keep .bai files in the manifest file, then this:
Thanks for your respond Hamid.
I am going to make a dummy bam file and then use samtools idxstats to retrieve the number of reads from index file. The pitfall of this method is that you can't separate reads based on their flag/MAPQ, so for example you might count some of the reads that are non-uniquely mapped twice.
Make a dummy BAM file (or any BAM file for this matter)
Use samtools idxstats DUMMY.BAM to find the coverage info from each individual bam.bai file
Note: if you are running a script to count them one by one, at each step you should change the name of dummy.bam to the name of bam.bai file, so idxstats can read it!
Sum up the results from third column (mapped reads) for whichever sequence name you like (usually chr1:chrY)
If you already granted access to TCGA raw files, create a manifest for those samples, then remove
bam
s and keep.bai
files in the manifest file, then this:will download
.bai
files. Not sure what info you might get on read number from abai
file though.Thanks for your respond Hamid. I am going to make a dummy bam file and then use samtools idxstats to retrieve the number of reads from index file. The pitfall of this method is that you can't separate reads based on their flag/MAPQ, so for example you might count some of the reads that are non-uniquely mapped twice.