Question

Coverage data for TCGA BAM files

0

3.5 years ago

enho ▴ 60

Hi everyone,

I am currently analyzing pan-cancer TCGA WXS data (>10,000 tumor-normal pairs). For my analysis I need the total coverage of each BAM file, so that I can depth-normalize each tumor sample against its matched normal.

Does anyone know where I can get the total read count per file without downloading ~10,000 BAM files?

My initial idea was to find the .bam.bai files and use them to find the total number of reads, but I couldn't find those files either.

Appreciate your help, Thanks

exome number tcga copy dna whole coverage • 1.9k views

1

If you have already been granted access to the TCGA raw files, create a manifest for those samples, then remove the BAM entries from the manifest file and keep only the .bai entries. Then this:

gdc-client download -m gdc_manifest_XXXXX.txt  -t gdc-user-token.XXXXX.txt

will download .bai files. Not sure what info you might get on read number from a bai file though.
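The manifest filtering step could be sketched as below. This is only a sketch under assumptions: it assumes the manifest is the usual tab-separated GDC file with a header row and the file name in the second column, and that it lists .bai entries at all. The function name keep_bai_rows and the output file name are made up for illustration.

```shell
# Keep the header row plus every row whose filename column (2nd) ends in .bai.
# Assumes a tab-separated GDC manifest: id, filename, md5, size, state.
keep_bai_rows() {
    awk -F '\t' 'NR == 1 || $2 ~ /\.bai$/' "$1"
}

# Hypothetical usage (file names are placeholders):
# keep_bai_rows gdc_manifest_XXXXX.txt > gdc_manifest_bai_only.txt
```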

ADD REPLY • link 3.5 years ago by Hamid Ghaedi 3.3k

0

Thanks for your response, Hamid. I am going to make a dummy BAM file and then use samtools idxstats to retrieve the number of reads from the index file. The pitfall of this method is that you can't filter reads by flag or MAPQ, so, for example, reads that are non-uniquely mapped might be counted twice.

ADD REPLY • link 3.5 years ago by enho ▴ 60

score 3 · Accepted Answer · 2021-05-28

For anyone reading this later, I figured out how to do it. The scripts are in R and use the Bioconductor package "GenomicDataCommons".

First, get a manifest for the BAM files. You can't request a manifest of .bai files directly; you can only get a manifest of the BAM files.

library(magrittr)  # provides %>%

# BAM manifest for one project (TCGA-KICH), WXS only
manifest = GenomicDataCommons::files() %>%
       GenomicDataCommons::filter(~ cases.project.project_id == "TCGA-KICH" &
              experimental_strategy == "WXS" &
              data_format == "BAM") %>%
       GenomicDataCommons::manifest()

Using the UUIDs of the BAM files, run this query against the GDC API (according to here) to look up each file's index:

# For each BAM UUID, ask the GDC API for that file's index_files entry
manifest.bai = lapply(manifest$id, function(uuid) {
       con = curl::curl(paste0("https://api.gdc.cancer.gov/files/", uuid, "?pretty=true&expand=index_files"))
       tbl = jsonlite::fromJSON(con)
       bai = data.table::data.table(id = tbl$data$index_files$file_id,
                   filename = tbl$data$index_files$file_name,
                   md5 = tbl$data$index_files$md5sum,
                   size = tbl$data$index_files$file_size,
                   state = tbl$data$index_files$state)
       return(bai)
})
# Combine the per-file tables into one manifest-like table
manifest.bai = data.table::rbindlist(manifest.bai)

Download the .bai files.
Make a dummy BAM file (or use any BAM file, for that matter).

Use samtools idxstats DUMMY.BAM to read the per-sequence read counts stored in each individual .bai file.

    Note: if you are running a script to process them one by one, at each step rename the dummy BAM to match the current .bai file (for example sample.bam next to sample.bam.bai), so that idxstats picks up that index!

Sum the third column (mapped reads) over whichever sequence names you want (usually chr1 through chrY).
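
The last three steps could be scripted roughly as follows. This is a sketch, not the exact script used for this answer. The function names, dummy.bam, and the bai/ directory are placeholders. It also assumes that the dummy BAM's header lists the same reference sequences as the original BAMs, since as far as I know idxstats takes sequence names and lengths from the BAM header and only the counts from the index.

```shell
#!/bin/sh
# Map an index file name to the BAM name that samtools would pair it with.
bam_name_for_index() {
    case "$1" in
        *.bam.bai) printf '%s\n' "${1%.bai}" ;;       # x.bam.bai -> x.bam
        *.bai)     printf '%s\n' "${1%.bai}.bam" ;;   # x.bai     -> x.bam
        *)         return 1 ;;                        # not an index file
    esac
}

# Sum column 3 (mapped reads) of idxstats output read from stdin,
# restricted to chr1..chr22, chrX, chrY (with or without the "chr" prefix).
sum_mapped() {
    awk -F '\t' '$1 ~ /^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y)$/ { total += $3 }
                 END { printf "%.0f\n", total }'
}

# For each .bai given, place a copy of the dummy BAM under the matching name,
# run idxstats, print "<index file><TAB><summed mapped reads>", then clean up.
count_reads_from_indexes() {
    dummy=$1
    shift
    for bai in "$@"; do
        bam=$(bam_name_for_index "$bai") || continue
        [ -e "$bam" ] && continue      # never overwrite an existing BAM
        cp "$dummy" "$bam"
        printf '%s\t%s\n' "$bai" "$(samtools idxstats "$bam" | sum_mapped)"
        rm -f "$bam"
    done
}

# Hypothetical usage (dummy.bam and bai/ are placeholders):
# count_reads_from_indexes dummy.bam bai/*.bai > mapped_reads.tsv
```

As mentioned in the comments, these counts come straight from the index, so they can't be filtered by flag or MAPQ.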