Coverage data for TCGA BAM files
1
0
Entering edit mode
3.5 years ago
enho ▴ 60

Hi everyone,

I am currently performing an analysis on pancan TCGA wxs data (>10000 normal-tumor pairs) . For my analysis I need to have the total coverage of each BAM file, so I can perform a depth normalization for on tumor vs matched normal sample.

Does anyone know where can I get total read number without downloading ~10000 samples BAM files?

My initial idea was to find the .bam.bai files and use them to find the total number of reads, but I couldn't find those files either.

Appreciate your help, Thanks

exome number tcga copy dna whole coverage • 1.9k views
ADD COMMENT
1
Entering edit mode

If you already granted access to TCGA raw files, create a manifest for those samples, then remove bams and keep .bai files in the manifest file, then this:

gdc-client download -m gdc_manifest_XXXXX.txt  -t gdc-user-token.XXXXX.txt

will download .bai files. Not sure what info you might get on read number from a bai file though.

ADD REPLY
0
Entering edit mode

Thanks for your respond Hamid. I am going to make a dummy bam file and then use samtools idxstats to retrieve the number of reads from index file. The pitfall of this method is that you can't separate reads based on their flag/MAPQ, so for example you might count some of the reads that are non-uniquely mapped twice.

ADD REPLY
3
Entering edit mode
3.5 years ago
enho ▴ 60

For anyone reading this later, I figured out how to do it: (Scripts are in R) (package "GenomicDataCommons" in bioconductor is used)

  1. First get a Manifest for BAM files. Initially you can't get a manifest of .BAI files, you can only get BAM files manifest
manifest = GenomicDataCommons::files() %>%   
       GenomicDataCommons::filter(~ cases.project.project_id == "TCGA-KICH" &
       experimental_strategy == "WXS" &
       data_format == "BAM") %>%   GenomicDataCommons::manifest()
  1. Using UUID of BAM files, run this query (according to here)
manifest.bai = lapply(manifest$id, function(uuid) {
       con = curl::curl(paste0("https://api.gdc.cancer.gov/files/", uuid, "?pretty=true&expand=index_files"))
       tbl = jsonlite::fromJSON(con)
       bai = data.table(id = tbl$data$index_files$file_id,
                   filename = tbl$data$index_files$file_name,
                   md5 = tbl$data$index_files$md5sum,
                   size = tbl$data$index_files$file_size,
                   state = tbl$data$index_files$state)
       return(bai)
})
  1. Download the files
  2. Make a dummy BAM file (or any BAM file for this matter)
  3. Use samtools idxstats DUMMY.BAM to find the coverage info from each individual bam.bai file
        Note: if you are running a script to count them one by one, at each step you should change the name of dummy.bam to the name of bam.bai file, so idxstats can read it!
    
  4. Sum up the results from third column (mapped reads) for whichever sequence name you like (usually chr1:chrY)
ADD COMMENT

Login before adding your answer.

Traffic: 2455 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6