Question

How can I deal with 50 very large BAM files without read groups (RG) for SNP calling?

1

13.2 years ago

Chris ▴ 40

Hi everyone. As described, I have 50 large mapped BAM files (human exome, 50 individuals, about 3 GB each on average) that have no read groups (RGs), so I want to use Picard AddOrReplaceReadGroups to add them. My questions:

1: For each file I would set, for example, RGID=(1,2,3..50), RGLB=(Lb.1,2,3..50), RGPL=ILLUMINA, RGSM=(Tibet1,2,3..50). Is this the right way to add a different RG to each BAM file?
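The per-file read-group assignment described in question 1 could be scripted along these lines. This is only a minimal sketch: the file names (`sample1.bam`, `sample1.rg.bam`) and the `RGPU` value are hypothetical placeholders not given in the post (Picard also expects a platform unit), and the per-tool jar name assumes the Picard layout of that era.

```python
import shlex


def read_group_command(index, in_bam, out_bam, jar="AddOrReplaceReadGroups.jar"):
    """Build one Picard AddOrReplaceReadGroups command line for sample `index`."""
    args = [
        "java", "-jar", jar,
        f"INPUT={in_bam}",
        f"OUTPUT={out_bam}",
        f"RGID={index}",          # unique read group ID per file
        f"RGLB=Lb.{index}",       # library name, as proposed in the question
        "RGPL=ILLUMINA",          # sequencing platform
        f"RGPU=unit{index}",      # assumption: placeholder platform unit
        f"RGSM=Tibet{index}",     # sample name, one per individual
    ]
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical input/output names; replace with the real BAM paths.
commands = [
    read_group_command(i, f"sample{i}.bam", f"sample{i}.rg.bam")
    for i in range(1, 51)
]
print(commands[0])
```

Printing the commands first makes it easy to check the 50 ID/sample pairs before actually running anything.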

2: After getting the 50 new BAMs with RGs added, I will use GATK for base quality score recalibration (BQSR) and local realignment. Should I do this 50 times, once for every BAM? Or can I merge the 50 BAMs into a single one for these steps and for downstream analysis such as SNP calling? If so, how do I merge them, and what should I watch out for?

Also, can somebody tell me how the 1000 Genomes Project does this, given its large amount of data?

3: If not, I would have to produce another 100 BAMs, 400-500 GB in total: 50 from GATK -T TableRecalibration (BQSR) and 50 from IndelRealigner (local realignment). The computational cost is extremely high. Is there another way?

4: BQSR, local realignment and VQSR all use VCFs of known variants, which exist for human. If the data come from another species with no known VCF, how should I set these parameters (for example, -knownSites)?

I would appreciate a timely reply. Thanks!

gatk snp • 3.1k views

ADD COMMENT • link updated 13.2 years ago by Arun 2.4k • written 13.2 years ago by Chris ▴ 40

score 1 · Answer 1 · 2012-05-17

1

13.2 years ago

Arun 2.4k

1) Yes, that looks right. The GATK FAQ should provide some information on this: http://www.broadinstitute.org/gsa/wiki/index.php/Frequently_Asked_Questions#What.27s_the_meaning_of_the_standard_read_group_fields
2) You would want to merge the BAM files; that is one of the purposes of having unique read group IDs. You can use Picard's MergeSamFiles to merge all the files together (it works on both SAM and BAM files), and BuildBamIndex to create the BAM index file (.bai).
3) Sorry, I don't understand.
4) We wrote a script to convert our files to VCF format. The VCF format requires 8 mandatory columns; the script is very straightforward, it just takes time to write.
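For point 2, the merge and index steps could be assembled roughly as follows. This is a sketch, not a tested pipeline: the input names and `merged.bam` are hypothetical, and the jar names assume the one-jar-per-tool Picard distribution.

```python
def merge_commands(bams, merged="merged.bam"):
    """Return (merge, index) argument lists for Picard MergeSamFiles and BuildBamIndex."""
    merge = ["java", "-jar", "MergeSamFiles.jar"]
    merge += [f"INPUT={bam}" for bam in bams]   # one INPUT= per read-grouped BAM
    merge.append(f"OUTPUT={merged}")
    index = ["java", "-jar", "BuildBamIndex.jar", f"INPUT={merged}"]
    return merge, index


# Hypothetical names for the 50 BAMs that already carry read groups.
rg_bams = [f"sample{i}.rg.bam" for i in range(1, 51)]
merge, index = merge_commands(rg_bams)
print(" ".join(merge[:5]), "...")
print(" ".join(index))

# To actually run them (requires Java and the Picard jars on disk):
# import subprocess
# subprocess.run(merge, check=True)
# subprocess.run(index, check=True)
```

Because each input keeps its own RGID and RGSM, the reads in the merged file can still be traced back to the individual they came from.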

ADD COMMENT • link 13.2 years ago by Arun 2.4k

0

Sincere thanks! So you mean I can merge them into one single big BAM for the downstream analysis, which settles my 3rd question. For the 4th, I was asking not about the VCF format but about its usage. For example, with -T IndelRealigner [--known /path/to/indels.vcf], what if I have no known VCF file for this species but still want to use this parameter?

ADD REPLY • link 13.2 years ago by Chris ▴ 40

0

Yes. You can use samtools to obtain SNPs that you could treat as preliminary calls and pass them as input to GATK. See: http://samtools.sourceforge.net/mpileup.shtml
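A preliminary call of the kind suggested above might be built like this. It is a hedged sketch only: `ref.fa`, `merged.bam` and `prelim.vcf` are hypothetical names, and the option syntax follows the samtools 0.1.x / `bcftools view` usage on the linked mpileup page; later releases changed these commands.

```python
import shlex


def preliminary_call_pipeline(ref_fasta, bams, out_vcf):
    """Build a shell pipeline: samtools mpileup piped into bcftools view (0.1.x syntax)."""
    mpileup = ["samtools", "mpileup", "-uf", ref_fasta, *bams]
    caller = ["bcftools", "view", "-vcg", "-"]   # -v variants only, -c call, -g genotypes
    left = " ".join(shlex.quote(a) for a in mpileup)
    right = " ".join(shlex.quote(a) for a in caller)
    return f"{left} | {right} > {shlex.quote(out_vcf)}"


# Hypothetical file names; the resulting VCF is what would be passed to GATK.
pipeline = preliminary_call_pipeline("ref.fa", ["merged.bam"], "prelim.vcf")
print(pipeline)
```

The resulting VCF would then stand in for the missing known-variants file in the GATK steps that ask for one.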

ADD REPLY • link 13.2 years ago by Arun 2.4k