Hi everyone. As described, I have 50 large mapped .bam files (human exome, 50 individuals, about 3 GB each on average) that have no read groups (RGs), so I want to use Picard AddOrReplaceReadGroups to add them. My questions:
1: For each file I plan to set, for example, RGID=(1,2,3..50), RGLB=(Lb.1,2,3..50), RGPL=ILLUMINA and RGSM=(Tibet1,2,3..50). Is it correct to give each .bam file a different RG in this way?
2: After getting the 50 new bams with RGs added, I will use GATK to do base quality score recalibration (BQSR) and local realignment. Should I do this 50 times, once for each bam? Or can I merge the 50 bams into a single one for these steps and for downstream analysis such as SNP calling? If so, how should I merge them, and what should I watch out for?
Also, can somebody tell me how the 1000 Genomes Project does this, given that it has such a large amount of data?
3: If not, I would have to produce another 100 bams, 400-500 GB in total: 50 from GATK -T TableRecalibration for BQSR and 50 from IndelRealigner for local realignment. The computational cost is extremely high. Is there another way?
4: For steps like BQSR, local realignment and VQSR, there are known-variant vcf files to use for human. But if the data come from another species that has no known vcf, how should I set these parameters (for example, -knownsite)?
I would appreciate a timely reply. Thanks!
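For concreteness, the read-group scheme in question 1 could be generated with a small script like the one below. It is only a sketch: the jar path `picard.jar`, the input and output file names, and the `RGPU` value are placeholders I made up, and I am reading `Lb.1,2,3..50` as `Lb.1` to `Lb.50`. As far as I know Picard also expects a platform unit (RGPU), which my scheme above does not mention. The script only builds the command lines and does not run them.

```python
from typing import List


def read_group_command(index: int, in_bam: str, out_bam: str) -> List[str]:
    """Build one Picard AddOrReplaceReadGroups command line (not executed here)."""
    return [
        "java", "-jar", "picard.jar",  # hypothetical jar location
        "AddOrReplaceReadGroups",
        f"INPUT={in_bam}",
        f"OUTPUT={out_bam}",
        f"RGID={index}",           # one read-group ID per individual
        f"RGLB=Lb.{index}",        # one library per individual
        "RGPL=ILLUMINA",
        f"RGPU=unit{index}",       # placeholder platform unit, not from the question
        f"RGSM=Tibet{index}",      # sample name, distinct for each individual
    ]


# Hypothetical file names sample1.bam ... sample50.bam
commands = [
    read_group_command(i, f"sample{i}.bam", f"sample{i}.rg.bam")
    for i in range(1, 51)
]

# Show the first command so it can be checked by eye before running anything
print(" ".join(commands[0]))
```

The point of the scheme is that RGSM differs for each individual, so the samples should stay distinguishable by their read groups even if the files are later merged.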
Sincere thanks! So you mean I can merge them into one single big bam for the downstream analysis, which settles my 3rd question. As for the 4th, I was asking not about the vcf format but about its usage. For example, with -T IndelRealigner [--known /path/to/indels.vcf], what if I have no known vcf file for this species but still want to use this parameter?
Yes. You can use samtools to obtain SNPs that you can treat as a preliminary call set and pass them as input to GATK. See this: http://samtools.sourceforge.net/mpileup.shtml
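A rough sketch of that preliminary call, written as a small Python helper that only assembles the shell pipeline. The file names `ref.fa`, `merged.bam` and `preliminary.vcf` are placeholders. It assumes a recent bcftools, where the pileup step is `bcftools mpileup`; the linked page describes the older form, with `samtools mpileup` piped into bcftools. No filtering is applied here, so you would probably want to keep only the confident calls before handing the vcf to GATK.

```python
import shlex
from typing import List


def preliminary_call_pipeline(ref_fasta: str, bam: str, out_vcf: str) -> str:
    """Return a shell pipeline string for a first-pass variant call (not executed)."""
    # Compute genotype likelihoods from the alignments against the reference
    pileup: List[str] = ["bcftools", "mpileup", "-f", ref_fasta, bam]
    # Multiallelic caller (-m), variant sites only (-v), uncompressed VCF output (-Ov)
    call: List[str] = ["bcftools", "call", "-mv", "-Ov", "-o", out_vcf]
    return " | ".join(shlex.join(part) for part in (pileup, call))


# Placeholder file names; replace them with your own reference and bam
pipeline = preliminary_call_pipeline("ref.fa", "merged.bam", "preliminary.vcf")
print(pipeline)
```

The resulting vcf can then be given to GATK wherever it asks for known sites.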