GATK multiple samples
2
0
Entering edit mode
5.2 years ago
evelyn ▴ 230

Dear all,

I am trying to call SNPs for multiple sorted bam files altogether using GATK. I am using this code:

    INPUT_DIR=path
    REF_DIR=path

    INPUT=$(ls -1 all/*.sorted.bam | sed -n ${RUN}p)
    SAMPLE=$(basename "${INPUT}/" .sorted.bam)

    OUT_DIR=path

    java -jar picard.jar ValidateSamFile \
    I=${INPUT_DIR}/${SAMPLE}.sorted.bam \
    MODE=SUMMARY
    java -jar picard.jar AddOrReplaceReadGroups \
    I=${INPUT_DIR}/${SAMPLE}.sorted.bam \
    O=${OUT_DIR}/${SAMPLE}.gatk.bam \
    RGID=1 \
    RGLB=lib1 \
    RGPL=illumina \
    RGPU=unit1 \
    RGSM=20
    java -jar picard.jar ValidateSamFile \
    I=${OUT_DIR}/${SAMPLE}.gatk.bam \
    MODE=SUMMARY
    samtools index ${OUT_DIR}/${SAMPLE}.gatk.bam
    gatk --java-options "-Xmx4G" HaplotypeCaller -R ${REF_DIR}/ref.fa -I ${OUT_DIR}/${SAMPLE}.gatk.bam -O ${OUT_DIR}/final.vcf

However, I think it is not taking multiple files altogether. I am getting vcf for only one sample but I want one vcf for all samples together. Can you please help me figure out how to get a single vcf files for all the sorted bam files in INPUT_DIR. Thank you for your time and help!

snp • 5.9k views
ADD COMMENT
2
Entering edit mode
5.2 years ago
Dave Carlson ★ 2.1k

You need to run HaplotypeCaller individually for each of your bam files using the "-ERC GVCF" flag. This will produce once gvcf file for each of your bam files. Then you would combine each of the GVCF files produced by HaplotypeCaller (e.g., with gatk's CombineGVCFs tool) into a single GVCF file. Finally, you run GenotypeGVCFs on the combined GVCF file to get your SNPs.

So for example, run Haplotype caller like this for each of your bams:

 gatk --java-options "-Xmx4g" HaplotypeCaller  \
   -R $reference \
   -I input.bam \
   -O output1.g.vcf.gz \
   -ERC GVCF

Then, combine all your gvcf files:

gatk --java-options "-Xmx96g -Xms96g" CombineGVCFs \
-R $reference \
--variant output1.g.vcf.gz \
--variant output2.g.vcf.gz \
--variant output3.g.vcf.gz \
-O combined.g.vcf.gz

Finally, perform joint-genotyping on the combined gvcf:

gatk --java-options "-Xmx96g -Xms96g" GenotypeGVCFs \
 -R $reference \
-V combined.g.vcf.gz \
-O final.vcf.gz

See the Best Practices guide for more detailed information.

ADD COMMENT
2
Entering edit mode
5.2 years ago

run a loop:

ls -1 all/*.sorted.bam | while read BAM
do
   (...)
    samtools index ${OUT_DIR}/${SAMPLE}.gatk.bam
   echo "${OUT_DIR}/${SAMPLE}.gatk.bam" >> all_bams.list
done

gatk --java-options "-Xmx4G" HaplotypeCaller -R ${REF_DIR}/ref.fa -I all_bams.list  -O ${OUT_DIR}/final.vcf

anyway, you'd better use a worklfow manager like snakemake or nextflow.

ADD COMMENT
0
Entering edit mode

Thank you, Pierre! I tried running:

INPUT_DIR=path
    REF_DIR=path

    INPUT=$(ls -1 all/*.sorted.bam | sed -n ${RUN}p)
    SAMPLE=$(basename "${INPUT}/" .sorted.bam)

    OUT_DIR=path

ls -1 all/*.sorted.bam | while read BAM
do

    java -jar picard.jar ValidateSamFile \
    I=${INPUT_DIR}/${SAMPLE}.sorted.bam \
    MODE=SUMMARY
    java -jar picard.jar AddOrReplaceReadGroups \
    I=${INPUT_DIR}/${SAMPLE}.sorted.bam \
    O=${OUT_DIR}/${SAMPLE}.gatk.bam \
    RGID=1 \
    RGLB=lib1 \
    RGPL=illumina \
    RGPU=unit1 \
    RGSM=20
    java -jar picard.jar ValidateSamFile \
    I=${OUT_DIR}/${SAMPLE}.gatk.bam \
    MODE=SUMMARY
    samtools index ${OUT_DIR}/${SAMPLE}.gatk.bam
    echo " ${OUT_DIR}/${SAMPLE}.gatk.bam" >> all_bams.list
done

But instead of reading all samples, it just reads the last sample in the directory.

ADD REPLY
0
Entering edit mode

move

INPUT=...

into the loop... of course

ADD REPLY
0
Entering edit mode

Hi Perre!

Even after moving INPUT= into the loop, it still considers only the last sample and generated both gatk.bam and index file for one sample only. Thank you!

ADD REPLY
0
Entering edit mode

show us the code...

ADD REPLY
0
Entering edit mode

Hello Pierre,

Here is the code:

INPUT_DIR=path
REF_DIR=path

ls -1 all/*.sorted.bam | while read BAM
do
INPUT=$(ls -1 all/*.sorted.bam | sed -n ${RUN}p)
SAMPLE=$(basename "${INPUT}/" .sorted.bam)

OUT_DIR=path

java -jar picard.jar ValidateSamFile \
I=${INPUT_DIR}/${SAMPLE}.sorted.bam \
MODE=SUMMARY
java -jar picard.jar AddOrReplaceReadGroups \
I=${INPUT_DIR}/${SAMPLE}.sorted.bam \
O=${OUT_DIR}/${SAMPLE}.gatk.bam \
RGID=1 \
RGLB=lib1 \
RGPL=illumina \
RGPU=unit1 \
RGSM=20
java -jar picard.jar ValidateSamFile \
I=${OUT_DIR}/${SAMPLE}.gatk.bam \
MODE=SUMMARY
samtools index ${OUT_DIR}/${SAMPLE}.gatk.bam
echo " ${OUT_DIR}/${SAMPLE}.gatk.bam" >> all_bams.list
done
gatk --java-options "-Xmx4G" HaplotypeCaller -R ${REF_DIR}/ref.fa -I all_bams.list  -O ${OUT_DIR}/final.vcf
ADD REPLY
0
Entering edit mode

INPUT should extract the data from the "${BAM}" variable.

At this point, I'm sorry, but I'll stop helping you.

ADD REPLY
0
Entering edit mode

Duplicate answer....

ADD REPLY

Login before adding your answer.

Traffic: 3772 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6