GATK GenomicsDBImport - use list as input
4.3 years ago
gabi ▴ 30

Hello,

I am using GenomicsDBImport, but my input is a file containing a list of files, and I don't know how to pass it to GATK with the -V option. I've looked everywhere, but I can't find this information in the GATK manual.

gvcf_list=gvcf.listaa

java -jar gatk-package-4.1.4.0-local.jar GenomicsDBImport \
        -R HG38/hs38.fa \
        --genomicsdb-workspace-path /MY_DATABASE/$newdir \
        -V $gvcf_list \
        -L resources_broad_hg38_v0_wgs_calling_regions.hg38.interval_list

Thank you in advance

GATK germline VQSR GenomicsDBImport • 4.9k views

Hi Bari.ballew,

We could also do this as:

for i in *.vcf.gz; do echo $(bcftools query -l "$i"); echo "$i"; done | paste - -

Either script gives me the output below:

sample_11       sample_11_HCcalls.g.vcf.gz
sample_23      sample_23_HCcalls.g.vcf.gz
sample_9 sample_9_HCcalls.g.vcf.gz
sample_45       sample_45_HCcalls.g.vcf.gz

It is probably my OCD :) and pardon the basic question, but what is bothering me is this: why is the tab spaced so wide for every sample other than sample_9? I checked by opening the file in vi, and they are all tabs with no spaces, yet sample_9 is off! Even with a single digit I don't think it should look that way, should it? I am worried that if I don't fix this and feed the file into GenomicsDBImport, I may eventually get an error or a wacky DB, and it would end up being more work. It is just bothering me!! Any input will help!! Thank you


Oops! I placed this in the wrong spot; it should come after bari.ballew's answer. My bad, but you get the idea :)


Sure! As the Perl programmers say, TIMTOWTDI!

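It's because of the different number of characters. sample_9 has one fewer character than the other samples, and a tab advances to the next tab stop rather than adding a fixed width, so the visible gap depends on the length of the text before it. Each line still contains a single tab, which is what matters for a tab-delimited sample map.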
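If you want reassurance that the map is cleanly tab-delimited no matter how an editor draws it, a quick field count is one way to check. This is only a sketch: the temporary file and the two rows are made-up stand-ins for your real sample map.

```shell
# Hypothetical two-line sample map, written with printf so the tabs are real.
map=$(mktemp)
printf 'sample_11\tsample_11_HCcalls.g.vcf.gz\n' >  "$map"
printf 'sample_9\tsample_9_HCcalls.g.vcf.gz\n'   >> "$map"

# Count lines that do not have exactly 2 tab-separated fields.
bad_lines=$(awk -F'\t' 'NF != 2 {n++} END {print n+0}' "$map")
echo "malformed lines: $bad_lines"
```

A count of 0 means every line is name, one tab, path, whatever the on-screen spacing looks like. Point awk at your own file in place of "$map" to run the same check.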

4.3 years ago
bari.ballew ▴ 470

Hi there! First, your sample file needs to be formatted like this: sample_name<tab>path/to/sample.vcf

If you have multi-sample VCFs (or gVCFs), you can generate a list of samples in each VCF using bcftools like this:

n=$(bcftools query -l <your.vcf>)
printf '%s\t%s\n' "$n" "<your.vcf>" > sample.map

For multiple VCFs, just cat together your sample map files.
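For many files, the per-file step and the concatenation can be folded into one loop. The following is a rough sketch, not a tested pipeline: the function name build_sample_map, the directory layout, and the *.g.vcf.gz suffix are all assumptions, and it expects one sample per gVCF.

```shell
# Sketch: write one "sample<TAB>path" line per single-sample gVCF in a directory.
# build_sample_map is a hypothetical helper; adjust the glob to your file naming.
build_sample_map() {
    gvcf_dir=$1
    out_map=$2
    : > "$out_map"                            # start with an empty map
    for gvcf in "$gvcf_dir"/*.g.vcf.gz; do
        [ -e "$gvcf" ] || continue            # skip if the glob matched nothing
        sample=$(bcftools query -l "$gvcf")   # sample name from the VCF header
        printf '%s\t%s\n' "$sample" "$gvcf" >> "$out_map"
    done
}

# Example call (paths are placeholders):
# build_sample_map /path/to/gvcfs sample.map
```

printf is used instead of echo so that the tab is written literally in any shell.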

Finally, you can run GenomicsDBImport like this (customize the options and resource allocation as needed):

gatk --java-options "-Xmx20G" GenomicsDBImport \
    --sample-name-map <sample.map> \
    --genomicsdb-workspace-path <output/path/for/database> \
    -L <interval> \
    --tmp-dir=<temp_directory>

Hope that helps!
