How to make bash script operate fast for a 13gb VCF.gz file with 900000 rows and 10000 fields?
5
1
Entering edit mode
6.4 years ago
DanielC ▴ 170

Dear Friends,

I mostly script in shell/bash as am comfortable with it and also use a lot of linux commands and Entrez utilities for various big file analysis; however, when I stared working on VCF files, dealing with million rows and columns I find issues in operating these files with bash. For instance, below is an example script to count number of SNPs for samples(TCGA...barcode...) in a VCF file:

for i in {10..10000}
do
snp_count=$(zcat xxx.vcf.gz | grep -v "##" | awk "{print \$$i}" | grep -v "TCGA" | grep "0/1\|1/1" | wc -l)
sample=$(zcat xxx.vcf.gz | grep -v "##" | awk "{print \$$i}" | grep "TCGA")
echo -e "$sample\t$snp_count\n"
done

Output:
TCGA...barcode1     60123
TCGA...barcode2      45245
.
.

The script works but it has been running for 2 days now, and will take time. I would really appreciate if you could share/suggest on the best ways to deal with VCF files using bash/shell or if another programming language could make a difference? However, if possible, I would prefer bash scripting ways to tackle VCF files.

Thanks, DK

vcf bash • 6.7k views
ADD COMMENT
7
Entering edit mode

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

But I think you should write a script to avoid read the data file thousands times.

ADD REPLY
1
Entering edit mode

At the very least just leave the file decompressed...

ADD REPLY
2
Entering edit mode

use the tools meant for vcf parsing (IMO).@OP

ADD REPLY
2
Entering edit mode

I don't understand why we see TCGA in your Output, because there is a grep -v "TCGA".

What is that awk "{print \$$i}" ?

ADD REPLY
0
Entering edit mode

@OP and @PIerre! OP's code and output are incorrect (IMO) for following reasons.

For eg in OP loop (as pointed out in your post):

sample=$(zcat xxx.vcf.gz | grep -v "##" | awk "{print \$$i}" | grep -v "TCGA" | grep "0/1\|1/1" | wc -l)

There are several problems with this code (IMO). But last one (wc -l) counts and lists the number. So variable sample is a number not a string. grep is done only for unphased genotypes.

snp_count=$(zcat xxx.vcf.gz | grep -v "##" | awk "{print \$$i}" | grep -v "TCGA")

The end command grep -v "TCGA prints the columns (from awk field) with column names not matching with string TCGA. This no count and snp_count variable should output columns not strings.

As I understand, OP is negating bash variable inside awk with {print \$$i} There are other ways to call bash variable inside awk.

Line echo -e "$sample\t$snp_count\n" as per the code, should output number first (wc -l from sample variable) and columns without sample string (TCGA) ( grep -v from sample_number). But in OP, output is other way around and is also incorrect as it should be printing columns, instead of strings. There is some thing wrong with code and output.

If OP is clear about what OP wants with example data and expected output, that would be more helpful.

ADD REPLY
0
Entering edit mode

Thanks! Made a mistake in putting the code here, I have corrected it.

ADD REPLY
2
Entering edit mode

Decompress the file will help. Set your system's locale to ASCII with LC_ALL=C should speed things up too. You might also be running out of memory.

I agree with @Shenwei356 that you should write a custom VCF parser to avoid reading the files so many times. You're only interested in the reading the letter once; why open the envelope so many times?

ADD REPLY
1
Entering edit mode

Hello DK,

I haven't such a big vcf file for testing here. So I don't know how my suggestions will perform.

First have a look at the output of bcftools stats if this maybe already fits your needs.

Another way I have tested with a small dataset here, is to extract the genotype for each sample, replace the genotype by 0 of it's hom REF or unknown otherwise by 1 to indicate that there is a variant. You will get a matrix where each column is for one sample and each row is for each variant.

1 1 1 
0 1 1 
1 1 1 
1 0 0 
1 1 1 
1 1 1 
1 1 1 
1 1 1 
0 0 1 
0 1 1

On can then sum up each column to get the number of variants in each sample.

$ bcftools query -f '[%GT ]\n' input.vcf|sed -E  's/(0\/0|\.)/0/g;s/[0-9]\/[1-9]/1/g'|datamash -t\  sum 1-3 > SNPcounts.txt

Using gnu parallel, like already suggested by the other, will improve the speed, if you do this for example for each chromosome.

fin swimmer

ADD REPLY
0
Entering edit mode

DK : You now have multiple answers that look interesting. Once you have had a chance to test (if you can do all that would be great) please come back and validate the ones that work (and others that don't) by accepting the answers.

ADD REPLY
0
Entering edit mode

I will certainly. Thanks!

ADD REPLY
3
Entering edit mode
6.4 years ago

Here is oneliner using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ wget  -O - "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz" |\
gunzip -c |\
java -jar dist/bioalcidaejdk.jar -e 'stream().flatMap(V->V.getGenotypes().stream()).filter(G->!(G.isHomRef()||G.isNoCall())).map(G->G.getSampleName()).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((k,v)->println(k+" "+v));' -F VCF
(...)
NA20852 441
HG02489 1233
HG02008 112
NA11919 905
HG02002 119
HG01398 133
HG03577 837
HG02009 1217
NA18549 819

detail:

  • stream(). get the stream of variants
  • flatMap(V->V.getGenotypes().stream()). convert it to a stream of genotypes
  • filter(G->!(G.isHomRef()||G.isNoCall())). remove the hom_ref and the nocall
  • map(G->G.getSampleName()). extract the sample name from the genotype
  • collect(Collectors.groupingBy(Function.identity(),Collectors.counting())). count the number of distinct names
  • forEach((k,v)->println(k+" "+v)); print it
ADD COMMENT
4
Entering edit mode
6.4 years ago

Hello again,

some more improvements on my previous code and an explanation to it. On my laptop with 4 cores and 16GB Ram the run time is < 7 minutes for the test dataset.

The Data

For testing I've download the vcf file for chromosome 1 with genotypes from the 1000 genomes project. This file is 1.1GB large and consists of around 6 500 000 variants in around 2 500 samples.

The columns look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 [...]
1       10177   rs367896724     A       AC      100     PASS    AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL GT      1|0     0|1
1       10235   rs540431307     T       TA      100     PASS    AC=6;AF=0.00119808;AN=5008;NS=2504;DP=78015;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.0051;AA=|||unknown(NO_COVERAGE);VT=INDEL  GT      0|0     0|0

For the code it is important to see, that:

  • the sample data starts in column 10
  • the genotypes are given in phased format e.g. 0|0, 0|1
  • a NON-ALT genotype looks like this: 0|0

The Task

Create a list with the total count of variants for each sample in the vcf file. The output should look like this:

NA21141 133382
NA21142 131027
NA21143 130295
NA21144 131932

The Code

$ zcat input.vcf.gz|grep -v "#"|cut -f10-|parallel --pipe -N200000 -j4 'mawk -f count.awk'|mawk -v OFS="\t" -f sum.awk > snp_counts.txt

For readability reasons I've put the awk code in a file and haven't write it inline.

count.awk

{ for(i=1; i<=NF; i++) { if($i !~ /0|0/) snp[i]++ } }

END {for (id in snp) print id, snp[id]}

sum.awk

{snp[$1] += $2} END {for (id in snp) print id, snp[id]}

Let's go through the command line:

zcat input.vcf.gz: decompress the vcf file and print the data to stdout.

grep -v "#": We have to get rid of header lines.

cut -f10-: As we know that the samples data start at column 10 we can discard the columns before.

parallel --pipe -N200000 -j4 'mawk -f count.awk': We split up the list of variants in chunks of 200 000 lines and run the counting of variants in each sample in 4 threads in parallel.

awk -v OFS="\t" -f sum.awk: We combine the results of the parallel counting of variants by sum up the values for each sample.

What are the awk script doing?

{ for(i=1; i<=NF; i++) { if($i !~ /0|0/) snp[i]++ } }

We iterate over each sample column and check whether it contains 0|0. If it doesn't we assume that the sample has a variant here. We increment the value in the list, where the index is the same as the column number.

END {for (id in snp) print id, snp[id]}

If we reach the end, we print out each column number and the number of variants we have found.

{snp[$1] += $2}

We sum up the count for each column number.

{for (id in snp) print id, snp[id]}

And print out again the column number with the number of variants.

One last step is needed. We have to replace the column number with the sample names. There are many ways to do so with grep, awk, .... I choose this way:

paste <(zgrep -m 1 "#CHROM" input.vcf.gz|cut -f10-|tr "\t" "\n") <(sort -k1,1g snp_counts.txt|cut -f2) > result.txt

More Speed?

More speed is possible by increasing the number of parallel processes with the -j option. More parallel processes need more memory. So it might be necessary to adopt the chunk size (-N) or one can run out of memory.

The more parallel processes we start the more important it is to get the data fast enough. So decompressing the vcf file might become a critical step here. One can try replacing zcat with pigz as it allows parallel decompressing.

Good luck :)

fin swimmer

ADD COMMENT
0
Entering edit mode

zcat input.vcf.gz|grep "#CHROM" -m 1= zgrep "#CHROM" -m 1 input.vcf.gz?

ADD REPLY
0
Entering edit mode

There was a time when I knew about zgrep. But somehow it faded away ...

Post edited.

I should start closing my posts with "And cpad0112 will tell us how one can shorten the syntax" :)

ADD REPLY
1
Entering edit mode

@ finswimmer you are one of the fast and efficient responders here. In general, I get/understand things only after one of you post a solution

ADD REPLY
0
Entering edit mode

This is incredible! Thanks much Fin! I will implement this and all given suggestions and let you know :-)

ADD REPLY
0
Entering edit mode

Hi! Just wanted to update that, I have followed these steps on vcf file of 21000000 records and 10390 samples using "gnu parallel" and "mawk". I put such three commands (for three different conditions; to count 0/1, 1/1 and ./. separately for all 10390 samples) to run on 19th July and it's still running. I ran it like this in one bash file: (samples in my vcf file are "TCGA" barcodes)

# for 0/1
zcat input.vcf.gz | grep -v "##" | cut -f10- | grep -v "TCGA" | parallel --pipe -N200000 -j12 'mawk -f count.awk' | mawk -v OFS="\t" -f sum.awk > snp_counts1.txt & 

#for 1/1
zcat input.vcf.gz | grep -v "##" | cut -f10- | grep -v "TCGA" | parallel --pipe -N200000 -j12 'mawk -f count.awk' | mawk -v OFS="\t" -f sum.awk > snp_counts2.txt &

#for ./.
zcat input.vcf.gz | grep -v "##" | cut -f10- | grep -v "TCGA" | parallel --pipe -N200000 -j12 'mawk -f count.awk' | mawk -v OFS="\t" -f sum.awk > snp_counts3.txt &

I made "count.awk" file respective to each command, i.e for "0/1", if($i ~ 0/1)..similarly for ./. and 1/1.

I checked, the program is running and it is going to be three days. I was expecting it to finish by now. Please let me know if you have any comments. Thanks much!

ADD REPLY
0
Entering edit mode

Hello DK,

have it finished now? Was the result as expected?

What are you trying to do with grep -v "TCGA"?

Instead of running these 3 commands in parallel one could modify the awk script to count the genotypes in one run.

For testing you don't need to go through the whole file. Limit it to a number of variants and/or samples. you could use the -m option in grep to limit the number variants. The number of samples can be limited by the column numbers used in cut -f.

fin swimmer

ADD REPLY
0
Entering edit mode

Thanks for the reply Fin! I tested the approach on a very small sample set like 20 and it was very quick; basically I just wanted to see if the method works as I was convinced with the speed of the method from your method description and the results.

The script is still running since 19th July; since the method works and because I can see on the cluster where I have put it to run that the script is running, I think the results will come but curious to see how much more time it could take.

grep -v "TCGA" is to avoid the "TCGA..." barcode of the samples and count only the genotypes.

Could you please let me know how can I run these three commands in one command line? I would really appreciate.

Thanks, DK

ADD REPLY
0
Entering edit mode

Could you please let me know how can I run these three commands in one command line?

Modify count.awk to this (adopt the genotypes to your needs):

{for(i=1; i<=NF; i++) { 
    sample[i];
    if($i ~ /1\|1/) var_hom[i]++; 
    else if($i ~ /0\|1/) var_het[i]++; 
    else if($i ~ /\.\|\./) var_na[i]++; 
 } 
}

END {for (id in sample) {print id, var_hom[id], var_het[id], var_na[id]}}

And sum.awk to this:

{
sample[$1];
var_hom[$1] += $2;
var_het[$1] += $3;
var_na[$1] += $4;
} 
END {for (id in sample) {print id, var_hom[id], var_het[id], var_na[id]}}

I'm not fully satisfied with this solution. On my test dataset it needs around 14 minutes. I had expected more speed :(

grep -v "TCGA" is to avoid the "TCGA..." barcode of the samples and count only the genotypes.

I still didn't get this. With this you are skipping line that contain 'TCGA'. Do you want to remove the headerline which contains the sample name? Than just change grep -v "##" in your code to grep -v "#"

fin swimmer

ADD REPLY
0
Entering edit mode

Hey DK,

what is your script doing now?

I've found a way to increase the speed of the above script. This is how count.awk now looks like:

{for(i=1; i<=NF; i++) { 
    if($i ~ /0\|0/) continue
    else if($i ~ /0\|1/) var_hom[i]++; 
    else if($i ~ /1\|1/) var_het[i]++; 
    else if($i ~ /\.\|\./) var_na[i]++; 
 } 
}

END {for(id=1; id<=NF; id++) {print id, var_hom[id], var_het[id], var_na[id]}}

and sum.awk:

{
var_hom[$1] += $2;
var_het[$1] += $3;
var_na[$1] += $4;
} 
END {for (id in var_hom) {print id, var_hom[id], var_het[id], var_na[id]}}

First I've remove saving the column id's in a separate variable. There's no need for it, as we just count them. This allow decreases run time from 14 to 10 minutes!

Next thing is to bring the check of the genotype in an decreasing order of there frequency. Most samples will have genotype 0|0. If a sample have 0|0 we can go in immediately to the next column.

Now the run time is around 7 minutes.

I love these run time challenges. :)

fin swimmer

ADD REPLY
0
Entering edit mode

Hey Fin! Your solutions really help and are very helpful! I tried the new solution you provided without "parallel" and it works, because I think there is something wrong with the parallel usage I have in my script I mentioned in the reply above; because the script is still running since 19th July on 21000000 and 1039o samples, and to my surprise the cluster where it is running does not show any error and says it is using some 64GB ram with 54 threads in use. I am just not able to understand why the script does not finish when there is no error. Now, what I am going to do is try the same script with parallel on a 50sample vcf file and see what happens. I think there is something wrong because, if I had run a normal command line with same commands with mawk it would have finished way by now. I will let you know how my experiment goes and in the meantime if you could tell me what you think may be going wrong with my script above, especially with parallel?

I still didn't get this. With this you are skipping line that contain 'TCGA'. Do you want to remove the headerline which contains the sample name? Than just change grep -v "##" in your code to grep -v "#"

Yes, I could do this too.

Thanks! DK

ADD REPLY
2
Entering edit mode
6.4 years ago

This depends on implementation details of the programs you are calling with your bash script.

You should time ("profile") the components of your script on a smaller test dataset. Most likely, some steps will be very fast and you don't need to worry about them at all, and at least one step will be... not so fast. You can then look for faster replacements for the problematic components (cough rhymes with 'talk' ahem).

In this particular case, as another commenter has pointed out, the outer for loop is the biggest problem of all. Look for a way to avoid repeating every step 9991 times...

ADD COMMENT
2
Entering edit mode
6.4 years ago
ATpoint 85k

I have no script ready for this but here are some inspirations:

You could index your file with Tabix. This allows fast retrieval of e.g. variants per chromosome. Write a script that subsets the vcf.gz per chromosome and pipe the output to your SNP-count script. To be efficient, I would use GNU parallel to run this for all chromosomes in parallel, given you have enough cores for this. Could be something like:

## Index:
tabix -p vcf big.vcf.gz
## Get entries per chromosome, including VCF header (-h) in parallel:
cat chromosomes.txt | parallel "tabix -h big.vcf.gz {} | ./your_script.sh"

The chromosomes.txt would have the 1-based range for each chromosome in the format:

chr1:start-end

chr2:start-end

<(...)

Also replace awk by mawk, which is considerably faster.

ADD COMMENT
1
Entering edit mode
6.4 years ago

I worked a bit more on this with the help of the 1000 Genomes vcf genotype file for chromosome 1. This contains about 2500 columns and 6 500 000 rows. This here runs about 25 minutes on my laptop.

zcat ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz|cut -f10-|grep -v "#"|awk '{for(i=1; i<=NF; i++) {if($i !~ "0|0") {j[i]++}}} END {for (k in j) print j[k]}'

fin swimmer

ADD COMMENT
2
Entering edit mode

You should consider switching from awk to mawk for operations on large datasets. The increase in speed is amazing. Not sure if that holds true in this example because of the decompression at the beginning that might be the limiting factor, but in general, give mawk a try. You'll not regret it.

ADD REPLY
1
Entering edit mode

Wooohaaaa! mawk! Running the above code with mawk melt down the runtime to 17 minutes. Amazing!

But mawk handles list or looping over lists in another way. The output is not sorted in the same way. We have to take care about this.

I've found a way to decrease the runtime to <7 minutes. I'll prepare the answer in some minutes.

fin swimmer

ADD REPLY
0
Entering edit mode

Thank you all! Great advices! I will implement these and let you know the result. Thanks much again!

ADD REPLY

Login before adding your answer.

Traffic: 1641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6