Hello again,
here are some more improvements on my previous code, along with an explanation. On my laptop with 4 cores and 16 GB RAM, the run time is under 7 minutes for the test dataset.
The Data
For testing I've downloaded the vcf file for chromosome 1 with genotypes from the 1000 Genomes Project. This file is 1.1 GB in size and contains around 6,500,000 variants in around 2,500 samples.
The columns look like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 [...]
1 10177 rs367896724 A AC 100 PASS AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL GT 1|0 0|1
1 10235 rs540431307 T TA 100 PASS AC=6;AF=0.00119808;AN=5008;NS=2504;DP=78015;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0.0051;AA=|||unknown(NO_COVERAGE);VT=INDEL GT 0|0 0|0
For the code it is important to note that:
- the sample data starts in column 10
- the genotypes are given in phased format, e.g. 0|0, 0|1
- a non-ALT genotype looks like this: 0|0
The Task
Create a list with the total count of variants for each sample in the vcf file. The output should look like this:
NA21141 133382
NA21142 131027
NA21143 130295
NA21144 131932
The Code
$ zcat input.vcf.gz | grep -v "^#" | cut -f10- | parallel --pipe -N200000 -j4 'mawk -f count.awk' | mawk -v OFS="\t" -f sum.awk > snp_counts.txt
For readability I've put the awk code in files instead of writing it inline.
count.awk
{ for(i=1; i<=NF; i++) { if($i !~ /0\|0/) snp[i]++ } }
END {for (id in snp) print id, snp[id]}
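A quick sanity check of the counting logic on toy data (plain awk stands in for mawk here, and the program is inlined; note the escaped | in the pattern, since an unescaped /0|0/ would match any field containing a 0):

```shell
# Toy input: 3 variants x 2 samples.
# Sample 1 has two non-0|0 genotypes, sample 2 has one.
printf '0|1\t0|0\n1|1\t0|1\n0|0\t0|0\n' \
  | awk '{ for(i=1; i<=NF; i++) { if($i !~ /0\|0/) snp[i]++ } }
         END { for (id in snp) print id, snp[id] }' \
  | sort
```

The sort at the end is only there because awk's for-in order is unspecified.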
sum.awk
{snp[$1] += $2} END {for (id in snp) print id, snp[id]}
Let's go through the command line:

zcat input.vcf.gz: decompresses the vcf file and prints the data to stdout.

grep -v "^#": gets rid of the header lines.

cut -f10-: as we know that the sample data starts at column 10, we can discard the columns before it.

parallel --pipe -N200000 -j4 'mawk -f count.awk': splits the list of variants into chunks of 200,000 lines and runs the counting of variants per sample in 4 parallel processes.

mawk -v OFS="\t" -f sum.awk: combines the results of the parallel counting by summing up the values for each sample.
What are the awk scripts doing?
{ for(i=1; i<=NF; i++) { if($i !~ /0\|0/) snp[i]++ } }
We iterate over each sample column and check whether it contains 0|0. If it doesn't, we assume that the sample has a variant here and increment the counter whose index is the column number.
END {for (id in snp) print id, snp[id]}
At the end, we print each column number together with the number of variants we have found.
{snp[$1] += $2}
We sum up the count for each column number.
END {for (id in snp) print id, snp[id]}
And at the end we again print each column number with its number of variants.
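To see the merge step in isolation, here are two chunks' partial counts for the same two columns, summed by column id (a toy run of the sum.awk program, inlined with plain awk):

```shell
# Chunk 1 reported (1: 100, 2: 90); chunk 2 reported (1: 50, 2: 60).
printf '1 100\n2 90\n1 50\n2 60\n' \
  | awk -v OFS='\t' '{ snp[$1] += $2 } END { for (id in snp) print id, snp[id] }' \
  | sort
```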
One last step is needed: we have to replace the column numbers with the sample names. There are many ways to do this with grep, awk, .... I chose this way:
paste <(zgrep -m 1 "#CHROM" input.vcf.gz | cut -f10- | tr "\t" "\n") <(sort -k1,1g snp_counts.txt | cut -f2) > result.txt
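Here is a toy run of that renaming step, with hypothetical temp files standing in for the real header and snp_counts.txt (plain files instead of bash-only process substitution):

```shell
# Header line with two sample columns, and counts keyed by column number,
# in scrambled order as the parallel pipeline may produce them.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\n' > header.tmp
printf '2\t90\n1\t100\n' > counts.tmp
cut -f10- header.tmp | tr '\t' '\n' > names.tmp   # sample names, one per line
sort -k1,1g counts.tmp | cut -f2 > values.tmp     # counts ordered by column number
paste names.tmp values.tmp
rm -f header.tmp counts.tmp names.tmp values.tmp
```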
More Speed?
More speed is possible by increasing the number of parallel processes with the -j option. More parallel processes need more memory, so it might be necessary to adapt the chunk size (-N), or one can run out of memory.
The more parallel processes we start, the more important it is to feed in the data fast enough, so decompressing the vcf file might become the critical step. One can try replacing zcat with pigz, as it allows parallel decompression.
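The swap is a drop-in change at the head of the pipeline. Since pigz may not be installed everywhere, this sketch uses gzip -dc in the same role; pigz -dc input.vcf.gz would replace zcat input.vcf.gz verbatim:

```shell
# Compress a toy "vcf" and stream it back out, as zcat/pigz -dc would.
printf 'line1\nline2\n' | gzip > toy.vcf.gz
gzip -dc toy.vcf.gz | wc -l
rm -f toy.vcf.gz
```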
Good luck :)
fin swimmer
Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them
But I think you should write a script to avoid reading the data file thousands of times.
At the very least just leave the file decompressed...
Use the tools meant for vcf parsing (IMO).

@OP: I don't understand why we see TCGA in your output, because there is a grep -v "TCGA". What is that awk "{print \$$i}"?

@OP and @Pierre! OP's code and output are incorrect (IMO) for the following reasons.
For example, in OP's loop (as pointed out in your post) there are several problems (IMO). The last command (wc -l) counts lines and outputs a number, so the variable sample is a number, not a string, and the grep is done only for unphased genotypes.

The final grep -v "TCGA" prints the columns (from the awk field) whose names do not match the string TCGA. That is not a count, so the snp_count variable should hold columns, not strings.

As I understand it, OP is expanding a bash variable inside awk with {print \$$i}. There are other ways to pass a bash variable into awk.

The line echo -e "$sample\t$snp_count\n" should, as per the code, output the number first (the wc -l result in the sample variable) and then the columns without the sample string TCGA (the grep -v result). But in OP's post the output is the other way around, and it is also incorrect, as it should be printing columns instead of strings. Something is wrong with the code or the output. If OP could state clearly what they want, with example data and expected output, that would be more helpful.
Thanks! Made a mistake in putting the code here, I have corrected it.
Decompressing the file will help. Setting your system's locale to ASCII with LC_ALL=C should speed things up too. You might also be running out of memory.

I agree with @Shenwei356 that you should write a custom VCF parser to avoid reading the files so many times. You're only interested in reading the letter once; why open the envelope so many times?
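LC_ALL=C can be set per command rather than globally; a minimal illustration with sort (the speedup on real data comes from byte-wise instead of locale-aware comparisons):

```shell
# Byte-wise collation: uppercase sorts before lowercase in the C locale.
printf 'b\nA\na\n' | LC_ALL=C sort
```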
Hello DK,
I don't have such a big vcf file for testing here, so I don't know how my suggestions will perform.

First, have a look at the output of bcftools stats; maybe it already fits your needs.
Another way, which I have tested with a small dataset here, is to extract the genotype for each sample and replace it by 0 if it's hom REF or unknown, or by 1 otherwise, to indicate that there is a variant. You get a matrix where each column stands for one sample and each row for one variant. One can then sum up each column to get the number of variants in each sample.
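A minimal sketch of that matrix idea in awk, on toy genotypes (treating 0|0 and missing .|. as 0 and everything else as 1, and summing the columns in one pass):

```shell
# 3 variants x 2 samples: the column sums are the variants per sample.
printf '0|1\t0|0\n1|1\t.|.\n0|0\t0|1\n' \
  | awk '
      { if (NF > n) n = NF
        for (i = 1; i <= NF; i++) sum[i] += ($i == "0|0" || $i == ".|.") ? 0 : 1 }
      END { for (i = 1; i <= n; i++) printf "%d%s", sum[i], (i < n ? "\t" : "\n") }'
```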
Using gnu parallel, as already suggested by the others, will improve the speed, for example if you do this per chromosome.

fin swimmer
DK: You now have multiple answers that look interesting. Once you have had a chance to test them (if you can do all, that would be great), please come back and validate the ones that work (and those that don't) by accepting the answers.

I will certainly. Thanks!