I am running a minor allele frequency (MAF) filter in the cloud as part of my analysis:
vcftools --recode --recode-INFO-all --gzvcf /path/to/input.vcf --maf 0.01 --out output.maf.vcf > stdout.out
This is taking very long (1 hour for a 30 GB Chr1 file) on a c5.4xlarge instance. I considered using threads, chunking, or some other way of subsetting the data, but ran into trouble implementing any of them. I read through the VCFtools documentation and could not find any threading or chunking option for this command.
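To make the chunking idea concrete, here is a rough sketch of what I had in mind, not something I have working. It only prints one command per window and runs nothing. The helper `make_regions`, the contig name, the chromosome length, and the window size are all placeholders I made up for illustration, and it assumes the input is bgzip-compressed with a tabix index, which I have not confirmed for my file.

```shell
#!/usr/bin/env bash
# Sketch only: split one chromosome into fixed-size windows and print one
# filtering command per window. Nothing is executed here.

# Hypothetical helper: emit "chrom:start-end" regions covering 1..length.
make_regions() {
  local chrom=$1 length=$2 step=$3
  local start=1 end
  while [ "$start" -le "$length" ]; do
    end=$((start + step - 1))
    if [ "$end" -gt "$length" ]; then end=$length; fi
    echo "${chrom}:${start}-${end}"
    start=$((end + 1))
  done
}

# Placeholder values (assumptions): contig name, length, and window size.
CHROM=1
CHROM_LEN=250000000
WINDOW=50000000

for region in $(make_regions "$CHROM" "$CHROM_LEN" "$WINDOW"); do
  # Assumes /path/to/input.vcf is bgzip-compressed and tabix-indexed.
  echo "tabix -h /path/to/input.vcf $region | vcftools --vcf - --maf 0.01 --recode --recode-INFO-all --stdout > chunk.$region.vcf"
done
```

The printed commands could then be run side by side, one per core, and the per-window outputs concatenated afterwards. One thing I am unsure about is how to handle records that overlap a window boundary, since they may show up in two chunks.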
Another approach I considered was decompressing the gzipped file, writing only the genotype information to a new VCF file, and then filtering on MAF. However, this does not seem like an efficient way to perform a MAF filter step.
Is there anything I am not considering that would speed up this process?
Thank you for your consideration.