Hi all!
I'm trying to annotate several large TCGA variant-calling files with VEP. I run the annotations in parallel on a cluster, giving each job 200 GB and 10 cores, but apparently that isn't enough, so I thought of reducing the file size to speed up the jobs.
I have already split those .vcf files by chromosome and by their FILTER (either "PASS" or "."). I've also been following the recommendations in VEP's manual to make it run faster, but VEP still takes many days and doesn't finish the annotation even for small chromosomes like 22 with the resources mentioned above.
Has anyone tried splitting .vcf files by number of lines or by size to make them smaller without losing data, or used any alternative method?
I know that .vcf format is a bit finicky and simply removing the header, splitting with bash and adding back the header wouldn't be the best solution for this issue.
Thanks in advance~!
Hi Pierre!
Thanks for the quick answer. I've been taking a look and playing around with bedtools makewindows, but I believe it's not exactly what I need. Maybe I didn't explain my input well enough.
What I need to split, because of (possible) size limitations, are .vcf files obtained from TCGA, not a human genome that can be divided into equal-sized windows. With windows I would lose the position (POS) of each variant called, and VEP would not return the desired result, which is extra columns of information for each variant.
I have each .vcf separated by chromosome and I can also further do it by their FILTER ("PASS" or "."), but it doesn't seem to be enough to work correctly.
I'm looking for something akin to bash's
$split -n l/8 -a 1 chr.vcf chr.split_preffix
to divide each file into equal-sized ones, but I've been advised not to remove the header and paste it back after splitting with bash, as that can tamper with the .vcf and lead to information loss.
I've tried to find a tool that would split the .vcf in a similar fashion, taking the previous example and obtaining various files:
Maybe I'm using bedtools makewindows incorrectly, but in my case I cannot make windows of equal size, as each variant could be a SNP or an INDEL, and it also doesn't identify windows to create from my variants' positions.
I recently wrote https://lindenb.github.io/jvarkit/VcfSplitNVariants.html
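For anyone landing here later who wants to see the general idea of splitting by variant count, here is a minimal Python sketch. It is not the jvarkit implementation; the function name `split_vcf_by_count` and the output naming scheme are made up for illustration. It assumes an uncompressed VCF in which all header lines (starting with `#`) come before the records, and it copies the complete header, unmodified, into every chunk so that each output file stays a self-contained VCF.

```python
def split_vcf_by_count(vcf_path, out_prefix, variants_per_file):
    """Split a plain-text VCF into chunks of at most `variants_per_file` records.

    Hypothetical helper for illustration only. Every chunk receives the
    complete original header, and record lines are copied verbatim.
    Returns the list of chunk paths that were written.
    """
    if variants_per_file < 1:
        raise ValueError("variants_per_file must be >= 1")

    header = []   # all '#' lines, in their original order
    paths = []    # chunk files created so far
    out = None    # currently open chunk, if any
    count = 0     # records written to the current chunk

    with open(vcf_path) as src:
        for line in src:
            if line.startswith("#"):
                header.append(line)
                continue
            if not line.strip():
                continue  # skip stray blank lines
            if out is None or count == variants_per_file:
                # Start a new chunk and write the full header first.
                if out is not None:
                    out.close()
                path = f"{out_prefix}.{len(paths) + 1:05d}.vcf"
                out = open(path, "w")
                out.writelines(header)
                paths.append(path)
                count = 0
            out.write(line)
            count += 1

    if out is not None:
        out.close()
    return paths
```

Calling, for example, `split_vcf_by_count("chr22.vcf", "chr22.split", 10000)` would write `chr22.split.00001.vcf`, `chr22.split.00002.vcf`, and so on (the file names here are placeholders). Because records are never rewritten, POS, REF and ALT stay exactly as they were in the input. For real data, a dedicated tool such as the one linked above is the safer choice.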
Thanks Pierre, this has worked wonders!
After three weeks of different approaches, VEP has finally been able to annotate my (now split) input files and I'm starting to get the results I wanted.
Thanks for all the help~~
Please accept the answer so the question is marked solved on the website. To do that, click on the green check mark on the left side of the answer.
Yep, sorry! I was looking for a way to highlight exactly the VcfSplitNVariants comment, which was the final solution to my issue.
I don't understand that sentence.
Sorry, I didn't explain myself properly here!
As my current working files weren't genome sequence but variant-calling data, bedtools makewindows wasn't working properly.
Each line contains either a SNP or an INDEL, so creating small windows from the only available intervals (variants changing more than one base pair) would "duplicate" some lines instead of reducing the file size to launch smaller jobs for VEP to handle.
What I meant is that creating windows from a certain START - END position would alter the original information about the extent of that variant: the coordinates would be slightly different, and the extra information obtained from VEP (more columns) would be duplicated in those cases.
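To make the duplication point concrete, here is a small Python sketch of what happens when variants are assigned to fixed-size windows by overlap. The function name `windows_overlapping`, the window size and the example variants are all hypothetical. It assumes the usual VCF convention that POS is 1-based and that a variant covers `len(REF)` reference bases.

```python
def windows_overlapping(pos, ref, window_size):
    """Return the 0-based indices of fixed-size windows a variant overlaps.

    `pos` is the 1-based VCF POS and the variant is taken to cover
    len(ref) reference bases. Hypothetical helper for illustration only.
    """
    start = pos - 1               # 0-based, inclusive
    end = start + len(ref) - 1    # 0-based, inclusive
    return list(range(start // window_size, end // window_size + 1))


# Made-up example records: (CHROM, POS, REF, ALT)
variants = [
    ("22", 950, "A", "G"),       # SNP, entirely inside one window
    ("22", 999, "ACGT", "A"),    # deletion crossing a window boundary
    ("22", 1500, "C", "CTT"),    # insertion, one reference base
]

for chrom, pos, ref, alt in variants:
    hits = windows_overlapping(pos, ref, window_size=1000)
    print(chrom, pos, ref, alt, "-> windows", hits)
```

With 1000 bp windows, the deletion at POS 999 overlaps two windows, so a window-based split would emit that record twice, while splitting by record count writes each variant exactly once.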