6.5 years ago
neeraj4biotech
I have a VCF file and have annotated it with SnpEff. I need to group the SNPs by the gene they belong to, e.g. SNPs x, y and z belong to gene w, and so on for every gene.
Are you trying to extract them into separate files per gene or are you trying to run a burden test or something sophisticated?
Thanks, Vivek, for the quick response. I have a VCF file and a BED/GFF file as input. What I actually want is a separate file per gene.
There are more elegant solutions if you can do some scripting but here's a crude workflow:
If you have one line per gene in the bed file, you can initially split the BED file into one file per gene like this:
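For anyone who would rather do this step in Python, here is a minimal sketch of the same idea. It assumes a tab-separated BED file with the gene name in column 4 and gene names that are safe to use as file names; the function name `split_bed_by_gene` and the output layout are made up for illustration.

```python
import os
from collections import defaultdict


def split_bed_by_gene(bed_path, out_dir):
    """Write one BED file per gene; the gene name is assumed to be in column 4."""
    os.makedirs(out_dir, exist_ok=True)
    lines_by_gene = defaultdict(list)
    with open(bed_path) as bed:
        for line in bed:
            # Skip blank lines and the usual BED header/comment lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # no name column, so nothing to group on
            lines_by_gene[fields[3]].append(line.rstrip("\n") + "\n")

    paths = {}
    for gene, lines in lines_by_gene.items():
        # Hypothetical naming scheme: <out_dir>/<gene>.bed
        path = os.path.join(out_dir, gene + ".bed")
        with open(path, "w") as out:
            out.writelines(lines)
        paths[gene] = path
    return paths
```

Calling `split_bed_by_gene("genes.bed", "per_gene_bed")` would return a dictionary mapping each gene name to the file written for it. Genes with several lines (e.g. one per exon) end up in a single file.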
Depending on the number of genes, you might produce a lot of files here.
Rename the split files so that they have a .bed extension.
Then use tabix to split your VCF, querying it once per gene BED file (tabix needs the VCF to be bgzip-compressed and indexed).
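If tabix is not at hand, the split can also be sketched in plain Python. This is not the tabix approach itself but a simple substitute: it assumes an uncompressed VCF, a tab-separated BED file with the gene name in column 4, and it assigns a variant to a gene by its POS only. The function names and the `<gene>.vcf` naming are made up for illustration, and the linear scan over intervals will be slow on large inputs.

```python
import os
from collections import defaultdict


def read_gene_intervals(bed_path):
    """Return {chrom: [(start, end, gene), ...]} from a BED file (gene in column 4)."""
    intervals = defaultdict(list)
    with open(bed_path) as bed:
        for line in bed:
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#") or len(fields) < 4:
                continue
            intervals[fields[0]].append((int(fields[1]), int(fields[2]), fields[3]))
    return intervals


def split_vcf_by_gene(vcf_path, bed_path, out_dir):
    """Write <out_dir>/<gene>.vcf for every gene that has at least one variant."""
    intervals = read_gene_intervals(bed_path)
    header, records = [], defaultdict(list)
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                header.append(line)  # keep the header for every output file
                continue
            fields = line.split("\t")
            if len(fields) < 2:
                continue
            chrom, pos = fields[0], int(fields[1])
            for start, end, gene in intervals.get(chrom, ()):
                # BED is 0-based, half-open; VCF POS is 1-based.
                if start < pos <= end:
                    records[gene].append(line)

    os.makedirs(out_dir, exist_ok=True)
    for gene, lines in records.items():
        with open(os.path.join(out_dir, gene + ".vcf"), "w") as out:
            out.writelines(header + lines)
    return {gene: len(lines) for gene, lines in records.items()}
```

The return value is the number of variants written per gene, which is handy as a quick sanity check. A variant that falls in two overlapping genes is written to both files.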
It always helps if you can post some example data. Use datamash to group by gene and collapse all SNPs.
output:
input:
Install datamash either from here or from your distribution's repositories (on Debian-based systems: sudo apt install datamash -y; with conda: conda install datamash -y).
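If you would rather not install datamash, the same group-and-collapse step can be sketched in Python. This is a rough equivalent, not the datamash command itself. It assumes the VCF was annotated by a SnpEff version that writes the `ANN` INFO field, where the gene name is the fourth `|`-separated subfield; older SnpEff output uses an `EFF` field and would need different parsing. The function name `group_snps_by_gene` is made up for illustration.

```python
from collections import defaultdict


def group_snps_by_gene(vcf_lines):
    """Return {gene: [snp, ...]} using the gene names in SnpEff ANN annotations."""
    genes = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        chrom, pos, snp_id, info = fields[0], fields[1], fields[2], fields[7]
        # Fall back to chrom:pos when the ID column is empty (".").
        label = snp_id if snp_id != "." else f"{chrom}:{pos}"
        for entry in info.split(";"):
            if not entry.startswith("ANN="):
                continue
            # One SNP can carry several comma-separated annotations.
            for annotation in entry[len("ANN="):].split(","):
                parts = annotation.split("|")
                if len(parts) > 3 and parts[3] and label not in genes[parts[3]]:
                    genes[parts[3]].append(label)
    return dict(genes)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python group_snps.py < annotated.vcf
    for gene, snps in sorted(group_snps_by_gene(sys.stdin).items()):
        print(gene + "\t" + ",".join(snps))
```

Each output line holds a gene name, a tab, and a comma-separated list of its SNPs. A SNP annotated against more than one gene is listed under each of them.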
Neeraj, can you post a few lines of the data? I know it should be a standard VCF, but it still helps!