I have a zebrafish SNP VCF file at hand; the genotype fields contain information for six strains. However, I'm only interested in the TA and TUB strains. How can I select just these two columns?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AB TLF TUB TUG WKB WKG
chr10 587 . G A,C 106 PASS AF1=0.9999;CI95=0.5833,1;DP=9;DP4=0,0,1,5;FQ=-29.7;MQ=60 GT:PL:GQ 1/1:35,3,0,35,3,35:39 1/1:0,0,0,0,0,0:36 1/1:44,6,0,44,6,44:42 1/1:45,11,8,37,0,34:39 1/1:0,0,0,0,0,0:36 1/1:26,3,0,26,3,26:39
I tried to use vcf-subset, but it doesn't work; it produces a "multiple read header error".
Or can I simply use some bash/Python to manipulate the file?
Also, what if I want to filter on the genotype, say, to discard the SNPs where the genotype for both TA and TUB is "0/0"? What should I do? Thanks.
EDIT:
Thank you, guys.
The problem is actually quite silly. The VCF file I got contains multiple headers. It seems my collaborator simply concatenated several VCF files using cat instead of merging them, so I got errors when processing it with both vcftools' vcf-subset and Python.
Once that is fixed, both Python and vcftools can easily solve this problem.
So learn from my mistake: double-check the original VCF file before any processing.
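For anyone who ends up with the same kind of file, here is a minimal Python sketch of one way to clean up a VCF that was glued together with cat. The function name drop_repeated_headers is made up for illustration. It assumes every concatenated piece has the same sample columns in the same order, and it stops with an error if that is not the case. It also does not re-sort the records, so a proper merge with a dedicated tool may still be the better fix.

```python
def drop_repeated_headers(lines):
    """Yield VCF lines, keeping only the first header block.

    Hypothetical helper: assumes all concatenated pieces share an
    identical #CHROM line (same samples, same order).
    """
    first_chrom_line = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#CHROM"):
            if first_chrom_line is None:
                first_chrom_line = line
                yield line
            elif line != first_chrom_line:
                # Different sample columns: dropping the header would
                # silently mislabel genotypes, so refuse to continue.
                raise ValueError("concatenated pieces have different #CHROM lines")
        elif line.startswith("##"):
            # Keep meta-information lines only from the first header block.
            if first_chrom_line is None:
                yield line
        elif line:
            yield line  # data record, passed through unchanged
```

It can be fed an open file handle, for example by looping over drop_repeated_headers(open("file.vcf")) and printing each line.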
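It's probably going to be fastest to script it up. You can also use awk '{print $your.column}' file.vcf, where $your.column stands for the column numbers you want to keep (the first nine columns are the fixed VCF fields, so the first sample is column 10).
Since you are working with large files, you should consider taking a look at a Unix tutorial.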
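If you go the scripting route, a minimal Python sketch could look like the one below. It picks the sample columns by name from the #CHROM line rather than by position. The function name subset_samples is invented for this example, and it assumes a well-formed, tab-delimited VCF with a single header.

```python
FIXED_COLUMNS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT

def subset_samples(lines, wanted):
    """Yield VCF lines keeping only the sample columns named in `wanted`.

    Hypothetical helper: assumes tab-delimited input with one header.
    """
    keep = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
        elif line.startswith("#CHROM"):
            header = line.split("\t")
            samples = header[FIXED_COLUMNS:]
            missing = [s for s in wanted if s not in samples]
            if missing:
                raise ValueError(f"samples not found in header: {missing}")
            # Search only among the sample columns, after the fixed fields.
            keep = list(range(FIXED_COLUMNS)) + [
                header.index(s, FIXED_COLUMNS) for s in wanted
            ]
            yield "\t".join(header[i] for i in keep)
        elif line:
            if keep is None:
                raise ValueError("data line found before the #CHROM header")
            fields = line.split("\t")
            yield "\t".join(fields[i] for i in keep)
```

The names passed in `wanted` have to match the sample names on the #CHROM line exactly, otherwise the sketch raises an error instead of guessing.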
Another approach is to use Galaxy - take a look at tutorials for text parsing
Thanks, but it seems egrep -v '0/0s+0/0' doesn't work. Actually, I don't know how to grep for such a complex pattern.
It would probably be worth figuring out where the vcf-subset error is coming from, as it suggests you have a poorly formatted VCF file.
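One likely reason a pattern like that fails on the example line in the question: each genotype is followed by its PL and GQ values (e.g. 1/1:35,3,0,35,3,35:39), so two genotypes are never separated by whitespace alone, and the two samples of interest need not sit in adjacent columns. Parsing the GT field per sample is more robust than grep. Below is a rough Python sketch; the function names genotype and drop_all_hom_ref are made up, and it assumes a tab-delimited VCF with a single header and a GT key in the FORMAT column.

```python
def genotype(sample_field, format_field):
    """Return the GT value of one sample column, e.g. '0/0'."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Treat phased (0|0) and unphased (0/0) genotypes alike.
    return values[keys.index("GT")].replace("|", "/")

def drop_all_hom_ref(lines, samples):
    """Yield VCF lines, skipping records where every sample in
    `samples` has genotype 0/0. Hypothetical helper for illustration."""
    cols = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line
        elif line.startswith("#CHROM"):
            header = line.split("\t")
            # Sample columns start after the nine fixed VCF fields.
            cols = [header.index(s, 9) for s in samples]
            yield line
        elif line:
            if cols is None:
                raise ValueError("data line found before the #CHROM header")
            fields = line.split("\t")
            gts = [genotype(fields[i], fields[8]) for i in cols]
            if not all(gt == "0/0" for gt in gts):
                yield line
```

A record is dropped only when all the listed samples are 0/0; if just one of them is 0/0, the record is kept.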