Entering edit mode
4.8 years ago
evelyn
▴
230
Hello,
I extracted a specific SNP
from vcf with multiple samples using tabix
and then I realized there are some sample names and their respective calls do not have a tab space between them and appear to be in one cell in excel. Is there a way in linux to solve this problem rather than doing it manually? e.g., these two calls have no tab space in vcf and in excel they appear in one cell rather than two cells:
0/0:10:11,0:11:379:0:0:0,-3.31,-34.44 0/0:9:9,0:9:337:0:0:0,-2.70,-30.6
Thank you!
VCF values are always separated by tabs, so the problem is probably related to copy-pasting from Linux to Windows. You could probably get around it by copying the file with some software like mobaXterm or WinSCP (rather than text copy-paste), or by doing the analysis on Linux (without using Excel). Can you explain what you're trying to achieve exactly?
I tried to view it in Text wrangler and then copy in excel but it's still the same. I want to see if there is any significant difference between reference and alternative calls between samples for one SNP. Traditionally, I copy the SNP information in excel, convert the calls to R (0/0), A (0/1) and H (1/1) and run a t-test.
You could use
bcftools query
to get what you need.bcftools query -f [%GT\t]\n
would get you all the GT information tab separated from the VCF. This result might be easier to work with on Excel.I don't see how a VCF was generated with mixed white spaces. Something probably went wrong during a copy-paste operation. For example, iTerm2 and TextWrangler (I think) have options to auto-convert tabs to spaces. Use the shell directly and skip any copy-paste operations for the time being.
I used bcftools query and it changed it to
C/C C/C C/C C/C
format. how can I change calls0/0
to R,0/1
to H and1/1
to A without changing vcf file format. I did it with awk earlier but it changes the file format.You probably used
%TGT
and not%GT
. The latter would give youFORMAT/GT
as-is.You want the information, the format does not matter. VCF files and Excel don't get along well. Use the tabular output from
bcftols query
, forget about retaining VCF format.Ultimately, I wanted the translated version of calls to do the significance test.
If you wish to have your cake and eat it too, you'll need the
%REF
and%ALT
fields, and then you'll need to look for corresponding REF/REF, REF/ALT and ALT/ALT combinations withawk
.Yes, I have REF and ALT fields. I wanted to change all the calls to R, H and A based on their respective REF and ALT fields. I tried awk but it changes to single letter REF and ALT calls only instead of R and A. Moreover, I am not sure how to deal with hetero calls.
e.g.,
ch1 109 . A C A A A A A A A/C C C C C . A A
However, I want it like:ch1 109 . A C R R R R R R H A A A A . R R
Compare each
NF>5
field to$3
and$4
. If$6
(for example) is$3/$3
, replace it with R. Do a similar thing to getH
andA
.