vcf split by columns
6
0
Entering edit mode
8.0 years ago

Well I hope I can explain myself well,

The vcf files I got have in the INFO column all data together (AF, MLLD,SVTYPE, GENE...) split by "; "

when I put on excel and try to split in columns, the columns dont follow each other like it should be, because all data isnt equal.

Is any easy way to do this? so I dont need to fix it manually?

I hadnt done this file but Im wondering mostly all VCF are like that??? instead have on the head the requiered info? (precise, svtype, func....)

CHROM   POS ID  REF ALT QUAL    FILTER  INFO      FORMAT

chr1    68928   .   T   <CNV>   100.0   PASS    PRECISE=FALSE;SVTYPE=CNV;END=69412;LEN=484;NUMTILES=4;CONFIDENCE=0;PRECISION=2.04702;FUNC=[{'gene':'OR4F5'}]    GT:GQ:CN    ./.:0:2

chr1    871334  .   G   T   431.84  PASS    AF=0.488263;AO=104;DP=213;FAO=104;FDP=213;FR=.;FRO=109;FSAF=68;FSAR=36;FSRF=67;FSRR=42;FWDB=0.00178582;FXX=0;HRUN=2;LEN=1;MLLD=52.4929;QD=8.10962;RBI=0.00909038;REFB=-0.0299934;REVB=-0.00891325;RO=109;SAF=68;SAR=36;SRF=67;SRR=42;SSEN=0;SSEP=0;SSSB=0.0396347;STB=0.521838;STBP=0.536;TYPE=snp;VARB=0.0325404;OID=.;OPOS=871334;OREF=G;OALT=T;OMAPALT=T;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}]   GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:99:213:213:109:109:104:104:0.488263:36:68:67:42:36:68:67:42

Thanks

vcf • 8.5k views
ADD COMMENT
1
Entering edit mode

While the split can be done using sed etc you would have the same problem (of unequal # of columns) if you try to import the data into excel :)

ADD REPLY
0
Entering edit mode

Thanks genomax....

well..the fact is I dont want to have on excel, I thought use excel and get back my vcf file with the modify head and all INFO splited by columns

ADD REPLY
3
Entering edit mode
8.0 years ago

here's a bcftools alternative. if you are interested in a set of variables rather than all of them, bcftools query can help you if you play around with the %INFO/TAG variable:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AC\t%INFO/AF\n' file.vcf.gz

but if you need all the variables present in the INFO column, then you'll have to read the file first to know which fields are present in it, and then you'll have to query the file using those fields previously extracted. if the vcf is well defined you can extract them out of the file headers:

FIELDS=$(bcftools view -h file.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -f "%CHROM\t%POS\t%REF\t%ALT$FIELDS\n" file.vcf.gz

if not, you'll have to read the whole file looking for all the fields, and then query the file:

FIELDS=$(zgrep -v ^# file.vcf.gz \
| head -n 1000 \
| perl -ane '
while ($F[7] =~ /([^=;]+)=/g) { $fs{$1} = 1 }
END { for $f (sort keys %fs) { print "\\t\%INFO/$f" } }')
bcftools query -Hf "%CHROM\t%POS\t%REF\t%ALT$FIELDS\n" file.vcf.gz

the 1000 lines head is there just in case the file you're querying is too big, assuming that the first 1000 lines would contain all the present fields. you can remove it to be sure that you're working will absolutely all fields, but it can take time to process large files.

ADD COMMENT
0
Entering edit mode

thanks to all!

today that I have free day from work so I can try all your tips...thank you so much. Sadly yesterday my head wasn't in good shape :P

ADD REPLY
0
Entering edit mode

well I used Jorge'ś code, works like charm :)))

FIELDS=$(bcftools view -h result.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -f "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER$FIELDS\n" result.vcf.gz > RESULTSexcel.vcf

Just one small thing, well two....In my excel table doesnt appear the field where says what is each column, would be really nice to have that info :) . And dunno why (probably I wrote wrong st) the column FORMAT dont appear.

my head: CHROM  POS  ID  REF ALT QUAL  FILTER   INFO     FORMAT

I tried also:

FIELDS=$(bcftools view -h result.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -f "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FORMAT\t%FILTER$FIELDS\n" V42.vcf.gz > RESULTSSSexcel

Error: no such tag defined in the VCF header: INFO/FORMAT

Thanks for all the help!!!

ADD REPLY
0
Entering edit mode

the first one is my fault. in order to get the headers from bcftools query you have to add option -H:

FIELDS=$(bcftools view -h result.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -f "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER$FIELDS\n" result.vcf.gz > RESULTSexcel.vcf

the second one is your fault. firstly, you have to use the same vcf file name both at the $FIELDS definition and at the bcftools query; secondly, FORMAT is indeed not a column that bcftools query prints directly. it uses that column to print sample information such as genotype, depth,...

FIELDS=$(bcftools view -h V42.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -Hf "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FORMAT\t%FILTER$FIELDS\n" V42.vcf.gz > RESULTSSSexcel
ADD REPLY
0
Entering edit mode

thanks for the information but then whats happen with the column that has the information of homozygous (0:0)...hetero....?

How I can get that field? I mean after INFO I have two colums more, I asume Format is the first one, whats happens with the second above?

GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   

0/1:99:165:164:83:83:81:81:0.493902:31:50:52:31:31:50:52:31

THANKS!! (y gracias para que nos entendamos....que mi ingles...)

ADD REPLY
1
Entering edit mode

the FORMAT column just specifies what type of data do the samples columns contain. you can query any of that FORMAT column fields by calling them into brackets (everything is perfectly described in the bcftools webpage). for instance, if you want to get the genotypes for all samples you'll just have to add [ %GT] at the end of the query:

FIELDS=$(bcftools view -h V42.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -Hf "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FORMAT\t%FILTER$FIELDS[\t%GT]\n" V42.vcf.gz > RESULTSSSexcel
ADD REPLY
0
Entering edit mode

WELL, now I start to understand a bit more how this works, I had managed to even add a column on my own just with the gene info :)))

The only thing I dont know is how to make appear the column with 0/0 1/1 just appear the value zero (supose to be GT column name right?)

thanks

ADD REPLY
0
Entering edit mode

Hi! There is one more query related to this.I have more information under ANN (also comes under INFO), you can say sub-variables under ANN variable, how can I split them into separate columns by using bcftools query? The above code worked fine for 1st row of headers but now having problem by dealing with information under ANN. Also my file is a big file (3.88GB).

Thanks

ADD REPLY
0
Entering edit mode

Hi! There is one more query related to this.I have more information under ANN (also comes under INFO), you can say sub-variables under ANN variable, how can I split them into separate columns by using bcftools query? The above code worked fine for 1st row of headers but now having problem by dealing with information under ANN. Also my file is a big file (3.88GB).

Thanks

ADD REPLY
2
Entering edit mode
8.0 years ago

try to use GATK VariantsToTable https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php to convert your VCF to a tabular format.

ADD COMMENT
0
Entering edit mode

thanks Pierre, I will try it , hope so, my head is popping out today XD

ADD REPLY
1
Entering edit mode
8.0 years ago

I never open VCF data in Excel. Here is an R based solution using fread() that works fast on large VCF files. You just need to know how many lines the header takes up before the essential line #CHROM POS...etc.

require(data.table)
vcf.data <- fread("file.vcf", header=T, skip=10L, data.table=F, sep="\t")

To extract just the INFO column and split its elements by semicolon:

require(stringr)
info.df <- data.frame(do.call(rbind, strsplit(as.vector(vcf.data$INFO), split = ";")))
ADD COMMENT
1
Entering edit mode

thanks to all!

today that I have free day off work can try all your tips...thank you so much :))))

ADD REPLY
1
Entering edit mode
8.0 years ago
Vitis ★ 2.6k

For relatively small queries, for example, limited number of SNP sites, I usually use PyVCF, which is the python interface to VCF. Programing with PyVCF can give you almost everything from the VCF. However, it doesn't work very well with super-large multi-sample VCF files. The nice thing about PyVCF also lies you could embed the code in your python applications.

https://github.com/jamescasbon/PyVCF

ADD COMMENT

Login before adding your answer.

Traffic: 2299 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6