Question

vcf file column name error

4.1 years ago

evafinegan • 0

Hello,

I have a VCF file in which none of the SNPs has an entry in the ID column, so I added unique IDs manually using:

awk '{OFS="\t"} NR<67 {print $0;next} {{$3=$1"_"$2} print}' sample.vcf > out.vcf

but it also changed the column name from ID to #CHROM_POS. Now I am getting an error

Error in x@fix[, "ID"] : subscript out of bounds

in the downstream analysis. I think it's the replaced column name that's causing the error. Is there a way to keep the column name as ID in the awk command? Thank you!

sequencing • 976 views

score 0 · Answer 1 · 2021-02-18

4.1 years ago

Pierre Lindenbaum 166k

I have a vcf file and I did not have any ID for each of the SNP in that column.

bcftools annotate

Usage:   bcftools annotate [options] <in.vcf.gz>
(...)
   -I, --set-id [+]<format>       set ID column, see man page for details
(...)
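
As a rough sketch of how that option might be used: the format string below is the example given in the bcftools manual, and the leading `+` means only missing IDs are filled in. The wrapper function name `set_missing_ids` is hypothetical and only there to keep the command reusable.

```shell
# Hypothetical helper: fill in missing IDs as CHROM_POS_REF_ALT with bcftools.
# The leading + keeps any ID that is already present in the file.
# The backslashes stop the underscore from being read as part of the tag name.
set_missing_ids() {
  bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' "$1"
}
```

It could then be called as `set_missing_ids sample.vcf > out.vcf`. Including REF and ALT in the ID makes collisions less likely than CHROM and POS alone.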

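If you really want awk, skip the header by matching the leading # instead of counting lines (your NR<67 test did not cover the #CHROM line, which is why its third field was overwritten):

awk 'BEGIN {FS=OFS="\t"} /^#/ {print;next} {$3=sprintf("%s_%s",$1,$2); print}' sample.vcf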
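
To see the effect on a small case, here is a minimal sketch run against a made-up two-record VCF (the file names `demo.vcf` and `demo.out.vcf` and the records are invented for illustration). Setting both FS and OFS to a tab is an assumption added here so that fields containing spaces are not re-split.

```shell
# Build a tiny two-record VCF (hypothetical data) with tab-separated columns.
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf '1\t100\t.\tA\tG\t.\t.\t.\n' >> demo.vcf
printf '2\t200\t.\tC\tT\t.\t.\t.\n' >> demo.vcf

# Lines starting with # pass through untouched; data lines get CHROM_POS as ID.
awk 'BEGIN {FS=OFS="\t"} /^#/ {print; next} {$3=sprintf("%s_%s",$1,$2); print}' demo.vcf > demo.out.vcf

# Third field of the column header line, then the third field of each record.
grep '^#CHROM' demo.out.vcf | cut -f3
grep -v '^#' demo.out.vcf | cut -f3
```

The column header should still read ID, while the two records carry 1_100 and 2_200.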

Thank you! I used awk and now it gives this error: ID column contains non-unique names

ADD REPLY • link 4.1 years ago by evafinegan • 0

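Because the CHROM and POS columns alone are not enough: more than one record can share the same chromosome and position, so the generated IDs collide. Try appending the line number as well: $3=sprintf("%s_%s_%d",$1,$2,NR)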
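
A quick way to check this, sketched on an invented file with two records at the same position (`dup.vcf` and `dup.out.vcf` are hypothetical names): `sort | uniq -d` lists any ID that occurs more than once.

```shell
# Two hypothetical records at the same CHROM and POS with different ALT alleles.
printf '##fileformat=VCFv4.2\n' > dup.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> dup.vcf
printf '1\t100\t.\tA\tG\t.\t.\t.\n' >> dup.vcf
printf '1\t100\t.\tA\tT\t.\t.\t.\n' >> dup.vcf

# CHROM_POS alone: both records get the same ID, so uniq -d reports it.
awk 'BEGIN {FS=OFS="\t"} /^#/ {print; next} {$3=sprintf("%s_%s",$1,$2); print}' dup.vcf \
  | grep -v '^#' | cut -f3 | sort | uniq -d

# Appending NR (the input line number) makes every ID distinct.
awk 'BEGIN {FS=OFS="\t"} /^#/ {print; next} {$3=sprintf("%s_%s_%d",$1,$2,NR); print}' dup.vcf > dup.out.vcf

# Prints nothing when all IDs are unique.
grep -v '^#' dup.out.vcf | cut -f3 | sort | uniq -d
```

Note that NR counts header lines too, so the suffix is the line number in the file rather than the record number; that is harmless for uniqueness.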

ADD REPLY • link 4.1 years ago by Pierre Lindenbaum 166k