Question

skipping column names using awk

0

Entering edit mode

5.3 years ago

evelyn ▴ 230

Hello Everyone,

I have found posts to use awk, grep for skipping column names and perform an awk operation for the rest of the file. But I could not find it useful for what I am trying to do. I have merged three vcf files and translated the genotype numbers to calls using:

bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%TGT]\n' in.vcf | awk -v FS="\t" -v OFS="\t" '{for(i=6;i<=NF;i++) {split($i, gt, "/"); if(gt[1]==".") $i="-"; else if(gt[1]==gt[2]) $i=gt[1]; else $i="N";} print }' > out.vcf

Here, else $i="N" replaces the column names with N as well like this:

# [1]CHROM  [2]POS  [3]ID   [4]REF  [5]ALT  N   N   N

However, I want to keep those column names (based on my file names) for further analysis like this:

# [1]CHROM  [2]POS  [3]ID   [4]REF  [5]ALT  file1 file2 file3

I will appreciate your time and effort for any help. Thank you!

names • 1.7k views

ADD COMMENT • link updated 5.3 years ago by finswimmer 16k • written 5.3 years ago by evelyn ▴ 230

0

Entering edit mode

bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%TGT]\n' in.vcf | awk -v FS="\t" -v OFS="\t" '{for(i=6;i<=NF;i++) {split($i, gt, "/"); if(gt[1]==".") $i="-"; else if(gt[1]==gt[2]) $i=gt[1]; else $i="N";} print }' > out.vcf

Can you please explain what are you trying to achieve?

ADD REPLY • link 5.3 years ago by lakhujanivijay 5.9k

0

Entering edit mode

I merged vcf files and then called the genotypes. Then changed the 1/1 0/1 0/0 ./. genotype calls to bases using awk. It changes heterozygous bases to (0/1 or 1/0) to 'N' using else $i="N". But I want to keep the column names which are also changed to 'N'. Thank you!

ADD REPLY • link 5.3 years ago by evelyn ▴ 230

score 3 · Answer 1 · 2019-08-11

You can get awk to behave differently for different rows by using conditional statements based on either line number:

(NR==1) {do something} (NR>1) {do something else}

Or by a characteristic of the line:

$0~/^#CHROM/ {do something} $0!~/^#CHROM/ {do something else}

HOWEVER, I strongly recommend sticking with the canonical VCF format if at all possible. Parsing a VCF on your own is always more complex than it seems at first. For instance, do you have any multi-allelic sites? If so, you need to make sure that your output file shows the correct alt allele for each genotype (e.g. ref=A, alt=G,C, genotypes could be 0/1 meaning AG, or 0/2 meaning AC, or 1/2 meaning GC, etc.).

Additionally, since it looks as though you've merged VCFs and not gVCFs, please bear in mind that VCFs only report a genomic location if there is a variant in that individual, so you are susceptible to a missing data problem. When a variant is reported in A.vcf, but not in B.vcf, the merged file will record the variant as missing "./." for sample B. Does that mean there was insufficient coverage to make a call, or was there plenty of coverage and simply no variant reads?

score 1 · Answer 2 · 2019-08-11

1

Entering edit mode

5.3 years ago

finswimmer 16k

You can extract the header line from your original vcf and cut to the needed columns. Afterward you can combine it with you r file (which is no valid vcf file anymore!)

$ grep -m1 "^#CHROM" input.vcf | cut -f1-5,10- | cat - output.vcf > newfile

I absolute supports bari.ballew note to use a standardized format like vcf wherever possible and not switch to something custom if not absolutely needed.