skipping column names using awk
2
0
Entering edit mode
5.3 years ago
evelyn ▴ 230

Hello Everyone,

I have found posts to use awk, grep for skipping column names and perform an awk operation for the rest of the file. But I could not find it useful for what I am trying to do. I have merged three vcf files and translated the genotype numbers to calls using:

bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%TGT]\n' in.vcf | awk -v FS="\t" -v OFS="\t" '{for(i=6;i<=NF;i++) {split($i, gt, "/"); if(gt[1]==".") $i="-"; else if(gt[1]==gt[2]) $i=gt[1]; else $i="N";} print }' > out.vcf

Here, else $i="N" replaces the column names with N as well like this:

# [1]CHROM  [2]POS  [3]ID   [4]REF  [5]ALT  N   N   N

However, I want to keep those column names (based on my file names) for further analysis like this:

# [1]CHROM  [2]POS  [3]ID   [4]REF  [5]ALT  file1 file2 file3

I will appreciate your time and effort for any help. Thank you!

names • 1.7k views
ADD COMMENT
0
Entering edit mode
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%TGT]\n' in.vcf | awk -v FS="\t" -v OFS="\t" '{for(i=6;i<=NF;i++) {split($i, gt, "/"); if(gt[1]==".") $i="-"; else if(gt[1]==gt[2]) $i=gt[1]; else $i="N";} print }' > out.vcf

Can you please explain what are you trying to achieve?

ADD REPLY
0
Entering edit mode

I merged vcf files and then called the genotypes. Then changed the 1/1 0/1 0/0 ./. genotype calls to bases using awk. It changes heterozygous bases to (0/1 or 1/0) to 'N' using else $i="N". But I want to keep the column names which are also changed to 'N'. Thank you!

ADD REPLY
3
Entering edit mode
5.3 years ago
bari.ballew ▴ 470

You can get awk to behave differently for different rows by using conditional statements based on either line number:

(NR==1) {do something} (NR>1) {do something else}

Or by a characteristic of the line:

$0~/^#CHROM/ {do something} $0!~/^#CHROM/ {do something else}

HOWEVER, I strongly recommend sticking with the canonical VCF format if at all possible. Parsing a VCF on your own is always more complex than it seems at first. For instance, do you have any multi-allelic sites? If so, you need to make sure that your output file shows the correct alt allele for each genotype (e.g. ref=A, alt=G,C, genotypes could be 0/1 meaning AG, or 0/2 meaning AC, or 1/2 meaning GC, etc.).

Additionally, since it looks as though you've merged VCFs and not gVCFs, please bear in mind that VCFs only report a genomic location if there is a variant in that individual, so you are susceptible to a missing data problem. When a variant is reported in A.vcf, but not in B.vcf, the merged file will record the variant as missing "./." for sample B. Does that mean there was insufficient coverage to make a call, or was there plenty of coverage and simply no variant reads?

ADD COMMENT
0
Entering edit mode

Thank you for your suggestion! I agree parsing a vcf file makes it complicated especially for heterozygous bases so I am using 'N' for all such cases as I do not need to use heterozygous calls for my downstream analysis.

ADD REPLY
1
Entering edit mode
5.3 years ago

You can extract the header line from your original vcf and cut to the needed columns. Afterward you can combine it with you r file (which is no valid vcf file anymore!)

$ grep -m1 "^#CHROM" input.vcf | cut -f1-5,10- | cat - output.vcf > newfile

I absolute supports bari.ballew note to use a standardized format like vcf wherever possible and not switch to something custom if not absolutely needed.

Just a note: This is a follow up question of this thread.

ADD COMMENT
0
Entering edit mode

Thank you for your time and effort @finswimer @bari.ballew!

ADD REPLY

Login before adding your answer.

Traffic: 1692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6