Replace the content of a column with the content of a variable
5.5 years ago

Hello,

My final goal is to replace, for each line of the GQPDOMB_impute_copie.vcf file, the INFO column (column 8) with the contents of columns 2 and 3 of the matching line in the formating.txt file.

Here is my idea:

For each line of the GQPDOMB_impute_copie.vcf file
do
    the variable rs gets the rsID of the current line (column 3 of the GQPDOMB_impute_copie.vcf file)
    the variable VAR1 holds the result of searching formating.txt for the content of rs
    if VAR1 is not empty (the rsID of this line was found in formating.txt)
    then
        the variable ra gets the contents of columns 2 and 3 of the matching line of formating.txt
        column 8 of the current line is replaced by the content of ra
    fi
done

GQPDOMB_impute_copie.vcf:

##fileformat=VCFv4.3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AERMQNK-paris-8400326-P6-recipient_AERMQNK-paris-8400326-P6-recipient   ....
1   783071  rs142849724 C   T   .   PASS    TYPED;RefPanelAF=0.018571;AN=80;AC=5;INFO=1 GT  0|0 0|0 1|0 0|0 1|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0 0|1 0|0 0|0 0|0
1   783186  rs141989890 G   C   .   PASS    RefPanelAF=0.000323375;AN=80;AC=0;INFO=1    GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
1   783632  rs193023236 G   A   .   PASS    RefPanelAF=0.00040037;AN=80;AC=0;INFO=1 GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0

formating.txt:

rs142849724;ENSG00000228794;ENST00000624927|ENST00000623808|ENST00000445118|ENST00000448975|ENST00000610067|ENST00000608189|ENST00000609139|ENST00000449005|ENST00000416570|ENST00000623070|ENST00000609009|ENST00000622921
rs141989890;ENSG00000228794;ENST00000624927|ENST00000623808|ENST00000445118|ENST00000448975|ENST00000610067|ENST00000608189|ENST00000609139|ENST00000449005|ENST00000416570|ENST00000623070|ENST00000609009|ENST00000622921
rs193023236;ENSG00000228794;ENST00000624927|ENST00000623808|ENST00000445118|ENST00000448975|ENST00000610067|ENST00000608189|ENST00000609139|ENST00000449005|ENST00000416570|ENST00000623070|ENST00000609009|ENST00000622921

After a lot of searching online, here is the code I came up with:

#!/bin/bash

while read line
do
    rs=$(awk -F '\t' '{print $3}' GQPDOMB_impute_copie.vcf)     #get the rsID
    VAR1=$(grep "${rs}" formating.txt)      #check whether the rsID of the current line is found in formating.txt
    if [ -n "$VAR1" ] ;     #if the rsID of the current line has been found
    then
        ra=$(grep "${rs}" formating.txt | awk -F ';' '{print $2,";",$3}')   #store the contents of columns 2 and 3 of formating.txt in one variable
        awk -F '\t' -v t="\"$ra\"" '{$8=t; print }' OFS='\t' GQPDOMB_impute_copie.vcf   #replace the content of column 8 (INFO) with the content of the previous variable
    fi
done < GQPDOMB_impute_copie.vcf

However, I think the program does not read the VCF file line by line and fails to create the variable VAR1. Here is the error I get:

./script-info.sh: line 16: /usr/bin/grep: Argument list too long
./script-info.sh: line 16: /usr/bin/grep: Argument list too long
./script-info.sh: line 16: /usr/bin/grep: Argument list too long

How can I get this script to work, ideally as efficiently as possible?

Thank you in advance.
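
A likely explanation for the "Argument list too long" error, offered as a sketch rather than a confirmed diagnosis: inside the loop, `awk ... GQPDOMB_impute_copie.vcf` re-reads the whole file on every iteration, so `rs` holds column 3 of every line at once, and that very long string is passed to grep as a single argument. The loop below uses only the line that `read` has just returned. The file names `sample.vcf`, `sample_map.txt` and `sample_loop_out.vcf` are hypothetical stand-ins, filled with a trimmed-down copy of the data from the post.

```shell
#!/bin/bash
# Tiny stand-in files (hypothetical names, data trimmed from the post).
printf '##fileformat=VCFv4.3\n' > sample.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> sample.vcf
printf '1\t783071\trs142849724\tC\tT\t.\tPASS\tAN=80\tGT\t0|0\n' >> sample.vcf
printf '1\t783186\trs141989890\tG\tC\t.\tPASS\tAN=80\tGT\t0|0\n' >> sample.vcf
# Only the first rsID is listed here, so the second record stays unchanged.
printf 'rs142849724;ENSG00000228794;ENST00000624927|ENST00000623808\n' > sample_map.txt

while IFS= read -r line; do
    # Header lines start with "#": copy them through unchanged.
    if [[ $line == \#* ]]; then
        printf '%s\n' "$line"
        continue
    fi
    # Split only the current line on tabs (assumes no empty fields,
    # because read collapses consecutive tabs).
    IFS=$'\t' read -r -a f <<< "$line"
    rs=${f[2]}                      # rsID of this line only
    # Exact match on column 1 of the lookup file; prints "col2;col3".
    ra=$(awk -F ';' -v id="$rs" '$1 == id { print $2 ";" $3; exit }' sample_map.txt)
    if [ -n "$ra" ]; then
        f[7]=$ra                    # column 8 (INFO) is array index 7
    fi
    # Re-join the fields with tabs and print the record.
    (IFS=$'\t'; printf '%s\n' "${f[*]}")
done < sample.vcf > sample_loop_out.vcf
```

This starts one awk process per record, so it will be slow on a large VCF; a single awk pass over both files avoids that cost.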

bash awk grep • 1.1k views
5.5 years ago
AK ★ 2.2k

Hi amandinelecerfdefer,

You can try:

sed "s/;/\t/" formating.txt > formating_re.txt
awk -F '\t' 'BEGIN{OFS=FS} FILENAME=="formating_re.txt" {info[$1]=$2; next} $0!~/^#/ && info[$3]!="" {$8=info[$3]; print}' formating_re.txt GQPDOMB_impute_copie.vcf

This prints only the data lines of GQPDOMB_impute_copie.vcf whose rsID can be found in formating_re.txt. Header lines and lines without a match are dropped.
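
One thing to watch with the command above: `\t` in a sed replacement is a GNU sed feature, and other sed implementations may insert a literal `t` instead of a tab. As a hedged alternative, the sketch below skips the intermediate file and splits each line of the lookup file at its first `;` inside awk. It also keeps the `#` header lines so the output still looks like a VCF, while still dropping unmatched records. The names `sample.vcf`, `sample_map.txt` and `sample_matched.vcf` are hypothetical stand-ins with data trimmed from the post.

```shell
#!/bin/bash
# Tiny stand-in files (hypothetical names, data trimmed from the post).
printf '##fileformat=VCFv4.3\n' > sample.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> sample.vcf
printf '1\t783071\trs142849724\tC\tT\t.\tPASS\tAN=80\tGT\t0|0\n' >> sample.vcf
printf '1\t783186\trs141989890\tG\tC\t.\tPASS\tAN=80\tGT\t0|0\n' >> sample.vcf
# Only the first rsID is listed, so the second record is dropped.
printf 'rs142849724;ENSG00000228794;ENST00000624927|ENST00000623808\n' > sample_map.txt

awk 'BEGIN { FS = OFS = "\t" }
     # First file: cut each line at its first ";" (key = rsID, value = the rest).
     NR == FNR {
         i = index($0, ";")
         if (i) info[substr($0, 1, i - 1)] = substr($0, i + 1)
         next
     }
     # Second file: header lines are printed as they are.
     /^#/ { print; next }
     # Data lines are printed only when the rsID (column 3) is known.
     $3 in info { $8 = info[$3]; print }' sample_map.txt sample.vcf > sample_matched.vcf
```

Using `$3 in info` rather than `info[$3] != ""` avoids creating an empty array entry for every rsID that is looked up.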

5.5 years ago

Your solution doesn't work :/ but I found another solution:

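BEGIN   { FS=";" }                            # formating.txt is ";"-separated
NR==FNR { val[$1] = $2 FS $3; next }          # first file: map rsID -> "col2;col3"
FNR==1  { FS=OFS="\t"; $0=$0 }                # second file: switch to tabs and re-split its first line
!/^#/   { $8 = ($3 in val ? val[$3] : $8) }   # data lines: replace INFO when the rsID is known
{ print }                                     # print every line, headers included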
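
The post does not show how this awk program is run. A plausible invocation, assuming it is saved under the hypothetical name `replace_info.awk`, passes the lookup file first and the VCF second, because the `NR==FNR` rule only fires while the first file is being read. The sample files below are hypothetical stand-ins with data trimmed from the question.

```shell
#!/bin/bash
# Tiny stand-in files (hypothetical names, data trimmed from the post).
printf '##fileformat=VCFv4.3\n' > sample.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> sample.vcf
printf '1\t783071\trs142849724\tC\tT\t.\tPASS\tAN=80\tGT\t0|0\n' >> sample.vcf
printf '1\t783186\trs141989890\tG\tC\t.\tPASS\tAN=80\tGT\t0|0\n' >> sample.vcf
# Only the first rsID is listed, so the second record keeps its INFO.
printf 'rs142849724;ENSG00000228794;ENST00000624927|ENST00000623808\n' > sample_map.txt

# Save the awk program; the quoted 'EOF' stops the shell from expanding $1, $2, ...
cat > replace_info.awk <<'EOF'
BEGIN   { FS=";" }
NR==FNR { val[$1] = $2 FS $3; next }
FNR==1  { FS=OFS="\t"; $0=$0 }
!/^#/   { $8 = ($3 in val ? val[$3] : $8) }
{ print }
EOF

# Lookup file first, VCF second; the result goes to a new file.
awk -f replace_info.awk sample_map.txt sample.vcf > sample_annotated.vcf
```

Unlike the previous answer, this keeps every line of the VCF: headers are printed untouched and records without a match keep their original INFO field.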