Awk- How to compare 2 files with common columns and then get the output file with columns from each file.
0
0
Entering edit mode
8.1 years ago

Hi folks

There is two input files.(File formats is tsv, both)

File1: treated.bam.tsv File2: untreated.bam.tsv

Both has ‘same fields’ like below. (I numbered and delimited with ‘/’) – each file has 23 fields.

1chrom/2pos/3ref/4reads_all/5reads_pp/6matches/7matches_pp/8mismatches/9mismatches_pp/10deletions/11deletions_pp/12insertions/13insertions_pp/14A/15A_pp/16C/17C_pp/18T/19T_pp/20G/21G_pp/22N/23N_pp

If values in 1st and 2nd columns(chrom, pos) are same in both files, I want to extract some of fields in the records, then make a new output file like this below. Output file has 15 fields from 2 input files like below.

- From file1
1chrom
2pos
3ref
4reads_all
8mismatches
10deletions
12insertions
pct_file1(the values from file1: (8mismatches+10deletions+12insertions)/ 4reads_all) 
- From file2
3ref
4reads_all
8mismatches
10deletions
12insertions
pct_file2(the values from file2: (8mismatches+10deletions+12insertions)/ 4reads_all) 
- New colume from both files.
pct_sub(the values from pct_file1 - pct_file2: ((8mismatches+10deletions+12insertions)/ 4reads_all) - ((8mismatches+10deletions+12insertions)/ 4reads_all))

In the output file, First 8 columns come from File1. treated.bam.tsv (the 8th column is the value calculated with 8mismatches, 10deletions, 12insertions and 4reads_all from File1).

The rest of them come from File2, (File2. untreated.bam.tsv), and 13th column is also the value calculated with 8mismatches, 10deletions, 12insertions and 4reads_all from file2).

The last field, pct_sub is calculated from the subtraction value with fields from file1(((8mismatches+10deletions+12insertions)/ 4reads_all)) and file2 from ((8mismatches+10deletions+12insertions)/ 4reads_all)).

And how can I add new column names in header of output files, like pct_file1, pct_file2, pct_sub

This is what I made for the above output file.(input and output files have all same format. TSV)

awk 'FNR==NR{array[$1,$2]=$0;next} { if ( $1 $2 in array ) print $1, $2, array[$3], array[$4], array[$8], array[$10], array[$12], (array[$8]+array[$10]+array[$12])/array[$4], $3, $4, $8, $10, $12, ($8+$10+$12)/$4, ((array[$8]+array[$10]+array[$12])/array[$4])-(($8+$10+$12)/$4) > "awkoutput.bam.tsv" }' treated.bam.tsv untreated.bam.tsv

(Actually, $1, $2 are not problem from File1 or File2)

**FILE1(treated)**                                                                                      

chrom   pos ref reads_all   reads_pp    matches matches_pp  mismatches  mismatches_pp   deletions   deletions_pp    insertions  insertions_pp   A   A_pp    C   C_pp    T   T_pp    G   G_pp    N   N_pp
chrY    59363551    G   8   0   7   0   0   0   1   0   5   0   0   0   0   0   0   0   7   0   0   0
chrY    59363552    G   7   0   7   0   0   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0
chrY    59363553    T   7   0   7   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0
chrY    59363554    G   7   0   7   0   0   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0
chrY    59363555    T   7   0   7   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0



**FILE2(untreated)**    
chrom   pos ref reads_all   reads_pp    matches matches_pp  mismatches  mismatches_pp   deletions   deletions_pp    insertions  insertions_pp   A   A_pp    C   C_pp    T   T_pp    G   G_pp    N   N_pp
chrY    59363551    G   2   0   2   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0   0   0
chrY    59363552    G   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
chrY    59363553    T   1   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0
chrY    59363554    G   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
chrY    59363555    T   1   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0

OUTPUT

chrom   pos ref reads_all   mismatches  deletions   insertions  pct_file1   ref reads_all   mismatches  deletions   insertions  pct_file2   pct_sub             
chrY    59363551    G   8   0   1   5   0.75    G   2   0   0   1   0.5 0.25                
chrY    59363552    G   7   0   0   0   0   G   1   0   0   0   0   0               
chrY    59363553    T   7   0   0   0   0   T   1   0   0   0   0   0               
chrY    59363554    G   7   0   0   0   0   G   1   0   0   0   0   0               
chrY    59363555    T   7   0   0   0   0   T   1   0   0   0   0   0

Thank you in advance.

-Jan

awk • 8.5k views
ADD COMMENT
1
Entering edit mode

I have a rather simple solution with excel/R. Merge fields chr and pos by a delimiter like -, so you have one column. Then you can use the VLOOKUP function in excel or the merge() function in R.

ADD REPLY
0
Entering edit mode

@Adrian Pelin Thank you for your solution but I cannot do it with excel because it is over the limitation of row number in excel.

ADD REPLY

Login before adding your answer.

Traffic: 1641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6