Question

Awk- How to compare 2 files with common columns and then get the output file with columns from each file.

0

Entering edit mode

8.1 years ago

bright.g.january1732 • 0

Hi folks

There is two input files.(File formats is tsv, both)

File1: treated.bam.tsv File2: untreated.bam.tsv

Both has ‘same fields’ like below. (I numbered and delimited with ‘/’) – each file has 23 fields.

1chrom/2pos/3ref/4reads_all/5reads_pp/6matches/7matches_pp/8mismatches/9mismatches_pp/10deletions/11deletions_pp/12insertions/13insertions_pp/14A/15A_pp/16C/17C_pp/18T/19T_pp/20G/21G_pp/22N/23N_pp

If values in 1st and 2nd columns(chrom, pos) are same in both files, I want to extract some of fields in the records, then make a new output file like this below. Output file has 15 fields from 2 input files like below.

- From file1
1chrom
2pos
3ref
4reads_all
8mismatches
10deletions
12insertions
pct_file1(the values from file1: (8mismatches+10deletions+12insertions)/ 4reads_all) 
- From file2
3ref
4reads_all
8mismatches
10deletions
12insertions
pct_file2(the values from file2: (8mismatches+10deletions+12insertions)/ 4reads_all) 
- New colume from both files.
pct_sub(the values from pct_file1 - pct_file2: ((8mismatches+10deletions+12insertions)/ 4reads_all) - ((8mismatches+10deletions+12insertions)/ 4reads_all))

In the output file, First 8 columns come from File1. treated.bam.tsv (the 8th column is the value calculated with 8mismatches, 10deletions, 12insertions and 4reads_all from File1).

The rest of them come from File2, (File2. untreated.bam.tsv), and 13th column is also the value calculated with 8mismatches, 10deletions, 12insertions and 4reads_all from file2).

The last field, pct_sub is calculated from the subtraction value with fields from file1(((8mismatches+10deletions+12insertions)/ 4reads_all)) and file2 from ((8mismatches+10deletions+12insertions)/ 4reads_all)).

And how can I add new column names in header of output files, like pct_file1, pct_file2, pct_sub

This is what I made for the above output file.(input and output files have all same format. TSV)

awk 'FNR==NR{array[$1,$2]=$0;next} { if ( $1 $2 in array ) print $1, $2, array[$3], array[$4], array[$8], array[$10], array[$12], (array[$8]+array[$10]+array[$12])/array[$4], $3, $4, $8, $10, $12, ($8+$10+$12)/$4, ((array[$8]+array[$10]+array[$12])/array[$4])-(($8+$10+$12)/$4) > "awkoutput.bam.tsv" }' treated.bam.tsv untreated.bam.tsv

(Actually, $1, $2 are not problem from File1 or File2)

**FILE1(treated)**                                                                                      

chrom   pos ref reads_all   reads_pp    matches matches_pp  mismatches  mismatches_pp   deletions   deletions_pp    insertions  insertions_pp   A   A_pp    C   C_pp    T   T_pp    G   G_pp    N   N_pp
chrY    59363551    G   8   0   7   0   0   0   1   0   5   0   0   0   0   0   0   0   7   0   0   0
chrY    59363552    G   7   0   7   0   0   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0
chrY    59363553    T   7   0   7   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0
chrY    59363554    G   7   0   7   0   0   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0
chrY    59363555    T   7   0   7   0   0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0



**FILE2(untreated)**    
chrom   pos ref reads_all   reads_pp    matches matches_pp  mismatches  mismatches_pp   deletions   deletions_pp    insertions  insertions_pp   A   A_pp    C   C_pp    T   T_pp    G   G_pp    N   N_pp
chrY    59363551    G   2   0   2   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0   0   0
chrY    59363552    G   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
chrY    59363553    T   1   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0
chrY    59363554    G   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
chrY    59363555    T   1   0   1   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0

OUTPUT

chrom   pos ref reads_all   mismatches  deletions   insertions  pct_file1   ref reads_all   mismatches  deletions   insertions  pct_file2   pct_sub             
chrY    59363551    G   8   0   1   5   0.75    G   2   0   0   1   0.5 0.25                
chrY    59363552    G   7   0   0   0   0   G   1   0   0   0   0   0               
chrY    59363553    T   7   0   0   0   0   T   1   0   0   0   0   0               
chrY    59363554    G   7   0   0   0   0   G   1   0   0   0   0   0               
chrY    59363555    T   7   0   0   0   0   T   1   0   0   0   0   0

Thank you in advance.

-Jan

awk • 8.5k views

ADD COMMENT • link 8.1 years ago by bright.g.january1732 • 0

1

Entering edit mode

I have a rather simple solution with excel/R. Merge fields chr and pos by a delimiter like -, so you have one column. Then you can use the VLOOKUP function in excel or the merge() function in R.

ADD REPLY • link 8.1 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

@Adrian Pelin Thank you for your solution but I cannot do it with excel because it is over the limitation of row number in excel.

ADD REPLY • link 8.1 years ago by bright.g.january1732 • 0