Hi folks
There is two input files.(File formats is tsv, both)
File1: treated.bam.tsv File2: untreated.bam.tsv
Both has ‘same fields’ like below. (I numbered and delimited with ‘/’) – each file has 23 fields.
1chrom/2pos/3ref/4reads_all/5reads_pp/6matches/7matches_pp/8mismatches/9mismatches_pp/10deletions/11deletions_pp/12insertions/13insertions_pp/14A/15A_pp/16C/17C_pp/18T/19T_pp/20G/21G_pp/22N/23N_pp
If values in 1st and 2nd columns(chrom, pos) are same in both files, I want to extract some of fields in the records, then make a new output file like this below. Output file has 15 fields from 2 input files like below.
- From file1
1chrom
2pos
3ref
4reads_all
8mismatches
10deletions
12insertions
pct_file1(the values from file1: (8mismatches+10deletions+12insertions)/ 4reads_all)
- From file2
3ref
4reads_all
8mismatches
10deletions
12insertions
pct_file2(the values from file2: (8mismatches+10deletions+12insertions)/ 4reads_all)
- New colume from both files.
pct_sub(the values from pct_file1 - pct_file2: ((8mismatches+10deletions+12insertions)/ 4reads_all) - ((8mismatches+10deletions+12insertions)/ 4reads_all))
In the output file, First 8 columns come from File1. treated.bam.tsv (the 8th column is the value calculated with 8mismatches, 10deletions, 12insertions and 4reads_all from File1).
The rest of them come from File2, (File2. untreated.bam.tsv), and 13th column is also the value calculated with 8mismatches, 10deletions, 12insertions and 4reads_all from file2).
The last field, pct_sub is calculated from the subtraction value with fields from file1(((8mismatches+10deletions+12insertions)/ 4reads_all)) and file2 from ((8mismatches+10deletions+12insertions)/ 4reads_all)).
And how can I add new column names in header of output files, like pct_file1, pct_file2, pct_sub
This is what I made for the above output file.(input and output files have all same format. TSV)
awk 'FNR==NR{array[$1,$2]=$0;next} { if ( $1 $2 in array ) print $1, $2, array[$3], array[$4], array[$8], array[$10], array[$12], (array[$8]+array[$10]+array[$12])/array[$4], $3, $4, $8, $10, $12, ($8+$10+$12)/$4, ((array[$8]+array[$10]+array[$12])/array[$4])-(($8+$10+$12)/$4) > "awkoutput.bam.tsv" }' treated.bam.tsv untreated.bam.tsv
(Actually, $1, $2 are not problem from File1 or File2)
**FILE1(treated)**
chrom pos ref reads_all reads_pp matches matches_pp mismatches mismatches_pp deletions deletions_pp insertions insertions_pp A A_pp C C_pp T T_pp G G_pp N N_pp
chrY 59363551 G 8 0 7 0 0 0 1 0 5 0 0 0 0 0 0 0 7 0 0 0
chrY 59363552 G 7 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0
chrY 59363553 T 7 0 7 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0
chrY 59363554 G 7 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0
chrY 59363555 T 7 0 7 0 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0
**FILE2(untreated)**
chrom pos ref reads_all reads_pp matches matches_pp mismatches mismatches_pp deletions deletions_pp insertions insertions_pp A A_pp C C_pp T T_pp G G_pp N N_pp
chrY 59363551 G 2 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0
chrY 59363552 G 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
chrY 59363553 T 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
chrY 59363554 G 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
chrY 59363555 T 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
OUTPUT
chrom pos ref reads_all mismatches deletions insertions pct_file1 ref reads_all mismatches deletions insertions pct_file2 pct_sub
chrY 59363551 G 8 0 1 5 0.75 G 2 0 0 1 0.5 0.25
chrY 59363552 G 7 0 0 0 0 G 1 0 0 0 0 0
chrY 59363553 T 7 0 0 0 0 T 1 0 0 0 0 0
chrY 59363554 G 7 0 0 0 0 G 1 0 0 0 0 0
chrY 59363555 T 7 0 0 0 0 T 1 0 0 0 0 0
Thank you in advance.
-Jan
I have a rather simple solution with excel/R. Merge fields chr and pos by a delimiter like -, so you have one column. Then you can use the VLOOKUP function in excel or the merge() function in R.
@Adrian Pelin Thank you for your solution but I cannot do it with excel because it is over the limitation of row number in excel.