Question

string replacement from one file to another

0

5.2 years ago

jordi.planells ▴ 480

Hi all, I need your help.

I am trying to convert chromosome names from the UCSC naming convention to the Ensembl one. I have a .bed file with an annotation and a .txt file that lists the equivalent names in two columns. The files look like:

.bed
chr2L 324 453
chr3R 65433 73563
chr4 5345 9854
... etc
.txt equivalence
chr2L 2L
chr3L 3L
chr4 4L
... etc

I know you can use sed 's/chr2L/2L/g' for replacing the patterns. However, doing it for all the chromosomes and scaffolds (approximately 2000 different ones) is not feasible.

I am looking for a script (I don't mind the programming language) or a tool that works as follows:

Read the equivalence file and store the name pairs. Then read the .bed file and replace the name in the chromosome field with its equivalent.

Thank you in advance, have a great day! Best,

Jordi

bash python Assembly • 976 views

updated 5.2 years ago by Pierre Lindenbaum 166k • written 5.2 years ago by jordi.planells ▴ 480

0

Using tsv-utils:

tsv-join -f test.bed test.txt --key-fields 1 --append-fields 2,3 | awk -v OFS="\t" '{print $3,$4,$2}'
324 453 2L
5345    9854    4L

With awk:

awk -v OFS="\t" 'NR==FNR {a[$1]=$1"\t"$2;next} ($1 in a) {print $2,$3,a[$1]}' test.txt test.bed | awk '{print $1,$2,$4}'     

324 453 2L
5345 9854 4L
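For anyone who prefers Python, here is a minimal sketch of the same lookup idea: load the equivalence file into a dict, then rewrite column 1 of each BED line. The function names `load_mapping` and `rename_chromosomes` are made up for this example, and the input is the sample data from the question held in memory. Unlike the commands above, it keeps the chromosome in the first column and passes unmapped names (such as chr3R here) through unchanged instead of dropping the row; that is an assumption about what is wanted.

```python
import io


def load_mapping(handle):
    """Read 'old_name new_name' pairs (whitespace-separated) into a dict."""
    mapping = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_chromosomes(bed_lines, mapping):
    """Yield BED lines with the first column replaced via the mapping.

    Names missing from the mapping are left as they are (an assumption;
    the join/awk approaches drop those rows instead).
    """
    for line in bed_lines:
        line = line.rstrip("\r\n")
        # Pass blank lines and BED header lines through untouched.
        if not line or line.startswith(("#", "track", "browser")):
            yield line
            continue
        # Prefer tab splitting; fall back to any whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        fields[0] = mapping.get(fields[0], fields[0])
        yield "\t".join(fields)


# Sample data copied from the question, held in memory for the demo.
equivalence = io.StringIO("chr2L 2L\nchr3L 3L\nchr4 4L\n")
bed = io.StringIO("chr2L 324 453\nchr3R 65433 73563\nchr4 5345 9854\n")

mapping = load_mapping(equivalence)
for out_line in rename_chromosomes(bed, mapping):
    print(out_line)
```

With real files, replace the two `io.StringIO` objects with `open("test.txt")` and `open("test.bed")`. A dict lookup per line means the roughly 2000 names cost nothing extra compared with a handful.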

5.2 years ago by cpad0112 21k

0

Thank you as well! AWK is awesome and terribly powerful. I will spend some time trying to master it.

5.2 years ago by jordi.planells ▴ 480

score 3 · Accepted Answer · 2020-05-06

3

5.2 years ago

Pierre Lindenbaum 166k

join -t $'\t' -1 1 -2 1 <(sort -t $'\t' -k1,1 file2.tsv) <(sort -t $'\t' -k1,1 file1.tsv) | cut -f 2-

Otherwise, I wrote a tool to substitute chromosome names: http://lindenb.github.io/jvarkit/ConvertBedChromosomes.html