bash script
Hello everyone,
I have a file like this:
RSID1 RSID2
chr1_169894240_G_T_b38 chr1_169894240_G_T_b38
chr1_169894240_G_T_b38 chr1_169891332_G_A_b38
chr1_169891332_G_A_b38 chr1_169891332_G_A_b38
chr1_169661963_G_A_b38 chr1_169661963_G_A_b38
chr1_169661963_G_A_b38 chr1_169697456_A_T_b38
chr1_169697456_A_T_b38 chr1_169697456_A_T_b38
chr1_27636786_T_C_b38 chr1_27636786_T_C_b38
chr1_196651787_C_T_b38 chr1_196651787_C_T_b38
chr6_143501715_T_C_b38 chr6_143501715_T_C_b38
I want to extract just the chr_pos part, like this:
chr1_169894240 chr1_169894240
I don't want any of the other information.
I am confused about how to extract this because the length varies: in one case it is 9 characters long and in another it is 10. So if I use the cut command by character position, for some rows it shows the right value (chr_pos), but for others it shows chr_pos_
Can anyone please help me out with this?
info
snp
model
substring
written 3.3 years ago by priyanka, updated 3.3 years ago by Ram
You can use cut or awk with "_" as the field separator, e.g., cut -f 1 yourfile.txt | awk -v FS="_" '{print $1"_"$2}'
If you have a 2-column TSV file, you can try:
paste <(cut -f 1 yourfile.txt | awk -v FS="_" '{print $1"_"$2}') <(cut -f 2 yourfile.txt | awk -v FS="_" '{print $1"_"$2}')
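The same idea (split on "_" and keep the first two pieces) can probably also be done in a single awk pass over every column, without paste. This is only a minimal sketch: the function name chr_pos is hypothetical, and it assumes the columns are separated by spaces or tabs and that every variant ID starts with chr_pos. A header line such as "RSID1 RSID2" has no underscore, so it is passed through unchanged.

```shell
# Hypothetical helper (not from the thread): reduce every column to chr_pos.
# Assumes whitespace-separated columns; output is tab-separated.
chr_pos() {
  awk -v OFS='\t' '{
    $1 = $1                               # force the line to be rebuilt with OFS
    for (i = 1; i <= NF; i++) {
      n = split($i, parts, "_")           # break the ID on underscores
      if (n >= 2) $i = parts[1] "_" parts[2]   # keep only chromosome and position
    }
    print
  }' "$@"
}

# Usage: read from a file, or from standard input when no file is given.
printf 'chr1_27636786_T_C_b38 chr6_143501715_T_C_b38\n' | chr_pos
```

Because the split is on the delimiter rather than on character positions, it should not matter whether the position is 8 or 9 digits long.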
$ sed -r 's/_\w_\w_\w{3}//g' test.txt
$ awk -v OFS="\t" -F '[_\t]' '{print $1"_"$2,$6"_"$7}' test.txt
$ parallel --colsep "_|\t" echo {1}_{2} {6}_{7} :::: test.txt | sed 's/\s/\t/'
chr1_169894240 chr1_169894240
chr1_169894240 chr1_169891332
chr1_169891332 chr1_169891332
chr1_169661963 chr1_169661963
chr1_169661963 chr1_169697456
chr1_169697456 chr1_169697456
chr1_27636786 chr1_27636786
chr1_196651787 chr1_196651787
chr6_143501715 chr6_143501715
For the win, you can even do a fancy regex with sed:
cat data.tsv
chr1_169894240_G_T_b38 chr1_169894240_G_T_b38
chr1_169894240_G_T_b38 chr1_169891332_G_A_b38
chr1_169891332_G_A_b38 chr1_169891332_G_A_b38
chr1_169661963_G_A_b38 chr1_169661963_G_A_b38
chr1_169661963_G_A_b38 chr1_169697456_A_T_b38
chr1_169697456_A_T_b38 chr1_169697456_A_T_b38
chr1_27636786_T_C_b38 chr1_27636786_T_C_b38
chr1_196651787_C_T_b38 chr1_196651787_C_T_b38
chr6_143501715_T_C_b38 chr6_143501715_T_C_b38
sed 's/_[ATGC]_[ATGC]_[a-z][0-9]*//g' data.tsv
chr1_169894240 chr1_169894240
chr1_169894240 chr1_169891332
chr1_169891332 chr1_169891332
chr1_169661963 chr1_169661963
chr1_169661963 chr1_169697456
chr1_169697456 chr1_169697456
chr1_27636786 chr1_27636786
chr1_196651787 chr1_196651787
chr6_143501715 chr6_143501715
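The pattern above matches exactly one base per allele followed by a build tag like b38. If the file also contained longer alleles (an assumption; the sample data only shows single bases), a variant that captures chr_pos and drops whatever follows might be more forgiving. A minimal sketch, with the hypothetical function name strip_alleles:

```shell
# Hypothetical helper (not from the thread): keep "name_digits" and drop the
# rest of each whitespace-delimited ID, whatever the allele lengths are.
strip_alleles() {
  # ([^_[:space:]]+_[0-9]+)  captures e.g. chr1_169894240
  # _[^[:space:]]*           matches the remainder of the ID, which is discarded
  sed -E 's/([^_[:space:]]+_[0-9]+)_[^[:space:]]*/\1/g' "$@"
}

# Usage: pass a file name, or pipe data in.
printf 'chr1_169894240_G_T_b38\tchr1_169891332_G_A_b38\n' | strip_alleles
```

Lines without an underscore, such as a header, are left as they are.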
Kevin
Thank you so much. It worked. Can you also share a link where I can learn about the awk command in detail? I only know the basics of it.