Question

Creating numbered gene sets from a list of genes

3.5 years ago

storm1907 ▴ 30

Hello, I have a two-column file with gene names in the first column and chromosomal positions in the second. Some rows have no gene name:

        1:924024
        1:924310
SAMD11  1:930353
SAMD11  1:930939
NOC2L   1:944858
NOC2L   1:946247
KLHL17  1:960891
KLHL17  1:961945

It needs to be converted to the following format, keeping the two-column layout. Each gene gets its own set number, and each row without a gene gets a number of its own.

1:na         1:924024
2:na         1:924310
3:SAMD11    1:930353
3:SAMD11    1:930939
4:NOC2L     1:944858
4:NOC2L     1:946247
5:KLHL17    1:960891
5:KLHL17    1:961945

Adding a column of line numbers is easy, but this is a more specific problem. Is there a way to do this in bash?

Thank you!

Bash awk geneset • 1.4k views

ADD COMMENT • link updated 3.5 years ago by cpad0112 21k • written 3.5 years ago by storm1907 ▴ 30

$ awk -F "\t" -v OFS="\t" '{ ($1=="")? ($1=NR":na"):($1=NR":"$1)}1' test.txt 

1:na    1:924024
2:na    1:924310
3:SAMD11    1:930353
4:SAMD11    1:930939
5:NOC2L     1:944858
6:NOC2L     1:946247
7:KLHL17    1:960891
8:KLHL17    1:961945

ADD REPLY • link 3.5 years ago by cpad0112 21k

Thank you! However, I am struggling to make a setID file in which each gene, not each row, gets a different number:

1:na         1:924024
2:na         1:924310
3:SAMD11    1:930353
3:SAMD11    1:930939
4:NOC2L     1:944858
4:NOC2L     1:946247
5:KLHL17    1:960891
5:KLHL17    1:961945

SAMD11 is in set 3, NOC2L is in set 4, and so on.

ADD REPLY • link 3.5 years ago by storm1907 ▴ 30

score 1 · Answer 1 · 2021-05-24

3.5 years ago

cfos4698 ★ 1.1k

You can use the following command:

awk -F'\t' '!$1{ $1="na" }1' your_file.txt | nl -s":" -d$'\n' | sed "s/^ *//g"

Output:

1:na 1:924024
2:na 1:924310
3:SAMD11  1:930353
4:SAMD11  1:930939
5:NOC2L   1:944858
6:NOC2L   1:946247
7:KLHL17  1:960891
8:KLHL17  1:961945

The awk command replaces empty values in the first column with na, the nl command adds line numbers separated by a colon, and the sed command removes leading spaces introduced by nl.

Edit: I just noticed that you want duplicate genes to be assigned the same number. I'll need someone better at awk to solve that issue!
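
One way to give repeated genes the same number is a single awk pass that advances a counter only when the gene name changes. The following is a minimal sketch, not a tested replacement for the answers in this thread. It assumes tab-separated input, an empty first field for rows without a gene, and identical genes on consecutive lines. The function name number_gene_sets is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads tab-separated "gene<TAB>position" lines on stdin.
number_gene_sets() {
  awk -F'\t' -v OFS='\t' '
    # Start a new set when the gene is missing or differs from the previous line.
    $1 == "" || $1 != prev { n++ }
    {
      prev = $1
      print n ":" ($1 == "" ? "na" : $1), $2
    }'
}

# Sample rows from the question (empty first field = no gene).
printf '\t1:924024\n\t1:924310\nSAMD11\t1:930353\nSAMD11\t1:930939\nNOC2L\t1:944858\n' |
  number_gene_sets
```

For these sample rows the two unnamed positions become 1:na and 2:na, both SAMD11 rows become 3:SAMD11, and NOC2L becomes 4:NOC2L. If the same gene can reappear later in the file after other genes, this adjacency-based counter would give it a new number.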

ADD COMMENT • link 3.5 years ago by cfos4698 ★ 1.1k

sed may not be necessary here.

$ awk -F'\t' '!$1{ $1="na" }1' your_file.txt | nl -w 1 -s":" -d$'\n'

If you want to use sed, the following should be enough:

$ nl -w 1 -s":" -d$'\n' test.txt | sed 's/:\s\+/:na\t/'

1:na    1:924024
2:na    1:924310
3:SAMD11    1:930353
4:SAMD11    1:930939
5:NOC2L     1:944858
6:NOC2L     1:946247
7:KLHL17    1:960891
8:KLHL17    1:961945

ADD REPLY • link 3.5 years ago by cpad0112 21k

score 1 · Answer 2 · 2021-05-24

3.5 years ago

cpad0112 21k

With datamash and awk:

$ datamash -g1  collapse 2 <test.txt  | nl -w1 -v 2| awk -F "\t" -v OFS="\t"  '{($2==" ")?($1=""):($1=$1)}; {($2==" ")?($2="na"):($2=$2)}1' | awk -F "\t" -v OFS='\t' '{split($3,a,",");for(i in a)print $1,$2,a[i]}'| awk -F "\t" -v OFS="\t" '$1=="" {$1=NR};{print $1":"$2,$3}'  

1:na    1:924024
2:na    1:924310
3:SAMD11    1:930353
3:SAMD11    1:930939
4:NOC2L 1:944858
4:NOC2L 1:946247
5:KLHL17    1:960891
5:KLHL17    1:961945

This works on the OP's text (tab-separated):

$ cat test.txt 
    1:924024
    1:924310
SAMD11  1:930353
SAMD11  1:930939
NOC2L   1:944858
NOC2L   1:946247
KLHL17  1:960891
KLHL17  1:961945

ADD COMMENT • link 3.5 years ago by cpad0112 21k

I tried this and got an error:

invalid input: field 2 requested, line 8 has only 1 fields


usr$ head input
OR4F5   1:69063
FO538757.1      1:183937
AL669831.3      1:601436
AL669831.3      1:601667
AL669831.3      1:609395
AL669831.3      1:609407
AL669831.3      1:611317
1:923421
1:924024
1:924310

ADD REPLY • link 3.5 years ago by storm1907 ▴ 30

Please format the fields the way the code expects (tab-separated), and please post the exact data format: the format in your original post and this one are different. I changed the code to handle both formats and made it more generic. Here is the output for the current data:

$ cat test.txt

OR4F5   1:69063
FO538757.1  1:183937
AL669831.3  1:601436
AL669831.3  1:601667
AL669831.3  1:609395
AL669831.3  1:609407
AL669831.3  1:611317
    1:923421     
    1:924024     
    1:924310     

$ awk -F "\t" -v OFS="\t" '$1==" " {$1="na_"NR}1' test.txt | datamash -g1  collapse 2 | awk -F "\t" -v OFS='\t' '{split($2,a,",");for(i in a) print NR,$1,a[i]}' | awk -v OFS="\t" '{gsub(/_.*/,"",$2)}{print $1":"$2,$3}'

1:OR4F5 1:69063
2:FO538757.1    1:183937
3:AL669831.3    1:601436
3:AL669831.3    1:601667
3:AL669831.3    1:609395
3:AL669831.3    1:609407
3:AL669831.3    1:611317
4:na    1:923421
5:na    1:924024
6:na    1:924310
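
For comparison, the same numbering can be sketched in awk alone, without datamash. This is an illustrative alternative rather than the method used above. It assumes whitespace-separated input where a row with a single field is a position without a gene, and it keeps a lookup table so that a gene seen earlier reuses its number even if its rows are not adjacent. The function name assign_set_ids is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: whitespace-separated input, gene column optional.
assign_set_ids() {
  awk -v OFS='\t' '
    NF == 1 { print ++n ":na", $1; next }   # no gene: every row is its own set
    NF >= 2 {
      if (!($1 in id)) id[$1] = ++n         # first time this gene is seen
      print id[$1] ":" $1, $2
    }'
}

# A few rows in the style of the data above.
printf 'OR4F5 1:69063\nAL669831.3 1:601436\nAL669831.3 1:601667\n1:923421\n1:924024\n' |
  assign_set_ids
```

Because the rows are printed as they are read, the input order is preserved and no intermediate collapse and split step is needed.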

ADD REPLY • link 3.5 years ago by cpad0112 21k