Creating numbered gene sets from a list of genes
3.5 years ago
storm1907 ▴ 30

Hello, I have a two-column file with gene names in the first column and chromosomal positions in the second. Some rows have no gene name:

        1:924024
        1:924310
SAMD11  1:930353
SAMD11  1:930939
NOC2L   1:944858
NOC2L   1:946247
KLHL17  1:960891
KLHL17  1:961945

I need to convert it to the following format, keeping the two-column layout. Each gene gets its own set number, and every row without a gene name gets its own number with the label na:

1:na         1:924024
2:na         1:924310
3:SAMD11    1:930353
3:SAMD11    1:930939
4:NOC2L     1:944858
4:NOC2L     1:946247
5:KLHL17    1:960891
5:KLHL17    1:961945

Adding row numbers is easy, but this is a more specific problem. Is there a way to do this in bash?

Thank you!

Bash awk geneset • 1.4k views
$ awk -F "\t" -v OFS="\t" '{ ($1=="")? ($1=NR":na"):($1=NR":"$1)}1' test.txt 

1:na    1:924024
2:na    1:924310
3:SAMD11    1:930353
4:SAMD11    1:930939
5:NOC2L     1:944858
6:NOC2L     1:946247
7:KLHL17    1:960891
8:KLHL17    1:961945

Thank you! However, I am struggling to make a setID file in which each gene, rather than each row, gets a different number:

1:na         1:924024
2:na         1:924310
3:SAMD11    1:930353
3:SAMD11    1:930939
4:NOC2L     1:944858
4:NOC2L     1:946247
5:KLHL17    1:960891
5:KLHL17    1:961945

SAMD11 is in set 3, NOC2L is in set 4, and so on.

3.5 years ago
cfos4698 ★ 1.1k

You can use the following command:

awk -F'\t' '!$1{ $1="na" }1' your_file.txt | nl -s":" -d$'\n' | sed "s/^ *//g"

Output:

1:na 1:924024
2:na 1:924310
3:SAMD11  1:930353
4:SAMD11  1:930939
5:NOC2L   1:944858
6:NOC2L   1:946247
7:KLHL17  1:960891
8:KLHL17  1:961945

The awk command replaces empty values in the first column with na, the nl command adds line numbers separated by a colon, and the sed command removes leading spaces introduced by nl.

Edit: I just noticed that you want rows with the same gene to be assigned the same number. I'll need someone better at awk to solve that issue!
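
One possible way to number by gene rather than by row is a single awk pass that bumps a counter only when the gene changes or is missing. This is a minimal sketch, not a tested solution for real data: it assumes tab-separated input in which rows for the same gene are adjacent, and the file names genes.tsv and sets.tsv are hypothetical.

```shell
# Hypothetical sample input mirroring the question (tab-separated; gene may be empty).
printf '\t1:924024\n\t1:924310\nSAMD11\t1:930353\nSAMD11\t1:930939\nNOC2L\t1:944858\nNOC2L\t1:946247\n' > genes.tsv

awk -F'\t' -v OFS='\t' '
    # Start a new set when the row has no gene or the gene differs from the previous row.
    $1 == "" || $1 != prev { n++ }
    {
        prev = $1                              # remember the gene before rewriting $1
        $1 = n ":" ($1 == "" ? "na" : $1)      # prefix the set number, label empty genes "na"
        print
    }
' genes.tsv > sets.tsv

cat sets.tsv
```

Because the counter depends on the previous row only, a gene that reappears later in the file would get a new number; sorting or grouping first would be needed if that matters.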


sed may not be necessary here.

$ awk -F'\t' '!$1{ $1="na" }1' your_file.txt | nl -w 1 -s":" -d$'\n'

If you want to use sed, the following should be enough:

$ nl -w 1 -s":" -d$'\n' test.txt | sed 's/:\s\+/:na\t/'

1:na    1:924024
2:na    1:924310
3:SAMD11    1:930353
4:SAMD11    1:930939
5:NOC2L     1:944858
6:NOC2L     1:946247
7:KLHL17    1:960891
8:KLHL17    1:961945
3.5 years ago

with datamash and awk:

$ datamash -g1  collapse 2 <test.txt  | nl -w1 -v 2| awk -F "\t" -v OFS="\t"  '{($2==" ")?($1=""):($1=$1)}; {($2==" ")?($2="na"):($2=$2)}1' | awk -F "\t" -v OFS='\t' '{split($3,a,",");for(i in a)print $1,$2,a[i]}'| awk -F "\t" -v OFS="\t" '$1=="" {$1=NR};{print $1":"$2,$3}'  

1:na    1:924024
2:na    1:924310
3:SAMD11    1:930353
3:SAMD11    1:930939
4:NOC2L 1:944858
4:NOC2L 1:946247
5:KLHL17    1:960891
5:KLHL17    1:961945

works on OP text (tab separated):

$ cat test.txt 
    1:924024
    1:924310
SAMD11  1:930353
SAMD11  1:930939
NOC2L   1:944858
NOC2L   1:946247
KLHL17  1:960891
KLHL17  1:961945

I tried this and got an error:

invalid input: field 2 requested, line 8 has only 1 fields


usr$ head input
OR4F5   1:69063
FO538757.1      1:183937
AL669831.3      1:601436
AL669831.3      1:601667
AL669831.3      1:609395
AL669831.3      1:609407
AL669831.3      1:611317
1:923421
1:924024
1:924310
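
The error message suggests that rows 8 onward contain a single field, i.e. there is no leading tab in front of the position. A rough sketch of one way to pad such rows with an empty first column before running the pipeline follows. It assumes the file really is tab-separated, and the file names input.tsv and padded.tsv are hypothetical.

```shell
# Hypothetical sample in this second format: gene-less rows have no leading tab.
printf 'OR4F5\t1:69063\nAL669831.3\t1:601436\nAL669831.3\t1:601667\n1:923421\n1:924024\n' > input.tsv

# Prepend an empty first column to rows that have only one field.
awk -F'\t' -v OFS='\t' 'NF == 1 { $0 = OFS $0 } 1' input.tsv > padded.tsv

# Sanity check: list the distinct field counts per row after padding.
awk -F'\t' '{ print NF }' padded.tsv | sort -u
```

After padding, every row has two tab-separated fields, which is the layout shown in the original question.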

Please format the fields the way the code expects, and please post the exact data format: the format in your original post and this one are different. I changed the code to handle both formats and made it more generic. Here is the output for the current data:

$ cat test.txt

OR4F5   1:69063
FO538757.1  1:183937
AL669831.3  1:601436
AL669831.3  1:601667
AL669831.3  1:609395
AL669831.3  1:609407
AL669831.3  1:611317
    1:923421     
    1:924024     
    1:924310     

$ awk -F "\t" -v OFS="\t" '$1==" " {$1="na_"NR}1' test.txt | datamash -g1  collapse 2 | awk -F "\t" -v OFS='\t' '{split($2,a,",");for(i in a) print NR,$1,a[i]}' | awk -v OFS="\t" '{gsub(/_.*/,"",$2)}{print $1":"$2,$3}'

1:OR4F5 1:69063
2:FO538757.1    1:183937
3:AL669831.3    1:601436
3:AL669831.3    1:601667
3:AL669831.3    1:609395
3:AL669831.3    1:609407
3:AL669831.3    1:611317
4:na    1:923421
5:na    1:924024
6:na    1:924310