Question

How to write a code in Linux to manipulate a large file with grep/awk

0

5.8 years ago

OAJn8634 ▴ 60

Following a previous discussion (https://www.biostars.org/p/171557/), I would like to prepare a file for converting chr:pos to rs. For this, I have downloaded a list of all SNPs from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables). It looks like this:

#chrom  chromEnd    name
chr1    100663297   rs1235665665
chr1    1048577     rs1346354302
chr1    62914560    rs538775156

I would like it to look like this:

1:100663297 rs1235665665
1:1048577   rs1346354302
1:62914560  rs538775156

I could do it in R; however, the file is so large that it crashes my computer before it even loads. I was told that Linux commands such as grep and awk are amazing for such things, and can handle very large files efficiently. Unfortunately, I do not have the slightest idea of how to even begin writing the code in Linux to achieve my goal. Could you please help me with this? Thank you very much.

PS I am unsure whether the title of my question describes my problem effectively. Please let me know if it does not and I will edit it.

linux R rs chr awk • 1.5k views

ADD COMMENT • link updated 5.8 years ago by Alex Reynolds 36k • written 5.8 years ago by OAJn8634 ▴ 60

3

This is a really basic question. You should start by searching the web for how awk works. Have a look at the examples on this tutorial site.

Of course I could give you the solution. But then you will not learn that much :)

ADD REPLY • link 5.8 years ago by finswimmer 16k

3

Edit: Following up on finswimmer's comment, here are some other links.

Hello OAJn8634

Please take a look at the man pages for awk, sed, grep, and cut to see how to use them. Show us your best attempt and we will help you modify it to fit your needs.

Awk in Bioinformatics

https://stackoverflow.com/questions/29275971/need-to-remove-the-string-chr-and-the-sign-from-the-file

ADD REPLY • link 5.8 years ago by Bastien Hervé 5.9k

0

Hello Bastien Hervé and finswimmer, Thank you very much for the very useful links to get me started, and for offering to help. I really appreciate it.

ADD REPLY • link 5.8 years ago by OAJn8634 ▴ 60

score 2 · Answer 1 · 2019-01-29

2

5.8 years ago

Alex Reynolds 36k

One way:

$ tail -n+2 in.txt | sed 's/^chr//' | awk -v OFS="\t" '{ print $1":"$2, $3 }' > out.txt

Test it out, first:

$ head -10 in.txt | tail -n+2 | ... > out.txt

Break things down by seeing how a snippet of your file looks after each step.
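A minimal sketch of that step-by-step inspection, assuming a tab-separated input with one header line. The file names sample.txt and sample_out.txt are hypothetical, and the rows are the three lines shown in the question:

```shell
# Build a tiny sample with the same layout as the UCSC export
# (sample.txt is a made-up name for illustration).
printf '#chrom\tchromEnd\tname\nchr1\t100663297\trs1235665665\nchr1\t1048577\trs1346354302\nchr1\t62914560\trs538775156\n' > sample.txt

# Step 1: drop the header line.
tail -n +2 sample.txt

# Step 2: additionally strip the leading "chr" from each line.
tail -n +2 sample.txt | sed 's/^chr//'

# Step 3: join columns 1 and 2 with ":" and keep the rs ID as the second column.
tail -n +2 sample.txt | sed 's/^chr//' | awk -v OFS="\t" '{ print $1":"$2, $3 }' > sample_out.txt
cat sample_out.txt
```

Each stage reads its input line by line, so the same pipeline should scale to the full file without loading it into memory.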

ADD COMMENT • link 5.8 years ago by Alex Reynolds 36k

1

Alternatively:

sed -e '1,1d' -e 's/^chr//' in > out

Or e.g.

awk 'BEGIN{OFS=FS="\t"}NR>1{sub(/^chr/,"",$1);print $0}' in > out

Edit: I didn't notice the ":" requirement. The commands above don't actually do that.

Corrected:

sed -e '1,1d' -e 's/^chr//' -e 's/\t/:/' in > out

awk 'BEGIN{OFS=FS="\t"}NR>1{sub(/^chr/,"",$1);print $1":"$2,$3}' in > out
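For comparison, a single awk call can handle the header, the chr prefix, and the colon in one pass. This is only a sketch, assuming tab-separated input whose header line starts with "#". The file names snps_sample.txt and snps_sample_out.txt are hypothetical, and the rows are taken from the question:

```shell
# Small test input in the same layout as the question (hypothetical file name).
printf '#chrom\tchromEnd\tname\nchr1\t100663297\trs1235665665\nchr1\t1048577\trs1346354302\nchr1\t62914560\trs538775156\n' > snps_sample.txt

awk 'BEGIN { FS = OFS = "\t" }     # read and write tab-separated fields
     /^#/ { next }                 # skip header lines
     { sub(/^chr/, "", $1)         # chr1 becomes 1
       print $1 ":" $2, $3 }       # the comma inserts OFS; plain juxtaposition does not
    ' snps_sample.txt > snps_sample_out.txt

cat snps_sample_out.txt
```

Skipping lines that start with "#" instead of using NR>1 is a design choice here: it still works if the export carries more than one header line, though it would also drop any data line beginning with "#".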

ADD REPLY • link 5.8 years ago by 5heikki 11k

0

Thank you very much 5heikki for your help.

ADD REPLY • link 5.8 years ago by OAJn8634 ▴ 60

0

Dear Alex, Thank you so much for your help. It has worked like magic! Thank you

ADD REPLY • link 5.8 years ago by OAJn8634 ▴ 60

1

Let's see what you have learned: Please describe what this command line is doing. :)

ADD REPLY • link 5.8 years ago by finswimmer 16k