Question

Extract data using awk/sed and output to different files

8.6 years ago

Tao ▴ 540

Hi guys,

I have a specific problem: using awk or sed to split a big file into different files. The big file has this format (3 columns):

C    SRR1_45/1    data...
U    SRR2_34/2    data...
U    SRR1_33/2    data...
C    SRR3_22/1    data...
....

I want to extract lines with SRR1 to SRR1.txt, lines with SRR2 to SRR2.txt ... lines with SRRn to SRRn.txt. The 'SRRi_' prefix should be removed from the output lines. We don't know in advance how many values of n there are.

e.g. SRR1.txt will contain:
C    45/1    data...
U    33/2    data...

I know it's easy to write a Python or Perl script to do it, but is there a shell way that takes advantage of awk or sed? Let me add some details: I have 10 such big files to extract from, and each has more than 1000M lines, so I need an efficient way. The n values are arbitrary IDs, not a sequential range.

Thanks! Tao

awk sed shell • 9.3k views

ADD COMMENT • link updated 8.6 years ago by Alex Reynolds 35k • written 8.6 years ago by Tao ▴ 540

8.6 years ago

kloetzl ★ 1.1k

for ((i=0;i<10;i++)); do grep "SRR${i}_" data | sed "s/SRR[^_]*_//" > "SRR${i}.txt"; done

Increase the limit as necessary. I leave it to you to delete the empty files.

ADD COMMENT • link 8.6 years ago by kloetzl ★ 1.1k

Great! I think it's an easy one-step job if we know how many n there are.

ADD REPLY • link 8.6 years ago by air.chuan.1987 ▴ 20

Hi Kloetzl, Thank you for your reply. Your answer is awesome if i comes from a sequential range. Unfortunately, i is a unique ID that is arbitrary. Sorry for the missing information. Best, Tao

ADD REPLY • link 8.6 years ago by Tao ▴ 540

Well, in that case, just read all possible values of i first.

#!/bin/sh
# Collect the distinct numeric IDs; [^_]* stops at the first underscore.
A=$(grep -o 'SRR[^_]*_' data | sort -u | tr -cd '0-9\n')
for i in $A; do
    grep "SRR${i}_" data | sed "s/SRR[^_]*_//" > "SRR${i}.txt"
done

ADD REPLY • link 8.6 years ago by kloetzl ★ 1.1k

Thank you for following up. Will it be efficient enough for big files with more than 1000M lines?

ADD REPLY • link 8.6 years ago by Tao ▴ 540

The sorting may take a while, but I don't see an easy way around that at the moment without using a "real" programming language. Maybe there is a way in awk to extract the SRRi part and then print $0 > SRR$i.txt, but I am too tired to think about that just now.

ADD REPLY • link 8.6 years ago by kloetzl ★ 1.1k

Thanks kloetzl! @Alex Reynolds just showed me an efficient parallel way, in case you are also interested. Tao.

ADD REPLY • link 8.6 years ago by Tao ▴ 540

8.6 years ago

air.chuan.1987 ▴ 20

I'm also a beginner with the GNU tools, so please excuse me if this seems dumb: first of all, what is the delimiter of the big file? The following code assumes TAB as the delimiter:

First, create a list of all the SRRn IDs:

cut -f 2 input_name | cut -d"_" -f 1 | sort -u > list.txt

Then use a while loop to write one file per ID (matching "${f1}_" rather than $f1 alone, so that SRR1 does not also pick up SRR10):

while read -r f1; do grep "${f1}_" input_name | sed "s/${f1}_//" > "$f1.txt"; done < list.txt

you should be able to get what you need. good luck. :)

Charlie

ADD COMMENT • link 8.6 years ago by air.chuan.1987 ▴ 20

Hi Charlie, Thank you for your reply. Your answer is great and viable. But the file is very big, about 50G, with more than 1000M lines, so I don't think running cut and sort first is very efficient. I also have about 10 such big files to process. Your answer is still perfect for small files. Thanks. Tao.

ADD REPLY • link 8.6 years ago by Tao ▴ 540

Sure, I'd be interested to know the most efficient way of doing this as well. Charlie

ADD REPLY • link 8.6 years ago by air.chuan.1987 ▴ 20

Hi Charlie, @Alex Reynolds gave me an excellent solution. The parallel way should be very efficient. Tao.

ADD REPLY • link 8.6 years ago by Tao ▴ 540

score 4 · Accepted Answer · 2016-04-19

Here is a simple way to do it without sorting and with awk:

$ awk -F'\t' -v OFS='\t' '{ split($2, a, "_"); print $1, a[2], $3 >> (a[1] ".txt") }' foo.txt

The file foo.txt is a three-column tab-delimited text file containing your data.

Using the >> operator appends a line to whatever SRR*.txt file exists. Therefore, if you re-run this one-liner, you must first delete any previously-made SRR*.txt files, or you will get duplicate lines.
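A minimal sketch of a re-runnable version, using the sample lines from the question as input. The file name foo.txt comes from the answer; the placeholder data values, the rm step, and the explicit tab settings (-F and OFS) are assumptions added for illustration.

```shell
# Build a tiny tab-delimited input from the example in the question
# (data1..data4 are placeholder values, not real data).
printf 'C\tSRR1_45/1\tdata1\nU\tSRR2_34/2\tdata2\nU\tSRR1_33/2\tdata3\nC\tSRR3_22/1\tdata4\n' > foo.txt

# Remove output from any earlier run, because >> appends to existing files.
rm -f SRR*.txt

# Split on tabs only, so a third column containing spaces stays intact.
# The target file name is parenthesized to keep the redirection unambiguous.
awk -F'\t' -v OFS='\t' '{ split($2, a, "_"); print $1, a[2], $3 >> (a[1] ".txt") }' foo.txt

# Show the lines that ended up in SRR1.txt.
cat SRR1.txt
```

Because split() cuts on every underscore, this sketch assumes the second column contains exactly one underscore, as in the question's example.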

This should be pretty fast, as it makes a single pass over the input and does not sort on IDs. It would probably be faster to use a Perl-based approach that opens a pool of file handles, but this should work fine.

Further, if you don't care about the order of lines in the split files, you could use GNU Parallel with this one-liner to split multiple files foo1.txt, foo2.txt, etc. simultaneously. Doing the work in parallel may hit a file I/O bottleneck but could give you an overall speed boost, if you use SSDs or other fast storage.
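A sketch of the multi-file case, assuming inputs named foo1.txt and foo2.txt. It uses plain shell background jobs as a stand-in for GNU Parallel so that it runs without extra tools; GNU Parallel would replace the loop and could also cap the number of concurrent jobs. Writing each input's splits into its own directory (foo1/, foo2/) is an assumption made here so that concurrent jobs never append to the same file; split_one is a hypothetical helper name.

```shell
# Two small tab-delimited inputs standing in for the real files.
printf 'C\tSRR1_45/1\tdataA\nU\tSRR2_34/2\tdataB\n' > foo1.txt
printf 'U\tSRR1_33/2\tdataC\nC\tSRR3_22/1\tdataD\n' > foo2.txt

# split_one (hypothetical helper): split one input into <input-name>/SRR*.txt.
split_one() {
    src=$1
    out=${src%.txt}            # foo1.txt -> foo1
    mkdir -p "$out"
    rm -f "$out"/SRR*.txt      # start clean, because >> appends
    awk -F'\t' -v OFS='\t' -v dir="$out" \
        '{ split($2, a, "_"); print $1, a[2], $3 >> (dir "/" a[1] ".txt") }' "$src"
}

# Run one background job per input, then wait for all of them to finish.
for f in foo1.txt foo2.txt; do
    split_one "$f" &
done
wait

ls foo1 foo2
```

If a single combined SRRn.txt per ID is wanted afterwards, the per-directory pieces can be concatenated with cat once all jobs have finished.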