Extract data using awk/sed and output to different files
8.6 years ago
Tao ▴ 540

Hi guys,

I have a specific problem with using awk or sed to split a big file into several files. The big file has three columns, in this format:

C    SRR1_45/1    data...
U    SRR2_34/2    data...
U    SRR1_33/2    data...
C    SRR3_22/1    data...
....

I want to extract lines with SRR1 to SRR1.txt, lines with SRR2 to SRR2.txt, ..., lines with SRRn to SRRn.txt. The output lines should have the 'SRRi_' prefix removed. But we don't know how many values of n there are.

e.g. SRR1.txt will contain:
C    45/1    data...
U    33/2    data...

I know it's easy to write a Python or Perl script to do it. But is there a shell way to do it, taking advantage of awk or sed? Let me add some details: I have 10 such big files to process, and each has more than 1000M lines, so I need to find an efficient way. The n values are arbitrary IDs, not a sequential range.

Thanks! Tao

awk sed shell • 9.3k views
8.6 years ago

Here is a simple way to do it without sorting and with awk:

$ awk -F'\t' '{ split($2, a, "_"); print $1"\t"a[2]"\t"$3 >> (a[1]".txt"); }' foo.txt

The file foo.txt is a three-column tab-delimited text file containing your data.

Using the >> operator appends a line to whatever SRR*.txt file exists. Therefore, if you re-run this one-liner, you must first delete any previously-made SRR*.txt files, or you will get duplicate lines.
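A rerun-safe variant can be sketched as follows. It assumes tab-delimited input and writes into a fresh output directory; the directory name `split_out` and the sample rows are hypothetical, made up only so the snippet runs on its own. Within a single awk run, `>` truncates each output file the first time it is opened and then keeps appending to it, so no leftover lines survive from an earlier run.

```shell
#!/bin/sh
# Sketch only: rerun-safe version of the one-liner above.
set -eu

# Hypothetical sample input standing in for foo.txt (tab-delimited).
printf 'C\tSRR1_45/1\tdata1\nU\tSRR2_34/2\tdata2\nU\tSRR1_33/2\tdata3\nC\tSRR3_22/1\tdata4\n' > foo.txt

# Start from an empty output directory so stale files never linger.
outdir=split_out
rm -rf "$outdir"
mkdir -p "$outdir"

awk -F'\t' -v dir="$outdir" '{
    split($2, a, "_")                                   # a[1] = SRR1, a[2] = 45/1
    print $1 "\t" a[2] "\t" $3 > (dir "/" a[1] ".txt")  # one file per ID
}' foo.txt
```

Running it twice in a row should leave the same files as running it once.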

This should be pretty fast, as you're not sorting on IDs. It would be faster, probably, to use a Perl-based approach that opens a pool of file handles, but this should work fine.

Further, if you don't care about the order of lines in the split files, you could use GNU Parallel with this one-liner to split multiple files foo1.txt, foo2.txt, etc. simultaneously. Doing the work in parallel may hit a file I/O bottleneck but could give you an overall speed boost, if you use SSDs or other fast storage.
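As a rough sketch of the parallel idea, here is one way to run one awk process per input file. Two things are my own assumptions, not part of the answer above: plain shell background jobs stand in for GNU Parallel, and each input writes to its own directory (`out_foo1`, `out_foo2`, ...) so that concurrent processes never append to the same file. The file names and sample rows are hypothetical.

```shell
#!/bin/sh
# Sketch only: split several inputs concurrently, then merge per ID.
set -eu

# Hypothetical sample inputs (tab-delimited).
printf 'C\tSRR1_45/1\tdataA\nU\tSRR2_34/2\tdataB\n' > foo1.txt
printf 'U\tSRR1_33/2\tdataC\nC\tSRR3_22/1\tdataD\n' > foo2.txt

split_one() {
    in=$1
    out="out_$(basename "$in" .txt)"    # e.g. out_foo1
    rm -rf "$out"
    mkdir -p "$out"
    awk -F'\t' -v dir="$out" '{
        split($2, a, "_")
        print $1 "\t" a[2] "\t" $3 > (dir "/" a[1] ".txt")
    }' "$in"
}

for f in foo1.txt foo2.txt; do
    split_one "$f" &    # one background job per input file
done
wait                    # block until every job has finished

# Merge the per-input pieces into one file per ID.
rm -rf merged
mkdir -p merged
for part in out_*/*.txt; do
    cat "$part" >> "merged/$(basename "$part")"
done
```

With GNU Parallel the same per-file command would be given once, with the input names listed after `:::`. Writing to separate directories costs a merge step, but it avoids lines from different processes interleaving in a shared output file.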


Thanks Alex! Your answer is amazing, especially the parallel way you introduced to me. Thank you so much! Best, Tao

8.6 years ago
kloetzl ★ 1.1k

for ((i=0;i<10;i++)); do grep "SRR${i}_" data | sed "s/SRR${i}_//" > "SRR${i}.txt"; done

Increase the limit as necessary. I leave it to you to delete the empty files.


Great! I think it's an easier one-step job if we know how many n there are.


Hi Kloetzl, Thank you for your reply. Your answer is awesome if i comes from a sequential range. But unfortunately, i represents a unique ID, which is arbitrary. Sorry for the missing information. Best, Tao


Well, in that case, just read all possible values of i first.

#!/bin/sh
# Collect the distinct numeric IDs first, then extract each one in turn.
A=$(grep -o 'SRR[0-9]*_' data | sort -u | tr -cd '0-9\n')
for i in $A; do
    grep "SRR${i}_" data | sed "s/SRR${i}_//" > "SRR${i}.txt"
done

Thank you for following up. Will it be efficient for big files with more than 1000M lines?


The sorting may take a while, but I don't see an easy way around that at the moment without using a "real" programming language. Maybe there is a way in awk to extract the SRRi part and then print $0 > SRR$i.txt, but I am too tired to think about that just now.
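For what it's worth, the awk idea floated here could look roughly like the sketch below: copy column 2, trim it down to the SRRi part, strip the prefix from the column itself, and print the rebuilt line to a file named after the ID. It assumes tab-delimited input; the sample rows written to `data` are hypothetical and exist only so the snippet runs on its own. No sorting pass is needed.

```shell
#!/bin/sh
# Sketch only: single pass over the file, no sort and no per-ID grep.
set -eu

# Hypothetical sample input (tab-delimited).
printf 'C\tSRR1_45/1\tdata1\nU\tSRR2_34/2\tdata2\nU\tSRR1_33/2\tdata3\nC\tSRR3_22/1\tdata4\n' > data

awk 'BEGIN { FS = OFS = "\t" } {
    id = $2
    sub(/_.*/, "", id)        # keep the part before the first "_", e.g. SRR1
    sub(/^[^_]*_/, "", $2)    # drop the "SRR1_" prefix from column 2
    print > (id ".txt")       # $0 is rebuilt with OFS after $2 changes
}' data
```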


Thanks kloetzl! @Alex Reynolds just showed me an efficient parallel approach. Take a look if you are also interested. Tao.

8.6 years ago

I'm also a beginner with GNU tools, so please excuse me if this seems dumb to you. First of all, what is the delimiter of the big file? The following code assumes TAB as the delimiter:

  1. create a list including all the SRRn

    cut -f 2 input_name |cut -d"_" -f 1 |sort |uniq > list.txt

  2. use while loop to get all you need

    while read -r f1; do grep "${f1}_" input_name | sed "s/${f1}_//" > "$f1.txt"; done < list.txt

You should be able to get what you need. Good luck. :)

Charlie


Hi Charlie, Thank you for your reply. Your answer is great and viable. But the file is very big, about 50G, with more than 1000M lines, so I think it's not very efficient to run cut and sort first. And I have about 10 such big files to process. Your answer is still perfect for small files, though. Thanks. Tao.


Sure, I'll be interested to know the most efficient way of doing this as well. Charlie


Hi Charlie, @Alex Reynolds gave me an excellent solution. It will be very efficient to use the parallel approach. Tao.
