Find matching rows based on SNP IDs
9.2 years ago
gulcek ▴ 20

I have two files; one is about 7 MB and the other is 1 GB. The first file looks like this:

rs58108140    A/G    chr1    10582
rs10218492    A/G    chr1    10827
rs10218493    A/G    chr1    10903

and the second one is:

rs58108140
rs10218493
rs11240777

I need to look up the second file's SNPs in the first file and return the matching rows. The output should look like this:

rs58108140    A/G    chr1    10582
rs10218493    A/G    chr1    10903

I tried loading both files into a database and taking the inner join on the SNP IDs, but the query is still running. Do you know a faster solution?


I strongly recommend alesssia's solution! But just for the record: have you built an index on the columns that you join on (i.e., the columns containing the IDs)? Otherwise, it is not surprising that the query takes forever ;-)

9.2 years ago
alesssia ▴ 580

What you want to do is extract the lines in file1 that match (grep, in Unix) the strings listed in file2 (-f), treating them as fixed strings rather than regular expressions (-F), and save the result in file3, that is:

grep -F -f file2 file1 > file3
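
One caveat worth illustrating: with -F alone, an ID matches anywhere in a line, so a shorter ID also pulls out rows whose ID merely contains it. Below is a minimal sketch of how adding -w restricts matches to whole words. The file names (demo_file1, demo_file2, demo_loose, demo_strict) are placeholders, the rows are the ones from the question, and rs1021849 is a made-up ID chosen only because it is a prefix of two real ones.

```shell
# Hypothetical demo files; the rows are the sample rows from the question.
printf 'rs58108140\tA/G\tchr1\t10582\nrs10218492\tA/G\tchr1\t10827\nrs10218493\tA/G\tchr1\t10903\n' > demo_file1
# rs1021849 is a made-up ID that is a prefix of rs10218492 and rs10218493.
printf 'rs58108140\nrs1021849\n' > demo_file2

# Fixed-string match only: the short ID also hits the two longer IDs.
grep -F -f demo_file2 demo_file1 > demo_loose

# -w requires the match to be a whole word, so only the rs58108140 row is kept.
grep -w -F -f demo_file2 demo_file1 > demo_strict
```

Here demo_loose ends up with all three rows, while demo_strict contains only the rs58108140 row.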

Yup, like this, but make sure LC_ALL=C is set when you're using large files, or you'll be there all day! Either add export LC_ALL=C to your ~/.bashrc and re-source the file (source ~/.bashrc), or set it manually in the terminal.

LC_ALL=C grep -wFf file2 file1 > file3

You mean LC_ALL=C, not LS_ALL


I do indeed. Brain not engaged fully ...


Isn't grep going to do an "all vs all" comparison? With 7 MB and 1 GB files it might take some time...
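
If that is a worry, one alternative (not proposed in this thread, so treat it as a sketch rather than a tested recommendation) is a hash lookup in awk: load the IDs into an array, then stream the rows file once and test each first column against it. The file names below are placeholders, and this assumes the ID list fits in memory and that the ID is the first whitespace-separated column.

```shell
# Hypothetical demo files built from the sample data in the question.
printf 'rs58108140\tA/G\tchr1\t10582\nrs10218492\tA/G\tchr1\t10827\nrs10218493\tA/G\tchr1\t10903\n' > demo_rows
printf 'rs58108140\nrs10218493\nrs11240777\n' > demo_ids

# NR == FNR is true only while awk reads the first file (demo_ids):
# remember each ID as an array key, then skip to the next line.
# For the second file (demo_rows), print lines whose first column is a known ID.
awk 'NR == FNR { ids[$1]; next } $1 in ids' demo_ids demo_rows > demo_matched
```

On the sample data, demo_matched holds the rs58108140 and rs10218493 rows, in the order they appear in demo_rows.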

9.2 years ago

Not tested for speed but this should be quite fast:

join <(sort -k1,1 aa.txt) <(sort -k1,1 bb.txt) | uniq > ab.txt

Skip the sort commands if the files are already sorted, of course.
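
A small sketch of the same idea on the sample data, with two details that are easy to trip over: join expects both inputs sorted in the same collation order (forcing LC_ALL=C on both sort and join is one way to guarantee that), and by default it splits on blanks and writes space-separated output, so -t is used here to keep the tabs. The file names are placeholders, and temporary sorted files stand in for the process substitution so the sketch also runs in a plain POSIX sh.

```shell
# Hypothetical demo files built from the sample data in the question.
printf 'rs58108140\tA/G\tchr1\t10582\nrs10218492\tA/G\tchr1\t10827\nrs10218493\tA/G\tchr1\t10903\n' > demo_aa
printf 'rs58108140\nrs10218493\nrs11240777\n' > demo_bb

# A literal tab, used as the field separator for sort and join.
TAB="$(printf '\t')"

# Sort both files on the first field using the same (byte-wise) collation.
LC_ALL=C sort -t "$TAB" -k1,1 demo_aa > demo_aa.sorted
LC_ALL=C sort -t "$TAB" -k1,1 demo_bb > demo_bb.sorted

# Join on the first field of each file; output stays tab-separated.
LC_ALL=C join -t "$TAB" demo_aa.sorted demo_bb.sorted > demo_ab
```

Note that the result comes out in sorted ID order (rs10218493 before rs58108140), not in the original order of the rows file.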
