Question

Filter dosage file by list of SNP IDs


3.3 years ago

James ▴ 10

Hello, does anyone by any chance know of a fast/computationally efficient way to select lines in a .dosage file if the first column's SNP ID is also contained within a .txt document of SNP IDs?

The .dosage file is in the following format:

SNPID Position REF ALT Sample1Dosage Sample2Dosage Sample3Dosage . . .
1:100:A:C A C 0 2 1 . . .
1:101:C:T C T 1 2 1 . . .
. . .

The list of SNP IDs in a .txt document is in the following format:

1:100:A:C
1:101:C:T
1:103:G:A
1:105:C:T

. . .

I have tried grep -f snp_IDs.txt example.dosage > filtered_example.dosage, but it is too slow: the job hits my server's maximum wall time before it finishes.

dosage snp genomics • 727 views


score 1 · Answer 1 · 2021-08-28

Found the solution myself, but keeping this question up for others who may run into the same problem. Instead of using:

grep -f snp_IDs.txt example.dosage > filtered_example.dosage

Use:

grep -F -f snp_IDs.txt example.dosage > filtered_example.dosage

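This runs extremely fast, because -F makes grep treat each line of snp_IDs.txt as a fixed string rather than a regular expression (so it only works if you don't need regex patterns in the filter).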
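One caveat for anyone reusing this: grep -F still matches the ID anywhere in the line, so 1:100:A:C would also pull in rows such as 11:100:A:C or 1:100:A:CT. If that matters for your data, an exact match on the first column with awk is a possible alternative. The following is only a sketch: the miniature input files and their header line are made up for illustration, and it assumes the dosage file is whitespace-delimited with the SNP ID in column 1 and a single header line.

```shell
# Hypothetical miniature inputs, for illustration only (not real data).
printf '%s\n' \
  'SNPID REF ALT S1 S2 S3' \
  '1:100:A:C A C 0 2 1' \
  '1:100:A:CT A CT 1 1 0' \
  '1:101:C:T C T 1 2 1' \
  '11:100:A:C A C 2 0 0' > example.dosage
printf '%s\n' '1:100:A:C' '1:101:C:T' > snp_IDs.txt

# While reading the first file (NR==FNR), store each SNP ID as an array key.
# While reading the second file, keep the header line (FNR==1) and every row
# whose first column is exactly one of the stored IDs.
awk 'NR==FNR { ids[$1]; next } FNR==1 || ($1 in ids)' \
  snp_IDs.txt example.dosage > filtered_example.dosage
```

The ID list is held in a hash, so each dosage line costs one lookup regardless of how many IDs there are. Drop the FNR==1 condition if your file has no header line.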