Hi all!
I have a big .bed file, about 100 GB, and I want to extract specific columns from it in a timely manner. I was wondering whether there is a tool that works well for huge BED files (possibly bed.gz). Could you please tell me if you know an excellent tool (or Python package) that can help me generate my desired output?
The head of the data looks like this:
chr1 10006 10018 M6176_1.02-NR2F1 0.00117 + sequence=taaccctaaccc
chr1 10006 10020 M6432_1.02-PPARD 0.00034 + sequence=taaccctaacccta
chr1 10008 10030 M6456_1.02-RREB1 0.00014 - sequence=GGGTTAGGGTTAGGGTTAGGGT
And imagine I build my output from the exact values of the 1st, 2nd, 3rd, 5th, and 6th columns, plus only the TF name from the 4th column (the part after the hyphen). So for the first line of my data, I would like the output to be as follows:
chr1 10006 10018 NR2F1 0.00117 +
I know I can extract it using awk with the command below, but I hope there are faster ways to do it:
awk 'BEGIN{OFS="\t"} $5<0.01 {sub(/.*-/,"",$4); print $1,$2,$3,$4,$5,$6}' file.bed
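For the bed.gz case, I assume the same command could read from a zcat stream (file name is just a placeholder):
zcat file.bed.gz | awk 'BEGIN{OFS="\t"} $5<0.01 {sub(/.*-/,"",$4); print $1,$2,$3,$4,$5,$6}'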
My second question is whether you can recommend a way to filter rows on one column's value, for example keeping a row only when the 5th column is less than 0.00001.
Thank you very much!
Since you are interested in low-level text processing, I'd say that awk or basic Python would be the way to go. Higher-level parsers provide useful shortcuts, but usually at the cost of speed.
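For reference, here is a minimal plain-Python sketch of that approach. It assumes whitespace-delimited fields, that the TF name is everything after the last '-' in column 4, and it uses the 0.00001 threshold from your second question; the file paths are placeholders:

import gzip
import sys

THRESHOLD = 0.00001  # keep rows whose 5th column is below this value

def open_maybe_gzip(path):
    # Read .gz input transparently, plain text otherwise
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

with open_maybe_gzip(sys.argv[1]) as fh:
    for line in fh:
        f = line.split()
        if float(f[4]) >= THRESHOLD:
            continue
        # Keep only the TF name: everything after the last '-' in column 4,
        # e.g. "M6176_1.02-NR2F1" -> "NR2F1"
        tf = f[3].rsplit("-", 1)[-1]
        print(f[0], f[1], f[2], tf, f[4], f[5], sep="\t")

Run it as: python extract_tf.py file.bed > out.bed (or on the .bed.gz directly). On a 100 GB file a single streaming pass like this is largely I/O-bound either way; if you stay with awk, mawk is often noticeably faster than gawk for jobs like this.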