Hi all!
I have a big .bed file, about 100 GB, and I want to extract specific columns from it in a timely manner. I was wondering whether there is a tool that works well for huge BED files (possibly bed.gz). Could you please tell me if you know an excellent tool (or Python package) that can help me generate my desired output?
The head of the data looks like this:
chr1 10006 10018 M6176_1.02-NR2F1 0.00117 + sequence=taaccctaaccc
chr1 10006 10020 M6432_1.02-PPARD 0.00034 + sequence=taaccctaacccta
chr1 10008 10030 M6456_1.02-RREB1 0.00014 - sequence=GGGTTAGGGTTAGGGTTAGGGT
And imagine I build my output from the exact values of the 1st, 2nd, 3rd, 5th, and 6th columns, plus only the TF name from the 4th column (the part after the hyphen). So for the first line of my data, I would like the output to be as follows:
chr1 10006 10018 NR2F1 0.00117 +
I know I can extract it using awk with the command below, but I hope there are faster ways to do it:
awk 'BEGIN{OFS="\t"} $5<0.01 {sub(/.*-/,"",$4); print $1,$2,$3,$4,$5,$6}' file.bed
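For the bed.gz case, I assume the same command could read from a zcat stream (file name is just a placeholder):
zcat file.bed.gz | awk 'BEGIN{OFS="\t"} $5<0.01 {sub(/.*-/,"",$4); print $1,$2,$3,$4,$5,$6}'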
My second question is whether you can recommend a way to filter rows on one column's value, for example keeping a row only when the 5th column is less than 0.00001.
Thank you very much!
Since you are interested in low-level text processing, I'd say that awk or basic Python would be the way to go. Higher-level parsers provide useful shortcuts, but usually at the cost of speed.
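For reference, here is a minimal plain-Python sketch of that approach. It assumes whitespace-delimited fields, that the TF name is everything after the last '-' in column 4, and it uses the 0.00001 threshold from your second question; the file paths are placeholders:

import gzip
import sys

THRESHOLD = 0.00001  # keep rows whose 5th column is below this value

def open_maybe_gzip(path):
    # Read .gz input transparently, plain text otherwise
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

with open_maybe_gzip(sys.argv[1]) as fh:
    for line in fh:
        f = line.split()
        if float(f[4]) >= THRESHOLD:
            continue
        # Keep only the TF name: everything after the last '-' in column 4,
        # e.g. "M6176_1.02-NR2F1" -> "NR2F1"
        tf = f[3].rsplit("-", 1)[-1]
        print(f[0], f[1], f[2], tf, f[4], f[5], sep="\t")

Run it as: python extract_tf.py file.bed > out.bed (or on the .bed.gz directly). On a 100 GB file a single streaming pass like this is largely I/O-bound either way; if you stay with awk, mawk is often noticeably faster than gawk for jobs like this.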