Question

Is there a more elegant way to print the list of all regions from a bed file?

0

Entering edit mode

2.0 years ago

Axzd ▴ 80

Hello, I have a bed file structured so

Chrom_3 0   20354777
Chrom_1 0   18146847
Chrom_5 0   16930519

I want this output (switch from 0based to 1-based)

Chrom_3 1
Chrom_3 2
Chrom_3 3
Chrom_3 4
Chrom_3 5
Chrom_3 6
Chrom_3 7
Chrom_3 8
Chrom_3 9

The solution I found is

bedtools makewindows -b vaga.bed -w 1 |awk '{print $1 "\t" $2}' > candidate_sites

It works but it seems to me awk alone might have been able to do it. I am very used to bedtools but I am sure there is an awkish way to do that. I just wonder how you deal with the "for each chromosome". Thanks

awk • 1.4k views

ADD COMMENT • link 2.0 years ago by Axzd ▴ 80

0

Entering edit mode

What is the third number in first block for? You want the output example go up to the number but the output stays two-column as shown?

ADD REPLY • link 2.0 years ago by GenoMax 153k

0

Entering edit mode

yes, the first file is a 0 based chrom_x start end and I need a file with each line giving a single positional point
I know, it's not the sharpest bioinformatics question ^^'

ADD REPLY • link 2.0 years ago by Axzd ▴ 80

score 2 · Accepted Answer · 2023-07-31

What exactly do you want to achieve? Take the record Chrom_3 0 20354777, move it by 1 and split it up to 20354776 intervals of 1bp?

If so, I recommend to stick with bedtools. awk is a general purpose software that can do amazing things, but bedtools is aware that you are working with genomic intervals. For example, bedtools shift takes a reference genome file and ensures that you are not shifting your intervals to positions larger than your scaffold. Unlikely to happen when shifting just by 1bp, but otherwise handy.

But if you want to use awk, you can of course:

awk 'BEGIN{OFS="\t"} /Chrom_3/ {i = $2+1; do { print $1,i; ++i } while (i <= $3) }'

This example comprises some optional elements you might or might not want to use:

BEGIN{OFS="\t"} declares that all fields should be tab-separated in the output. Very handy when you want to print many columns you do not explicitly want to type the seperator for.
/Chrom_3/ matches only records on chromosome 3. Partial matches are also possible with e.g. $1 ~ "_3"
The remainder is just the loop. How you add the offset doesn't matter - directly to i or later when printing.