Hi y'all,
I have a .bed file currently formatted like this:
chr start stop
1 14520 14812
1 65409 65725
1 65731 66073
1 69381 69700
1 721281 722042
1 752816 753135
I would like to get something that looks like this, with a gene name column added. For overlapping exons, use the exon boundary, unless this boundary is <10 bp, in which case expand the probes to include at least 10 bp of sequence:
chr start stop name
1 69090 70008 OR4F5
1 565876 566576
1 801642 802733
1 861321 861393 SAMD11
1 865534 865716 SAMD11
Is there a way that I can use UCSC or another tool to accomplish this?
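For the minimum-size part of the request only, here is a rough sketch in awk. It assumes that "<10bp" means an interval whose length (stop minus start) is under 10, and that padding should be split evenly on both sides; both are my reading of the rule, not something stated above. The function name pad_bed is made up for illustration. It does not merge overlapping exons or add gene names, which would need an annotation file.

```shell
# pad_bed MIN < in.bed > out.bed
# Hypothetical helper: widen any interval shorter than MIN bp (default 10)
# so that it spans exactly MIN bp. Output is tab-separated.
pad_bed() {
  awk -v min="${1:-10}" 'BEGIN { OFS = "\t" }
    # Header or other non-numeric rows: re-emit with tabs, no arithmetic.
    $2 !~ /^[0-9]+$/ { $1 = $1; print; next }
    {
      len = $3 - $2                 # interval length as stop minus start
      if (len < min) {
        pad  = min - len            # total bases to add
        left = int(pad / 2)         # half (rounded down) goes upstream
        $2 -= left
        $3 += pad - left            # the remainder goes downstream
        # Do not run off the start of the chromosome; shift right instead.
        if ($2 < 0) { $3 -= $2; $2 = 0 }
      }
      $1 = $1                       # force tab-separated output for every row
      print
    }'
}
```

Usage would be something like `pad_bed 10 < exons.bed > probes.bed`, where exons.bed is whatever your input file is called.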
Improving this with a pipe from curl to gunzip (without writing gtf.gz):
:-)
Thanks for the feedback. Sometimes it is useful to keep the annotation file, but only in compressed form to save disk space, extracting it as needed for analyses. Network access to download a large file can be a costly part of an analysis in terms of time, especially if the download is repeated unnecessarily. In any case, there are lots of ways to use wget or curl to follow either approach.
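As a minimal sketch of the two approaches (streaming without writing the gtf.gz, versus keeping the compressed copy and decompressing on the fly), something like the following should work. The URL is a placeholder and the function names are made up for illustration; substitute the annotation file you actually use.

```shell
# Placeholder URL (an assumption, not a real annotation file).
GTF_URL="${GTF_URL:-https://example.org/annotation.gtf.gz}"

# Approach 1: stream the download straight into gunzip; no gtf.gz is written.
# A wget equivalent would be: wget -q -O - "$1" | gunzip -c
stream_gtf() {
  curl -fsSL "$1" | gunzip -c
}

# Approach 2: keep the compressed file on disk and download it only once.
# $1 is the URL, $2 is the local .gtf.gz path.
fetch_gtf_once() {
  [ -s "$2" ] || curl -fsSL -o "$2" "$1"
}

# Decompress to stdout each time it is needed; the .gz file stays as it is.
read_gtf() {
  gunzip -c "$1"
}
```

For example, `stream_gtf "$GTF_URL" | head` for a one-off look, or `fetch_gtf_once "$GTF_URL" annotation.gtf.gz` followed by `read_gtf annotation.gtf.gz | head` when the file will be reused.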