After installing BEDOPS, here's a way to get a BED file of genes from a central reference like Gencode:
wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "gene"' - \
| convert2bed -i gff --attribute-key="gene_name" - \
> genes.bed
Likewise, for exons:
wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "exon"' - \
| convert2bed -i gff --attribute-key="gene_name" - \
> exons.bed
Via: https://bioinformatics.stackexchange.com/questions/895/how-to-obtain-bed-file-with-coordinates-of-all-genes
(Using the --attribute-key="gene_name"
option with convert2bed
will bring in HGNC symbol names, which in some contexts can be more commonly used (and useful) for gene names than Ensembl IDs.)
If you want specific genes (say you have a list of gene names or symbols in a file called genes_of_interest.txt
, you can use grep
:
grep -wfF genes_of_interest.txt genes.bed > genes_of_interest.bed
To do mapping of genes to a set of reads of interest, you can use BEDOPS bam2bed
to convert BAM and bedmap
to map:
bam2bed < reads.bam > reads.bed
bedmap --echo --echo-map-id-uniq reads.bed genes_of_interest.bed > answer.bed
Thank you for this detailed answer. BEDOPS seems like a wonderful tool.