Protein coding mm10 refseq bed
6.1 years ago
rbronste ▴ 420

I'm trying to export a BED file from the UCSC Table Browser with protein-coding gene body locations in mm10, containing the following columns:

chr start end NA genename NMname strand

Not sure if there is a more straightforward way to get this arrangement, thanks!

refseq bed mm10 • 3.8k views
6.1 years ago

In the Table Browser, set the output format to "selected fields from primary and related tables" and click "get output", then choose the required columns on the selection page.

Link to table browser

Table Browser

Select columns:

Selection  Page

6.1 years ago
vkkodali_ncbi ★ 3.8k

If you are interested in RefSeq data, why not download the GFF3 annotation from NCBI and parse that file? You can download the GFF3 file from the RefSeq FTP site here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus/GFF_interim/interim_GRCm38.p6_top_level_2017-09-26.gff3.gz

A gene can be protein-coding and yet have one or more non-coding transcript variants. Hence, you first need the list of GeneIDs that encode at least one protein, i.e. genes with at least one CDS feature. You can get it by parsing the GFF3 file as follows:

zgrep -v '^#' interim_GRCm38.p6_top_level_2017-09-26.gff3.gz | awk 'BEGIN{FS="\t";OFS="\t"}($3=="CDS"){print $9}' | grep -o 'GeneID:[0-9]*' | sort -u > ~/GRCm38.p6_protein_coding_genes.txt

Then, you can grep for those GeneIDs in the rows of the GFF3 file where column 3 is gene to get the entire range of each gene and its strand. It is unclear to me whether you are interested in just the range of each gene or of each transcript variant (because one of your columns is NM). Depending on exactly what you want, it is fairly easy to come up with an appropriate unix command to parse the GFF3 file and return a BED-style file.
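As a rough sketch of the gene-level variant, the two steps can be folded into a single awk pass: remember every GeneID seen on a CDS row, hold the gene rows, and print only the held rows whose GeneID is in that set. The function name gff3_to_gene_bed is made up for illustration, and the attribute layout (Dbxref=GeneID:..., gene=..., Name=...) is an assumption about how the NCBI GFF3 is formatted. Because one gene can have several NM transcripts, column 6 carries the GeneID here as a stand-in for the NM accession.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): reads uncompressed GFF3 on
# stdin and writes one BED-style line per protein-coding gene to stdout.
# Output columns: chrom, start (0-based), end, NA, gene name, GeneID, strand.
gff3_to_gene_bed() {
  awk '
    # Return the numeric GeneID found in the attribute column, or "".
    function gene_id(attrs) {
      if (match(attrs, /GeneID:[0-9]+/))
        return substr(attrs, RSTART + 7, RLENGTH - 7)
      return ""
    }
    # Return the value of key=value in a ";"-separated attribute string.
    function attr(attrs, key,    n, i, parts) {
      n = split(attrs, parts, ";")
      for (i = 1; i <= n; i++)
        if (index(parts[i], key "=") == 1)
          return substr(parts[i], length(key) + 2)
      return ""
    }
    BEGIN { FS = "\t"; OFS = "\t" }
    /^#/ { next }                      # skip header and directive lines
    $3 == "CDS" {                      # a CDS row marks its gene as coding
      id = gene_id($9)
      if (id != "") coding[id] = 1
      next
    }
    $3 == "gene" {                     # hold gene rows until all CDS are seen
      id = gene_id($9)
      if (id == "") next
      name = attr($9, "gene")
      if (name == "") name = attr($9, "Name")
      n_genes++
      ids[n_genes] = id
      # GFF3 starts are 1-based; BED starts are 0-based, hence $4 - 1.
      bed[n_genes] = $1 OFS ($4 - 1) OFS $5 OFS "NA" OFS name OFS "GeneID:" id OFS $7
    }
    END {
      for (i = 1; i <= n_genes; i++)
        if (ids[i] in coding) print bed[i]
    }
  '
}
```

It could be run as: zcat interim_GRCm38.p6_top_level_2017-09-26.gff3.gz | gff3_to_gene_bed > GRCm38.p6_protein_coding_genes.bed. Note that column 1 of the NCBI file holds RefSeq sequence accessions rather than UCSC-style chr names, so a renaming step would still be needed for an mm10-style BED. For one line per transcript, the same idea should work on the mRNA rows instead of the gene rows.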
