query UCSC db for CDS coordinates (Gencode)
20 months ago
bitpir ▴ 250

Hi,

I'm not sure how to obtain CDS coordinates from GENCODE using mysql on UCSC. Specifically, I'd like to obtain one BED record per CDS segment, like the output from the website.

I tried to query ... using the following command, but the table only has the cdsStart and cdsEnd columns, not the coordinates of each CDS segment:

mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -Ne 'select * from wgEncodeGencodeCompV43 limit 10' hg38

I would like my output to look like this: (image)

Thanks for your help!

CDS mysql UCSC bed • 1.2k views

You probably need to be looking at the knownGene table.

20 months ago

The UCSC command line tool for this operation is "genePredToBed". To break a BED into one line per exon, use "bedToExons". The tools can be built from their git repo or just downloaded as binaries from here: https://hgdownload.soe.ucsc.edu/admin/exe/
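If you would rather stay with the mysql query, the per-CDS coordinates can be derived from the genePred columns by clipping each exon to the cdsStart/cdsEnd interval. Below is a minimal sketch in awk. It assumes the table has the standard genePred columns (name, chrom, strand, cdsStart, cdsEnd, exonStarts, exonEnds) and that the mysql client prints the exonStarts/exonEnds blobs as comma-separated text. The function name genepred_to_cds_bed is made up for this example; it is not one of the kent tools.

```shell
# Hypothetical helper (not a kent tool): clip every exon to [cdsStart, cdsEnd)
# and print one BED6 line per coding block.
# Expected stdin, tab-separated:
#   name  chrom  strand  cdsStart  cdsEnd  exonStarts  exonEnds
genepred_to_cds_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
  ($4 + 0) < ($5 + 0) {           # non-coding transcripts have cdsStart == cdsEnd
    n = split($6, starts, ",")    # trailing comma leaves an empty last element
    split($7, ends, ",")
    for (i = 1; i <= n; i++) {
      if (starts[i] == "") continue
      s = (starts[i] + 0 > $4 + 0) ? starts[i] + 0 : $4 + 0   # max(exonStart, cdsStart)
      e = (ends[i] + 0 < $5 + 0) ? ends[i] + 0 : $5 + 0       # min(exonEnd, cdsEnd)
      if (s < e) print $2, s, e, $1, 0, $3
    }
  }'
}

# Possible usage with the query from the question (needs network access):
# mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -Ne \
#   'select name, chrom, strand, cdsStart, cdsEnd, exonStarts, exonEnds from wgEncodeGencodeCompV43 limit 10' hg38 \
#   | genepred_to_cds_bed
```

The output columns are chrom, start, end, transcript name, a placeholder score of 0, and strand. Exons that lie entirely in a UTR are dropped, and exons that straddle the CDS boundary are trimmed to it.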

I'm always mystified why Pierre Lindenbaum is rewriting the kent tools from scratch in Java when they already exist in C... The software license cannot be the reason: the command line tools are under an OSS MIT license... Pierre?


1) it's fun :-D

2) there is a bunch of amazing tools under https://hgdownload.soe.ucsc.edu/admin/exe/ but it's hard to discover what's new, what already exists, and what the purpose of each tool is. samtools, gatk, etc. have a release page on GitHub. What is the equivalent for UCSC tools?


"it's hard to discover what's new, what already exists, and what the purpose of each tool is. samtools, gatk, etc. have a release page on GitHub. What is the equivalent for UCSC tools?"

This file contains a list of the tools and what they do: https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/FOOTER.txt

Not easily discoverable.

20 months ago

Using kg2bed: http://lindenb.github.io/jvarkit/KnownGenesToBed.html

$ wget -qO - "https://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/wgEncodeGencodeCompV43.txt.gz" | gunzip -c | java -jar ~/src/jvarkit-git/dist/kg2bed.jar | awk '$6=="CDS"' | head
chr1    65564   65573   +   ENST00000641515.2   CDS Exon 2
chr1    69036   70008   +   ENST00000641515.2   CDS Exon 3
chr1    450739  451678  -   ENST00000426406.4   CDS Exon 1
chr1    685715  686654  -   ENST00000332831.5   CDS Exon 1
chr1    924431  924948  +   ENST00000616016.5   CDS Exon 1
chr1    925921  926013  +   ENST00000616016.5   CDS Exon 2
chr1    930154  930336  +   ENST00000616016.5   CDS Exon 3
chr1    931038  931089  +   ENST00000616016.5   CDS Exon 4
chr1    935771  935896  +   ENST00000616016.5   CDS Exon 5
chr1    939039  939129  +   ENST00000616016.5   CDS Exon 6

See also Hg19 regions for Intergenic, Promoters, Enhancer, Exon, Intron, 5-UTR, 3-UTR and so on
