How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols
3
21
Entering edit mode
10.9 years ago
pristanna ▴ 750

Hi everyone,

I need a bed file containing starts and ends of the exons, gene name and number of the exon - but just for the canonical transcript - because I don't want to have the regions repeated.

I tried UCSC Table browser and BioMart but non of them gives me exactly what I need.

Using UCSC I can get canonical transcripts (but at the same time I couldn't get starts and ends of the exon altogether with the gene name).

Using BioMart I can get starts and ends of the exons altogether with the gene name, but I can't find how to choose the canonical transcripts and that's why the regions are repeated as seen below in the case of NUS1P3:

Chromosome Name    Exon Chr Start (bp)    Exon Chr End (bp)    Associated Gene Name    Exon Rank in Transcript
1    955551  955753  AGRN    1
1    957579  957842  AGRN    2
1    970629  970730  AGRN    3
13     24902349    24903210    NUS1P3    1
13     24902376    24902559    NUS1P3    1
13  24902577    24902961    NUS1P3    2
13     24903006    24903210    NUS1P3    3

Thank you in advance for any suggestions!

bed transcript ucsc biomart • 44k views
ADD COMMENT
45
Entering edit mode
10.9 years ago
pristanna ▴ 750

Download a bed file for the canonical transcripts using UCSC Table Browser:

  • track: UCSC Genes
  • table: knownCanonical
  • output format: select fields from primary and related tables
  • press get output
  • select fields from hg19.knownCanonical: chrom, chromStart, chromEnd,
  • transcript select fields from hg19.kgXref: geneSymbol
  • press get output

The file UCSC_canonical.bed looks like:

#hg19.knownCanonical.chrom      hg19.knownCanonical.chromStart  hg19.knownCanonical.chromEnd    hg19.knownCanonical.transcript  hg19.kgXref.geneSymbol
chr1    11873   14409   uc010nxq.1      DDX11L1
chr1    14361   19759   uc009viu.3      WASH7P
chr1    14406   29370   uc009viw.2      WASH7P
chr1    34610   36081   uc001aak.3      FAM138F
chr1    69090   70008   uc001aal.1      OR4F5
chr1    134772  140566  uc021oeg.2      LOC729737
chr1    321083  321115  uc001aaq.2      DQ597235
chr1    321145  321207  uc001aar.2      DQ599768
chr1    322036  326938  uc009vjk.2      LOC100133331

Download a bed file for all UCSC exons using UCSC Table Browser:

  • track: UCSC Genes
  • table: knownGene
  • output format: BED - browser extensible data
  • press get output
  • select option Exons
  • press get BED

The file UCSC_exons.bed looks like that:

chr1    11873   12227   uc001aaa.3_exon_0_0_chr1_11874_f        0       +
chr1    12612   12721   uc001aaa.3_exon_1_0_chr1_12613_f        0       +
chr1    13220   14409   uc001aaa.3_exon_2_0_chr1_13221_f        0       +
chr1    11873   12227   uc010nxr.1_exon_0_0_chr1_11874_f        0       +
chr1    12645   12697   uc010nxr.1_exon_1_0_chr1_12646_f        0       +
chr1    13220   14409   uc010nxr.1_exon_2_0_chr1_13221_f        0       +
chr1    11873   12227   uc010nxq.1_exon_0_0_chr1_11874_f        0       +
chr1    12594   12721   uc010nxq.1_exon_1_0_chr1_12595_f        0       +
chr1    13402   14409   uc010nxq.1_exon_2_0_chr1_13403_f        0       +
chr1    14361   14829   uc009vis.3_exon_0_0_chr1_14362_r        0       -

Modify the file to separate the transcript name of the rest of information:

awk '{split ($4,a,"_"); {print $1"\t"$2"\t"$3"\t"a[1]"\t"a[3]"\t"$6}}' UCSC_exons.bed > UCSC_exons_modif.bed

The file UCSC_exons_modif.bed:

chr1    11873   12227   uc001aaa.3      0       +
chr1    12612   12721   uc001aaa.3      1       +
chr1    13220   14409   uc001aaa.3      2       +
chr1    11873   12227   uc010nxr.1      0       +
chr1    12645   12697   uc010nxr.1      1       +
chr1    13220   14409   uc010nxr.1      2       +
chr1    11873   12227   uc010nxq.1      0       +
chr1    12594   12721   uc010nxq.1      1       +
chr1    13402   14409   uc010nxq.1      2       +
chr1    14361   14829   uc009vis.3      0       -

Join the sorted files based on the transcript identificator:

join -1 4 -2 4 <(sort -k4 UCSC_exons_modif.bed ) <(sort -k4 UCSC_canonical.bed) | awk '{print $2"\t"$3"\t"$4"\t"$10"\t"$5"\t"$6}' | bedtools sort -i "-" > UCSC_exons_modif_canonical.bed

The final file contains exons of the canonical transcripts:

chr1    11873   12227   DDX11L1 0       +
chr1    12594   12721   DDX11L1 1       +
chr1    13402   14409   DDX11L1 2       +
chr1    14361   14829   WASH7P  0       -
chr1    14406   16765   WASH7P  0       -
chr1    14969   15038   WASH7P  1       -
chr1    15795   15947   WASH7P  2       -
chr1    16606   16765   WASH7P  3       -
chr1    16857   17055   WASH7P  4       -
chr1    16857   17055   WASH7P  1       -
ADD COMMENT
2
Entering edit mode

At this step: transcript select fields from hg19.kgXref: geneSymbol, one need to check kgID in that hg19.kgXref table to produce the expected UCSC_canonical.bed.

Another important thing: exon numeration is always forward! So if the gene is in reverse complement strand than Exon 0 there is the last exon.

ADD REPLY
0
Entering edit mode

Pulling this thread again since I got this error:

header$ join -1 4 -2 4 <(sort -k4 UCSC_exons_modif.bed ) <(sort -k4 UCSC_canonical.bed) | awk '{print $2"\t"$3"\t"$4"\t"$10"\t"$5"\t"$6}' | bedtools sort -i "-" > UCSC_exons_modif_canonical.bed
join: /dev/fd/62:7064: is not sorted: chrX    155255322    155257848    DDX11L family member
ADD REPLY
0
Entering edit mode

Did you solve this issue?

ADD REPLY
1
Entering edit mode

I found doing the following worked to reproduce the output as shown by pristanna, I made just one minor change to the method pristanna detailed:

Download a bed file for the canonical transcripts using UCSC Table Browser:

track: UCSC Genes table: knownCanonical output format: select fields from primary and related tables press get output select fields from hg19.knownCanonical: chrom, chromStart, chromEnd,transcript transcript select fields from hg19.kgXref: geneSymbol press get output

ADD REPLY
0
Entering edit mode

I had the same problem. Just download again both bed from UCSC. Probably you need to wait and confirm the download.

ADD REPLY
0
Entering edit mode

Hi pristanna

Pulling this thread once again for a question:

Why the file "UCSC_canonical.bed" contains all the transcripts of a gene with multiple isoforms? Should'nt a canonical bed contain the "longest one" ? For example, here for the this gene "WASH7P", this file has 2 transcripts; uc009viu.3 and uc009viw.2. What I wanted (or I thought rather) is that it should have the canonical (which is mostly the longest; though there are varied views on that).

chr1    14361   19759   uc009viu.3      WASH7P
chr1    14406   29370   uc009viw.2      WASH7P

If it is expected to contain all the isoforms, then , is there any way from "UCSC" to get ONLY the longest one?

ADD REPLY
0
Entering edit mode

I happened to write a custom python script to fetch the canonical transcript.

ADD REPLY
0
Entering edit mode

There is a consequential error in you response.

From your answer:
- select fields from hg19.knownCanonical: chrom, chromStart, chromEnd,
- transcript select fields from hg19.kgXref: geneSymbol

Should be:
- select fields from hg19.knownCanonical: chrom, chromStart, chromEnd, transcript
- select fields from hg19.kgXref: geneSymbol

Once I figured that out, it worked perfectly. Thank you!

ADD REPLY
8
Entering edit mode
5.2 years ago
kdauria ▴ 80

To get directly from ensembl database...

mysql -h ensembldb.ensembl.org -u anonymous -P 3306 homo_sapiens_core_98_38

Then

SELECT DISTINCT
    seq_region.name AS chrom,
    exon.seq_region_start AS start,
    exon.seq_region_end AS end,
    gene_names.display_label AS gene_name,
    exon_transcript.rank AS exon_number,
    exon.seq_region_strand AS strand
FROM gene
JOIN transcript ON transcript.transcript_id = gene.canonical_transcript_id
JOIN exon_transcript ON exon_transcript.transcript_id = transcript.transcript_id
JOIN exon ON exon_transcript.exon_id = exon.exon_id
JOIN seq_region ON seq_region.seq_region_id = gene.seq_region_id
LEFT JOIN (
    SELECT xref_id, display_label
    FROM xref
    WHERE xref.external_db_id = 1100
) AS gene_names ON gene_names.xref_id = gene.display_xref_id
WHERE gene_names.display_label = 'TP53' AND
      LEFT(gene.stable_id, 4) <> 'LRG_'
ORDER BY seq_region.name, exon.seq_region_start;

Output for this example query:

+-------+---------+---------+-----------+-------------+--------+
| chrom | start   | end     | gene_name | exon_number | strand |
+-------+---------+---------+-----------+-------------+--------+
| 17    | 7668402 | 7669690 | TP53      |          11 |     -1 |
| 17    | 7670609 | 7670715 | TP53      |          10 |     -1 |
| 17    | 7673535 | 7673608 | TP53      |           9 |     -1 |
| 17    | 7673701 | 7673837 | TP53      |           8 |     -1 |
| 17    | 7674181 | 7674290 | TP53      |           7 |     -1 |
| 17    | 7674859 | 7674971 | TP53      |           6 |     -1 |
| 17    | 7675053 | 7675236 | TP53      |           5 |     -1 |
| 17    | 7675994 | 7676272 | TP53      |           4 |     -1 |
| 17    | 7676382 | 7676403 | TP53      |           3 |     -1 |
| 17    | 7676521 | 7676622 | TP53      |           2 |     -1 |
| 17    | 7687377 | 7687538 | TP53      |           1 |     -1 |
+-------+---------+---------+-----------+-------------+--------+
11 rows in set (6.66 sec)
ADD COMMENT
0
Entering edit mode

Hello, it's been four years, I hope you're still around. I copy-pasted your SQL code and removed "gene_names.display_label = 'TP53' AND". It's been a long time since I did SQL but I think this is the minimal edit to output the whole shebang, so to speak. Is this correct? The output looks alright although some percentage of lines have "NULL" in their gene_name column.

I also changed 98 to 109, because that's a newer Ensembl version.

ADD REPLY
0
Entering edit mode

The world has changed since then. MANE exists now. Why not just use the Gencode GFF file and grep for the "MANE" tag? Not sure if Mysql is the easiest solution here.

Or download the plain MANE table from UCSC, it's in genePred format, so you can run the tools genePredToBed and BedToExons to get the exon positions.

ADD REPLY
0
Entering edit mode

Hey Max, thanks for helping. I'm vaguely familiar with MANE. The MANE table at UCSC contains enough information (assuming "block size" means exon length) to answer the question but it'll take some coding to get it into a nice format for my purposes. I suppose it's worth it. I don't know what those two programs you mention are. Do you mean e.g. https://github.com/ENCODE-DCC/kentUtils/blob/master/src/utils/genePredToBed/genePredToBed ? They seem very old...

ADD REPLY
0
Entering edit mode

Those programs do the format conversions mentioned in comment above. You should download them from the official repo here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ (linux link). macOS version also available at the same exe location.

ADD REPLY
0
Entering edit mode

Thanks GenoMax! I tried running BigBedToBed and then BedToExons but it complains about the ENST column. I'll just do it manually with a Python script...

ADD REPLY
1
Entering edit mode

I don't know what exactly you need (sequence or exon/intron info) but you can find some of the relevant downloads here: https://www.ncbi.nlm.nih.gov/refseq/MANE/#accessing-mane-select-data

ADD REPLY
0
Entering edit mode

Oh wow, that's actually cool. I can use the gff file. It's a newer version (1.2) than UCSC has among their tables (1.0). I'll still need some scripting to get the format right but, worth it. Thanks GenoMax!

ADD REPLY
3
Entering edit mode
10.9 years ago

Perhaps apply a BEDOPS bedmap map operation, mapping BioMart-sourced exons to UCSC-sourced canonical transcripts. You could do something like:

$ bedmap --echo --echo-map-id-uniq ucsc_canonical.sorted.bed biomart_exons.sorted.bed > answer.bed

The --echo operator prints each canonical element. The --echo-map-id-uniq operator prints the unique ID name(s) for exons whose genomic ranges span across the canonical element's genomic range.

If you want all the exon information, ranges and all, use --echo-map in place of --echo-map-id-uniq.

The default overlap criteria is one or more bases between map (exon) and reference (canonical) element - you can adjust this stringency up to complete overlap.

It looks like you'd need to pre-process the BioMart data into UCSC BED format (renaming chromosomes, at least). Prepending the chromosome index with the string chr can be done with a simple awk statement:

$ awk '{print "chr"$0}' biomart_stuff.bed > biomart_stuff_with_UCSC_chromosome_names.bed

If necessary, also sort the BED data:

$ sort-bed ucsc_canonical.bed > ucsc_canonical.sorted.bed
$ sort-bed biomart_exons.bed > biomart_exons.sorted.bed

Do this once, if your data are of uncertain sort order. This lets you take advantage of BEDOPS speed and memory enhancements. You only need to sort input sets once.

ADD COMMENT
0
Entering edit mode

Thanks a lot Alex for the idea, I am not familiar with BEDOPS, but I will try, it looks interesting!

ADD REPLY
0
Entering edit mode

Let us know if you have any questions!

ADD REPLY
0
Entering edit mode

Finally I solved it using different approach, but for another task I need some tool which can work with the overlap criteria, so I am going to try the BEDOPS now! Thanks a lot!

ADD REPLY

Login before adding your answer.

Traffic: 1526 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6