Question

Multiple Exon Start and End sites in UCSC Exons table

9.8 years ago

dally ▴ 210

So I'm attempting to get some information for hg19 exons:

I need chr, start, end, strand, and gene name

I was able to do this via the UCSC Table Browser -> RefSeq -> refGene -> "selected fields from primary and related tables":

I chose the following fields:

chrom, strand, exonStarts, exonEnds, name2

This gives me exactly what I need, except that each row holds multiple comma-separated exonStarts and exonEnds, so I can't run it through the programs I typically use (bedtools, etc.).

I know that it's possible to separate out these start sites and end sites into separate rows using something like awk, but after spending a bit of time (Obtaining Exon Lengths:) trying to figure it out, I can't seem to do it.

Was hoping somebody could tell me what to do to separate out these multiple exon starts and ends into different rows and remove duplicate start and end sites (for instance the first two rows have similar exon start and end sites).

Thank you!

ucsc awk • 2.7k views

Ram · Accepted Answer · 2015-10-21

9.8 years ago

Vivek ★ 2.7k

You can export the data directly as BED from the Table Browser, which is more convenient for downstream analysis. You'll end up with the following format:

chr1 66999638 67000051 NM_032291_exon_0_0_chr1_66999639_f 0 +
chr1 67091529 67091593 NM_032291_exon_1_0_chr1_67091530_f 0 +
chr1 67098752 67098777 NM_032291_exon_2_0_chr1_67098753_f 0 +
chr1 67101626 67101698 NM_032291_exon_3_0_chr1_67101627_f 0 +
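If the gene name has to stay attached, the selected-fields output from the question can also be split into one row per exon. The following is only a minimal sketch, assuming tab-separated columns in the order chrom, strand, exonStarts, exonEnds, name2 and a header line starting with `#`. The function name `split_exons` and the gene name `GENE_A` are placeholders for illustration; the coordinates are taken from the BED lines above.

```python
def split_exons(lines):
    """Yield unique (chrom, start, end, name2, strand) tuples, one per exon."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the '#'-prefixed header line
        if not line or line.startswith("#"):
            continue
        chrom, strand, starts, ends, name2 = line.split("\t")[:5]
        # UCSC writes the lists with a trailing comma, so drop empty pieces
        starts = [s for s in starts.split(",") if s]
        ends = [e for e in ends.split(",") if e]
        for s, e in zip(starts, ends):
            exon = (chrom, int(s), int(e), name2, strand)
            # Different transcripts of one gene often share exons; keep each once
            if exon not in seen:
                seen.add(exon)
                yield exon


# Two made-up transcript rows that share their first two exons
rows = [
    "#chrom\tstrand\texonStarts\texonEnds\tname2",
    "chr1\t+\t66999638,67091529,\t67000051,67091593,\tGENE_A",
    "chr1\t+\t66999638,67091529,67098752,\t67000051,67091593,67098777,\tGENE_A",
]

# Print in BED6 column order: chrom, start, end, name, score, strand
for chrom, start, end, name2, strand in split_exons(rows):
    print(chrom, start, end, name2, 0, strand, sep="\t")
```

The exonStarts values in refGene are already 0-based and the exonEnds are end-exclusive, so they can go into BED columns 2 and 3 unchanged. To use this on a real download, pass an open file handle instead of the `rows` list.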

I've looked into this as well. The problem is that I need the gene name information, which is why I didn't consider it. Is it possible to annotate this BED file and then use that (after removing unnecessary columns)?

Basically, I'll be taking this BED file as a text file and using it to overlap against Pol II ChIP-seq peaks.

9.8 years ago by dally ▴ 210

I thought you could probably get that with GTF, but UCSC uses the transcript name for the gene name in its GTF output. The best I can think of is to annotate the BED file with a script that uses a lookup table/hash from transcript name to gene name and appends the gene name as the last column of the BED file.
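The lookup-table idea could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested pipeline: it assumes a two-column, tab-separated table of transcript name and gene name (for example the name and name2 fields of refGene), and it assumes the transcript ID is everything before `_exon_` in the BED name column, as in the example lines above. The function names and the gene name `GENE_A` are placeholders.

```python
def load_lookup(lines):
    """Build {transcript name: gene name} from 'name<TAB>name2' lines."""
    lookup = {}
    for line in lines:
        # Skip blank lines and the '#'-prefixed header line
        if not line.strip() or line.startswith("#"):
            continue
        transcript, gene = line.rstrip("\n").split("\t")[:2]
        lookup[transcript] = gene
    return lookup


def annotate_bed(bed_lines, lookup, missing="NA"):
    """Yield each BED line with the gene name appended as a last column."""
    for line in bed_lines:
        fields = line.split()  # BED fields contain no internal whitespace
        if len(fields) < 4:
            continue  # no name column to look up
        # e.g. NM_032291_exon_0_0_chr1_66999639_f -> NM_032291
        transcript = fields[3].split("_exon_")[0]
        fields.append(lookup.get(transcript, missing))
        yield "\t".join(fields)


# Placeholder lookup table and one BED line from the answer above
lookup = load_lookup(["#name\tname2", "NM_032291\tGENE_A"])
bed = ["chr1\t66999638\t67000051\tNM_032291_exon_0_0_chr1_66999639_f\t0\t+"]

for row in annotate_bed(bed, lookup):
    print(row)
```

Because several transcripts of one gene can share an exon, the annotated file may contain the same interval more than once, so a deduplication step on chrom, start, end, and gene name may still be needed before overlapping with peaks.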