So I'm attempting to get some information for hg19 exons:
I need chr, start, end, strand, and gene name
I was able to do this via the UCSC Table Browser -> RefSeq -> refGene -> selected fields from primary and related tables.
I chose the following fields:
chrom, strand, exonStarts, exonEnds, name2
This gives me exactly what I need, except that each row holds multiple exonStarts and exonEnds values, so I can't run it through the programs I typically use (bedtools etc.).
I know it's possible to split these start and end sites into separate rows using something like awk, but after spending a bit of time trying to figure it out (see "Obtaining Exon Lengths"), I can't seem to do it.
I was hoping somebody could tell me how to separate these multiple exon starts and ends into different rows and remove duplicate start and end sites (for instance, the first two rows have similar exon start and end sites).
Thank you!
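One way to do the splitting and de-duplication is sketched below in Python rather than awk. It assumes the Table Browser output is tab-separated with the columns in the order chrom, strand, exonStarts, exonEnds, name2, that header lines start with "#", and that the start and end lists are comma-separated with a trailing comma. The function name explode_exons and the sample rows are made up for illustration. UCSC exon starts are already 0-based, so they are written unchanged into BED-style columns.

```python
def explode_exons(lines):
    """Yield one unique (chrom, start, end, strand, gene) tuple per exon.

    Assumes tab-separated columns: chrom, strand, exonStarts, exonEnds, name2.
    """
    seen = set()
    for line in lines:
        # Skip blank lines and the "#chrom ..." header line.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, strand, starts, ends, gene = line.rstrip("\n").split("\t")
        # The lists usually end with a comma, so drop empty pieces.
        start_list = [s for s in starts.split(",") if s]
        end_list = [e for e in ends.split(",") if e]
        for start, end in zip(start_list, end_list):
            record = (chrom, int(start), int(end), strand, gene)
            # Keep only the first occurrence of each exon.
            if record not in seen:
                seen.add(record)
                yield record


# Made-up rows: two transcripts of the same gene sharing their first exon.
sample = [
    "#chrom\tstrand\texonStarts\texonEnds\tname2",
    "chr1\t+\t100,300,\t200,400,\tGENEA",
    "chr1\t+\t100,500,\t200,600,\tGENEA",
]

# Print in BED6 order (chrom, start, end, name, score, strand); score is a placeholder.
for chrom, start, end, strand, gene in explode_exons(sample):
    print(chrom, start, end, gene, 0, strand, sep="\t")
```

For a real file, an open file handle can be passed in place of the sample list. Note that duplicates are judged on the whole record, so the same coordinates under two different gene names would both be kept.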
I've looked into this as well. The problem is that I need the gene name information, which is why I didn't consider it. Is it possible to annotate this BED file and then perhaps use that (after removing unnecessary columns)?
Basically, I'll be taking this BED file as a text file and overlapping it against Pol II ChIP-seq peaks.
I thought you could probably get that with GTF, but UCSC uses the transcript name as the gene name in its GTF output. The best I can think of is to annotate the BED file with a script that uses a lookup table/hash from each transcript name to its gene name and appends the gene name as the last column of the BED file.
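A rough sketch of that lookup-table approach in Python follows. It assumes a two-column, tab-separated table of transcript name and gene name (for example the name and name2 fields of refGene), and that the fourth BED column starts with the transcript name, possibly followed by a suffix beginning with "_exon_". Both the function names and the sample lines are hypothetical, so adjust the parsing to whatever your BED file actually contains.

```python
def build_lookup(table_lines):
    """Map transcript name -> gene name from two tab-separated columns."""
    lookup = {}
    for line in table_lines:
        if not line.strip() or line.startswith("#"):
            continue
        transcript, gene = line.rstrip("\n").split("\t")[:2]
        lookup[transcript] = gene
    return lookup


def annotate_bed(bed_lines, lookup, missing="NA"):
    """Yield BED lines with the gene name appended as the last column."""
    for line in bed_lines:
        # Skip blanks, comments, and track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumption: column 4 is the transcript name, optionally with an
        # "_exon_..." suffix that is not part of the accession.
        transcript = fields[3].split("_exon_")[0]
        fields.append(lookup.get(transcript, missing))
        yield "\t".join(fields)


# Made-up example data.
table = ["#name\tname2", "NM_000001\tGENEA"]
bed = [
    "chr1\t100\t200\tNM_000001_exon_0_0_chr1_101_f\t0\t+",
    "chr1\t300\t400\tNM_999999\t0\t+",
]

for out_line in annotate_bed(bed, build_lookup(table)):
    print(out_line)
```

Transcripts missing from the table get "NA" here, so they are easy to spot and filter out before intersecting with the peaks.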
Gotcha. I think I can do that. Thanks.