Hi everyone.
I downloaded from UCSC a BED file containing all RefSeq exons. However, UCSC numbers them in genomic order, so a transcript on the reverse strand has its last exon numbered with index 0 instead of the last index.
Can anyone point me to how I can obtain a file where exons are numbered according to their position on the transcript?
As an example, SF3B1 is represented in the UCSC BED file like:
chr2 198256697 198257185 NM_012433_exon_0_0_chr2_198256698_r 0 -
chr2 198257695 198257912 NM_012433_exon_1_0_chr2_198257696_r 0 -
chr2 198260779 198261052 NM_012433_exon_2_0_chr2_198260780_r 0 -
chr2 198262708 198262840 NM_012433_exon_3_0_chr2_198262709_r 0 -
chr2 198263184 198263305 NM_012433_exon_4_0_chr2_198263185_r 0 -
chr2 198264778 198264890 NM_012433_exon_5_0_chr2_198264779_r 0 -
chr2 198264975 198265158 NM_012433_exon_6_0_chr2_198264976_r 0 -
chr2 198265438 198265660 NM_012433_exon_7_0_chr2_198265439_r 0 -
chr2 198266123 198266249 NM_012433_exon_8_0_chr2_198266124_r 0 -
chr2 198266465 198266612 NM_012433_exon_9_0_chr2_198266466_r 0 -
chr2 198266708 198266854 NM_012433_exon_10_0_chr2_198266709_r 0 -
chr2 198267279 198267550 NM_012433_exon_11_0_chr2_198267280_r 0 -
chr2 198267672 198267759 NM_012433_exon_12_0_chr2_198267673_r 0 -
chr2 198268308 198268488 NM_012433_exon_13_0_chr2_198268309_r 0 -
chr2 198269799 198269901 NM_012433_exon_14_0_chr2_198269800_r 0 -
chr2 198269998 198270196 NM_012433_exon_15_0_chr2_198269999_r 0 -
chr2 198272721 198272843 NM_012433_exon_16_0_chr2_198272722_r 0 -
chr2 198273092 198273305 NM_012433_exon_17_0_chr2_198273093_r 0 -
chr2 198274493 198274731 NM_012433_exon_18_0_chr2_198274494_r 0 -
chr2 198281464 198281635 NM_012433_exon_19_0_chr2_198281465_r 0 -
chr2 198283232 198283312 NM_012433_exon_20_0_chr2_198283233_r 0 -
chr2 198285151 198285266 NM_012433_exon_21_0_chr2_198285152_r 0 -
chr2 198285752 198285857 NM_012433_exon_22_0_chr2_198285753_r 0 -
chr2 198288531 198288698 NM_012433_exon_23_0_chr2_198288532_r 0 -
chr2 198299695 198299771 NM_012433_exon_24_0_chr2_198299696_r 0 -
What I need is something like:
chr2 198256697 198257185 NM_012433_exon_25 0 -
chr2 198257695 198257912 NM_012433_exon_24 0 -
chr2 198260779 198261052 NM_012433_exon_23 0 -
chr2 198262708 198262840 NM_012433_exon_22 0 -
chr2 198263184 198263305 NM_012433_exon_21 0 -
chr2 198264778 198264890 NM_012433_exon_20 0 -
chr2 198264975 198265158 NM_012433_exon_19 0 -
chr2 198265438 198265660 NM_012433_exon_18 0 -
chr2 198266123 198266249 NM_012433_exon_17 0 -
chr2 198266465 198266612 NM_012433_exon_16 0 -
chr2 198266708 198266854 NM_012433_exon_15 0 -
chr2 198267279 198267550 NM_012433_exon_14 0 -
chr2 198267672 198267759 NM_012433_exon_13 0 -
chr2 198268308 198268488 NM_012433_exon_12 0 -
chr2 198269799 198269901 NM_012433_exon_11 0 -
chr2 198269998 198270196 NM_012433_exon_10 0 -
chr2 198272721 198272843 NM_012433_exon_9 0 -
chr2 198273092 198273305 NM_012433_exon_8 0 -
chr2 198274493 198274731 NM_012433_exon_7 0 -
chr2 198281464 198281635 NM_012433_exon_6 0 -
chr2 198283232 198283312 NM_012433_exon_5 0 -
chr2 198285151 198285266 NM_012433_exon_4 0 -
chr2 198285752 198285857 NM_012433_exon_3 0 -
chr2 198288531 198288698 NM_012433_exon_2 0 -
chr2 198299695 198299771 NM_012433_exon_1 0 -
Can someone help? I know I could write a script to convert these, but I have trouble believing there isn't a source out there already containing the information in this format... However, my Google skills seem inadequate because I have failed to find one.
Many thanks
Duarte
Thanks Devon
I knew I could code it as well... I personally prefer to do it in awk and a couple of bash utils :) but I was wondering if there would be a ready-made resource where I could get this that I might not be aware of :)
It seems to me to be the logical way of numbering exons... genomic index order you can always obtain from the coordinates themselves :)
Many thanks
Duarte
PS... in case you want to know how I did it: assuming you have a BED file with the RefSeq exons like the ones I showed in the post, you can just run 2 awk commands:
For the reverse strand transcripts (exons on the reverse strand, so invert the indexing)
For the forward strand transcripts (exon order is the same as genomic order)
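For anyone who would rather not use awk, here is a minimal Python sketch of the same idea. It is not the awk commands referred to above, just one way the renumbering could be done. It assumes the name column looks like the examples in the question (transcript accession, `_exon_`, then a 0-based index in genomic order) and that the file contains every exon of each transcript. The function name `renumber_exons` is made up for illustration.

```python
import re
from collections import Counter

# Assumed name layout, taken from the examples in the question:
#   NM_012433_exon_0_0_chr2_198256698_r
# i.e. <transcript>_exon_<0-based genomic index>_...
NAME_RE = re.compile(r"^(?P<tx>.+?)_exon_(?P<idx>\d+)_")


def renumber_exons(lines):
    """Return BED lines with exons numbered 1..N in transcript order."""
    rows = []
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        match = NAME_RE.match(name)
        if match is None:
            raise ValueError(f"unexpected exon name: {name}")
        # Group by transcript, chromosome and strand so that a transcript
        # mapped to more than one locus is counted separately per locus.
        key = (match.group("tx"), chrom, strand)
        counts[key] += 1
        rows.append((chrom, start, end, key, int(match.group("idx")), score, strand))

    out = []
    for chrom, start, end, key, idx, score, strand in rows:
        total = counts[key]
        # Reverse strand: genomic index 0 is the last exon of the transcript.
        # Forward strand: genomic order already matches transcript order.
        number = total - idx if strand == "-" else idx + 1
        out.append("\t".join([chrom, start, end, f"{key[0]}_exon_{number}", score, strand]))
    return out
```

To use it on a file, something like `print("\n".join(renumber_exons(open("exons.bed"))))` should do, where `exons.bed` is a placeholder file name.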
You might try the Ensembl annotations instead. They tend to be much more coherently formatted.
Ensembl exons are not linked to any particular transcript, so they do not have order numbers.
Sure they do, though at least the mouse annotation has the same numbering annoyance.
Where can you get the order number? I have been using the Perl API and I cannot find any reference to the order number.
I believe that Ensembl does not link each exon to a specific transcript (because many transcripts share one or more exons), so it does not make sense to add an order number to each exon.
But if I am wrong, would you mind sharing with me how I can access that information? It would be very useful for me.
Just download the annotation GTF file rather than trying to use the Perl API. The annotation file itself will have multiple transcripts with the exons for each. If an exon is in multiple transcripts, it'll appear once for each.
BTW, nice awk solution.
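To illustrate the GTF route, here is a rough Python sketch that pulls the per-transcript exon number out of the attribute column. It assumes the exon lines carry `transcript_id` and `exon_number` attributes, which is how I understand Ensembl GTF files to be laid out, but check your own file first. The function name `exon_numbers` is made up for this example.

```python
import re

# Matches attribute pairs such as: transcript_id "T1";  or  exon_number 2;
ATTR_RE = re.compile(r'(\w+) "?([^";]*)"?;')


def exon_numbers(gtf_lines):
    """Collect (transcript_id, exon_number, chrom, start, end, strand) per exon line."""
    records = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are of interest here
        attrs = dict(ATTR_RE.findall(fields[8]))
        # Assumption: these two attribute names are present on exon lines.
        if "transcript_id" in attrs and "exon_number" in attrs:
            records.append((
                attrs["transcript_id"],
                int(attrs["exon_number"]),
                fields[0],
                int(fields[3]),  # GTF coordinates are 1-based and inclusive
                int(fields[4]),
                fields[6],
            ))
    return records
```

Because an exon shared by several transcripts appears once per transcript in the GTF, each record carries its own number for that transcript.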