This is kind of a coding strategy question.
Suppose a given gene has two isoforms built from 3 exons. Isoform A is exon1-exon2-exon3, while isoform B is exon1-exon3. Exon2 is the one I want to pick out, as the "internal exon".
I have downloaded all the exon data from the UCSC Genome Browser, UCSC Genes track (fields selected from the primary and related tables), and I want to pick out every such "internal exon".
The input looks something like this:
#isoform_name chr strand ex_start ex_end gene_name
isoformA chr1 + 10,30, 15,35 geneM
isoformB chr1 + 10,20,30, 15,25,35 geneM
isoformC chr1 + 40,50, 45,55 geneM
Here the exon [20-25], present in isoformB but absent from isoformA, is the internal exon.
The key is dealing with the two strings, ex_start and ex_end. Can anyone give some hints on how to handle this efficiently?
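One possible way to handle the two strings, as a minimal sketch only: split each on commas, zip the starts and ends into (start, end) pairs, and then, per gene, report every exon that falls entirely inside an intron of some isoform of that gene. This assumes that "internal exon" means exactly that (an exon skipped by at least one other isoform), and that the columns are whitespace-separated as in the sample above. The function names `parse_coords` and `skipped_exons` are made up for illustration.

```python
from collections import defaultdict


def parse_coords(field):
    """Turn a UCSC-style string such as '10,20,30,' into [10, 20, 30]."""
    return [int(x) for x in field.split(",") if x]


def skipped_exons(lines):
    """Return sorted (gene, chrom, strand, start, end) tuples for exons that
    lie entirely inside an intron of some isoform of the same gene.

    Assumption: this containment test is the definition of "internal exon".
    """
    exons_by_gene = defaultdict(set)
    introns_by_gene = defaultdict(set)

    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        _name, chrom, strand, starts, ends, gene = line.split()
        exons = sorted(zip(parse_coords(starts), parse_coords(ends)))
        key = (gene, chrom, strand)
        exons_by_gene[key].update(exons)
        # An intron runs from the end of one exon to the start of the next.
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
            introns_by_gene[key].add((prev_end, next_start))

    hits = set()
    for key, exons in exons_by_gene.items():
        for start, end in exons:
            if any(i_start <= start and end <= i_end
                   for i_start, i_end in introns_by_gene[key]):
                hits.add(key + (start, end))
    return sorted(hits)


sample = [
    "#isoform_name chr strand ex_start ex_end gene_name",
    "isoformA chr1 + 10,30, 15,35 geneM",
    "isoformB chr1 + 10,20,30, 15,25,35 geneM",
    "isoformC chr1 + 40,50, 45,55 geneM",
]
print(skipped_exons(sample))  # [('geneM', 'chr1', '+', 20, 25)]
```

The exon-versus-intron comparison is quadratic within a gene, which is usually fine because a single gene has few exons. If the flanking exons must also match exactly between isoforms, the containment test would need to be tightened accordingly.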
P.S. I know that HEXEvent and BioMart can provide such data sets, but I am curious how to do it with local code. Thanks a lot!
Please, why are there more exstart and exend values provided?
isoformC has a missing comma in ex_start
Missing comma added, as @JC mentioned.