Hi,
I have a dataset from which I need to remove those transcripts of a gene whose exons overlap with, or are shared with, exons of another transcript. The third column of my file holds the start coordinates of all exons of a transcript and the fourth column holds the exon end coordinates. In most cases there are multiple exons per transcript, separated by semicolons. For example:
ENSG00000004399 ENST00000512744 129275460;129277271;129275926;129274061 129275534;129277364;129276066;129275271
ENSG00000004399 ENST00000393239 129275926;129274018;129277271;129275460 129276066;129275271;129277364;129275534
ENSG00000004399 ENST00000505665 129302968;129302474 129303067;129302512
ENSG00000001167 ENST00000353205 41065150 41065689
ENSG00000001167 ENST00000341376 41065150 41067715
The output after removing redundancy should contain only those transcripts that have the longest, non-redundant exons:
ENSG00000004399 ENST00000393239 129275926;129274018;129277271;129275460 129276066;129275271;129277364;129275534
ENSG00000004399 ENST00000505665 129302968;129302474 129303067;129302512
ENSG00000001167 ENST00000341376 41065150 41067715
Can anyone suggest how to get this result? Thanks in advance.
Given that you already have an example, is this homework? What did you try so far?
What exactly are you trying to accomplish? Merging isoforms into a gene structure?
Micheal, I have already removed the single-exon transcripts that had common or overlapping exons. I am now trying to deal with this list, where transcripts have multiple exons, and thought I would ask for help here. This is not homework, by the way.
Hi DK, I am trying to remove those transcripts from my list whose exons are already present in another transcript, e.g. removing entries like ENST00000512744 and ENST00000353205, as the exons in these transcripts are already covered by other transcripts in the above list. This removes redundancy from my data.
In the example you posted, you removed the first transcript (512744) and kept the second transcript (393239). The first exon of the first transcript is: 129275460 - 129275534. The first exon of the second transcript is: 129275926 - 129276066. Why did you decide to remove the first one in that case? The first transcript exon has an extra ~500 bases upstream of the second transcript exon.
The exons are not sorted in ascending order. As you can see, 129275460 is also present in the second transcript (393239). I kept that one because its second exon starts at 129274018 and ends at 129275271, whereas in the first transcript (512744) the corresponding exon starts at 129274061 and ends at the same position, 129275271.
Ahh, I see. My mistake. So you just want to remove transcripts that are completely within another transcript.
Yes, I think I should have explained my question a bit better :)
Is there data on what gene and reference contig the transcript belongs to? What are the first and second columns of your data?
The first column is the gene ID and the second is the transcript ID. Each transcript ID belongs to the gene ID on the same line. In the above list, the first three transcripts belong to the same gene ID.
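Based on the clarification above (drop a transcript when it lies completely within another transcript of the same gene), here is a rough Python sketch of one possible approach, not a definitive solution. It assumes whitespace-separated columns as in the example, that the i-th start pairs with the i-th end, that all transcripts of a gene are on the same contig, and that "covered" means every exon of one transcript fits inside a single exon of the other. The function names (`parse_line`, `covered_by`, `remove_redundant`) are made up for illustration.

```python
def parse_line(line):
    """Split one record into gene id, transcript id and a list of (start, end) exons."""
    gene, transcript, starts, ends = line.split()
    # Assumption: the i-th start belongs with the i-th end; order is otherwise arbitrary.
    exons = list(zip(map(int, starts.split(";")), map(int, ends.split(";"))))
    return gene, transcript, exons


def covered_by(exons_a, exons_b):
    """True if every exon in exons_a lies within some single exon in exons_b."""
    return all(
        any(b_start <= a_start and a_end <= b_end for b_start, b_end in exons_b)
        for a_start, a_end in exons_a
    )


def remove_redundant(lines):
    """Return the input lines whose transcript is not covered by another
    transcript of the same gene. Input order is preserved."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            gene, transcript, exons = parse_line(line)
            records.append((gene, exons, line))

    kept = []
    for i, (gene_i, exons_i, line_i) in enumerate(records):
        redundant = False
        for j, (gene_j, exons_j, _) in enumerate(records):
            if i == j or gene_i != gene_j:
                continue
            if covered_by(exons_i, exons_j):
                # If the two transcripts cover each other (e.g. identical exons),
                # keep only the one that appears first in the file.
                if not covered_by(exons_j, exons_i) or j < i:
                    redundant = True
                    break
        if not redundant:
            kept.append(line_i)
    return kept


# Small demo using the two single-exon transcripts from the question.
demo = [
    "ENSG00000001167 ENST00000353205 41065150 41065689",
    "ENSG00000001167 ENST00000341376 41065150 41067715",
]
for kept_line in remove_redundant(demo):
    print(kept_line)  # prints only the ENST00000341376 line
```

This compares every pair of transcripts within a gene, which should be fine for typical gene sizes. If an exon in one transcript can be covered by two adjacent exons of another, you would need to merge overlapping intervals first; the sketch does not do that.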