I want to remove the regions where exons overlap UTRs (5UTR and 3UTR), so that only the non-UTR part of each exon remains.
A sample of the input dataset:
Chr Strand Exon_ID Exon_start Exon_end 5UTR_start 5UTR_end 3UTR_start 3UTR_end
1 1 AT1G01010.1.exon1 3631 3913 3631 3759 0 0
1 1 AT1G01010.1.exon2 3996 4276 0 0 0 0
1 1 AT1G01010.1.exon6 5439 5899 0 0 5631 5899
1 -1 AT1G01020.1.exon1 8571 9130 8667 9130 0 0
1 -1 AT1G01060.7.exon8 33662 34327 0 0 33662 33991
1 -1 AT1G01030.1.exon2 11649 12940 12941 13173 11649 11863
For instance, for exon AT1G01010.1.exon1, the exon start (3631) and the 5UTR start (3631) are at the same position. The 5UTR ends at 3759 while the exon ends at 3913, so there is an overlap of 128 base pairs (3759-3631). To keep only the exonic region, I want to change the exon start to 3760 and leave the exon end unchanged. For exon AT1G01020.1.exon1, the exon starts at 8571 and the 5UTR starts at 8667, and both end at the same position (9130), so there is an overlap of 463 base pairs (9130-8667). To keep only the exonic region, I want to change the exon end to 8666 and leave the exon start unchanged.
The final output should look like this:
Chr Strand Exon_ID Exon_start Exon_end 5UTR_start 5UTR_end 3UTR_start 3UTR_end
1 1 AT1G01010.1.exon1 3760 3913 3631 3759 0 0
1 1 AT1G01010.1.exon2 3996 4276 0 0 0 0
1 1 AT1G01010.1.exon6 5439 5630 0 0 5631 5899
1 -1 AT1G01020.1.exon1 8571 8666 8667 9130 0 0
1 -1 AT1G01060.7.exon8 33992 34327 0 0 33662 33991
1 -1 AT1G01030.1.exon2 11864 12940 12941 13173 11649 11863
I have tried a few awk commands but wasn't able to get the desired output. Any help will be highly appreciated.
UTRs are exonic. If you remove UTRs from the exonic regions, the regions you are left with are not exons. If you are doing this for a standard differential expression (DE) analysis of RNA-seq data, removing the UTRs would be very bad practice: UTRs make up on average 30% of a transcript, and often more than 50%, so by removing them you are removing 30-50% of your data.
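If the trimmed coordinates are still needed for some other purpose, the rule described in the question could be sketched roughly as below. This is in Python rather than awk, and it is only a sketch under some assumptions: coordinates are 1-based and inclusive, a UTR given as `0 0` means "no UTR", the columns are whitespace-separated in the order shown in the question, and a UTR only ever overlaps one edge of an exon. The names `trim_exon`, `trim_table` and `SAMPLE` are made up for this example.

```python
def trim_exon(exon_start, exon_end, utrs):
    """Return (start, end) of the exon after cutting away overlapping UTRs.

    Assumes 1-based inclusive coordinates; a UTR of (0, 0) is treated as absent.
    If an exon lies entirely inside a UTR, the result has start > end, which
    the caller would need to handle (not covered by the sample data).
    """
    for utr_start, utr_end in utrs:
        if utr_start == 0 and utr_end == 0:
            continue  # no UTR annotated for this exon
        if utr_start <= exon_start <= utr_end:
            # UTR covers the left edge of the exon: move the start past it
            exon_start = utr_end + 1
        elif utr_start <= exon_end <= utr_end:
            # UTR covers the right edge of the exon: move the end before it
            exon_end = utr_start - 1
    return exon_start, exon_end


def trim_table(text):
    """Apply trim_exon to every data row of a whitespace-separated table."""
    lines = text.strip().splitlines()
    out = [lines[0]]  # keep the header line unchanged
    for line in lines[1:]:
        f = line.split()
        utrs = [(int(f[5]), int(f[6])), (int(f[7]), int(f[8]))]
        start, end = trim_exon(int(f[3]), int(f[4]), utrs)
        f[3], f[4] = str(start), str(end)
        out.append(" ".join(f))
    return "\n".join(out)


# Sample rows copied from the question
SAMPLE = """\
Chr Strand Exon_ID Exon_start Exon_end 5UTR_start 5UTR_end 3UTR_start 3UTR_end
1 1 AT1G01010.1.exon1 3631 3913 3631 3759 0 0
1 1 AT1G01010.1.exon2 3996 4276 0 0 0 0
1 1 AT1G01010.1.exon6 5439 5899 0 0 5631 5899
1 -1 AT1G01020.1.exon1 8571 9130 8667 9130 0 0
1 -1 AT1G01060.7.exon8 33662 34327 0 0 33662 33991
1 -1 AT1G01030.1.exon2 11649 12940 12941 13173 11649 11863
"""

print(trim_table(SAMPLE))
```

The strand column is not used here, because the start and end columns in the sample are always given in ascending genomic order regardless of strand. If your real file deviates from that, the comparisons would need adjusting.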