Remove overlap between Exon and UTR boundaries
2
0
Entering edit mode
5.2 years ago

I want to remove the overlapped region between exonic and UTR (5UTR and 3UTR) regions to keep the only exonic region.

Trial version of Input dataset is:

Chr Strand  Exon_ID Exon_start  Exon_end    5UTR_start  5UTR_end    3UTR_start  3UTR_end
1   1   AT1G01010.1.exon1   3631    3913    3631    3759    0   0
1   1   AT1G01010.1.exon2   3996    4276    0   0   0   0
1   1   AT1G01010.1.exon6   5439    5899    0   0   5631    5899
1   -1  AT1G01020.1.exon1   8571    9130    8667    9130    0   0
**1 -1  AT1G01060.7.exon8   33662   34327   0   0   33662   33991
1   -1  AT1G01030.1.exon2   11649   12940   12941   13173   11649   11863**

For Instance, for exon (AT1G01010.1.exon1), the Exon strat (3631) and 5UTR start (3631) both starts from the same position (3631), 5UTR region ends at 3759 while exon ends at 3913 since there is an overlap of 128 base pairs (3759-3631) so to keep only exonic region I want to change exonic start as 3760 and exonic end will remain same. But for exon (AT1G01020.1.exon1), the Exon strat (8571) and 5UTR start (8667) both ends at the same position (9130), but there is an overlap of 463 base pairs (9130-8667) so to keep only exonic region I want to change exonic end as 8666 and exonic start will remain same.

Final output should be like this:

Chr Strand  Exon_ID Exon_start  Exon_end    5UTR_start  5UTR_end    3UTR_start  3UTR_end
1   1   AT1G01010.1.exon1   3760    3913    3631    3759    0   0
1   1   AT1G01010.1.exon2   3996    4276    0   0   0   0
1   1   AT1G01010.1.exon6   5439    5630    0   0   5631    5899
1   -1  AT1G01020.1.exon1   8571    8666    8667    9130    0   0
**1 -1  AT1G01060.7.exon8   33992   34327   0   0   33662   33991
1   -1  AT1G01030.1.exon2   11864   12940   12941   13173   11649   11863**

I have tried a few awk commands but wasn't able to get the desired output, Any help will be highly appreciated.

RNA-Seq exon UTR • 2.1k views
ADD COMMENT
4
Entering edit mode

UTRs are exonic. If you remove UTRs from the exonic regions, the regions you are left with are not exons. If you are doing this for standard DE analysis of RNAseq, it would be very bad practice to remove the UTRs, as UTRs make up on average 30% of a transcript, and often more than 50% of a transcript, so by removing them you are removing 30-50% of your data.

ADD REPLY
4
Entering edit mode
5.2 years ago
JC 13k

Actually your modifications are better to have in a script, here is my solution in Perl:

#!/usr/bin/perl

use strict;
use warnings;

while (<>) {
    if (/^Chr/) {
        print;
        next;
    }
    chomp;
    my ($chr,$dir,$exid,$eini,$eend,$uini5,$uend5,$uini3,$uend3) = split (/\s+/, $_);
    if ($dir == 1) {
        if ($uend5 > 0) {
            print join "\t", $chr,$dir,$exid,$uend5+1,$eend,$uini5,$uend5,$uini3,$uend3;
        } elsif ($uini3 > 0) {
            print join "\t", $chr,$dir,$exid,$eini,$uini3-1,$uini5,$uend5,$uini3,$uend3;
        } else {
            print;
        }
    } else { # $dir == -1
        if ($uini5 > 0) {
             print join "\t", $chr,$dir,$exid,$eini,$uini5-1,$uini5,$uend5,$uini3,$uend3;
        } elsif ($uend5 > 0) {
             print join "\t", $chr,$dir,$exid,$uend3+1,$eend,$uini5,$uend5,$uini3,$uend3;
        } else {
             print;
        }
    }
    print "\n";
}

Testing it:

$ perl exon.pl < coords.txt
Chr Strand  Exon_ID Exon_start  Exon_end    5UTR_start  5UTR_end    3UTR_start  3UTR_end
1   1   AT1G01010.1.exon1   3760    3913    3631    3759    0       0
1   1   AT1G01010.1.exon2   3996    4276    0   0   0   0
1   1   AT1G01010.1.exon3   4486    4605    0   0   0   0
1   1   AT1G01010.1.exon4   4706    5095    0   0   0   0
1   1   AT1G01010.1.exon5   5174    5326    0   0   0   0
1   1   AT1G01010.1.exon6   5439    5630    0       0       5631    5899
1  -1   AT1G01020.1.exon1   8571    8666    8667    9130    0       0

And as a comment, Exon includes UTRs and CDS, so I guess what you want is to define your CDS and UTRs.

ADD COMMENT
0
Entering edit mode

@JC Many thanks for your help, the above solution worked very value for the trail dataset which I have provided but in the actual dataset I faced a problem where we need to change boundaries of CDS region with respect to UTR regdardless of strand specificness, I have modified the dataset and make it bold so that you can see it, I have also attached file containing values that don't change, I wasn't able to modify your script as I don't have much knowledge about perl, so I will highly appreciate if you please modify it.

ADD REPLY
1
Entering edit mode

I believe that error is because in line 25 it should be elsif ($uend3 > 0)

ADD REPLY
0
Entering edit mode

Sorry It was my mistake, i have modified the script but was running the old script, yes its a real data and now the only thing is to address this problem: 1 -1 AT1G01030.1.exon2 11649 12940 12941 13173 11649 11863

ADD REPLY
0
Entering edit mode

is this real data? this line:

1   -1  AT1G01030.1.exon2   11649   12940   12941   13173   11649   11863

means Exon #2 contains a 5' UTR and a 3' UTR which is unlikely

ADD REPLY
0
Entering edit mode

Just means its a one exon gene. Over 1000 genes in the human genome have only one exon, but the number is bigger in Arabidopsis, where splicing is simpler (or just less well annotated).

ADD REPLY
0
Entering edit mode

There are 5393 cases where we have both 3 and 5 UTR regions, please see the file

ADD REPLY
2
Entering edit mode
5.2 years ago
JC 13k

Under the new considerations (5'UTR and 3'UTR in exon) the script needs to be:

#!/usr/bin/perl

use strict;
use warnings;

while (<>) {
    if (/^Chr/) {
        print;
        next;
    }
    chomp;
    my ($chr,$dir,$exid,$eini,$eend,$uini5,$uend5,$uini3,$uend3) = split (/\s+/, $_);
    if ($dir == 1) {
        if ($uend5 > 0) {
            $eini = $uend5 + 1;
        }
        if ($uini3 > 0) {
            $eend = $uini3 - 1;
        }
    } else { # $dir == -1
        if ($uini5 > 0) {
             $eend = $uini5 - 1;
        }
        if ($uend3 > 0) {
             $eini = $uend3 + 1;
        }
    }
    print join "\t", $chr, $dir, $exid, $eini, $eend, $uini5, $uend5, $uini3, $uend3;
    print "\n";
}

test:

$ perl exon2.pl < coords.txt
Chr     Strand  Exon_ID Exon_start      Exon_end        5UTR_start      5UTR_end        3UTR_start      3UTR_end
1       1       AT1G01010.1.exon1       3760    3913    3631    3759    0       0
1       1       AT1G01010.1.exon2       3996    4276    0       0       0       0
1       1       AT1G01010.1.exon6       5439    5630    0       0       5631    5899
1       -1      AT1G01020.1.exon1       8571    8666    8667    9130    0       0
1       -1      AT1G01060.7.exon8       33992   34327   0       0       33662   33991
1       -1      AT1G01030.1.exon2       11864   12940   12941   13173   11649   11863

Please test it with your full dataset, hope this works for you.

ADD COMMENT

Login before adding your answer.

Traffic: 2469 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6