A set operation approach would require splitting out exons per transcript, which would work but may involve many thousands of small disk operations, which cumulatively may be very slow.
One way to do this would be to first obtain the genes:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "gene"' \
| convert2bed -i gff - \
> genes.bed
Make a second file containing exons:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "exon"' \
| convert2bed -i gff - \
> exons.bed
Pull gene IDs from the genes.bed
file, preserving the order they are discovered in the genes file:
$ cut -f4 genes.bed > ordered_ids.txt
Use a script to walk through this ID list and the exons file in order, to write a set of introns to standard output:
#!/usr/bin/env perl
use strict;
use warnings;
my $id_fn = "ordered_ids.txt";
my $exons_fn = "exons.bed";
open my $id_fh, "<", $id_fn or die "could not open ID file\n";
open my $exons_fh, "<", $exons_fn or die "could not open exons file\n";
my $result;
my $query_id = <$id_fh>;
while (my $exon_ln = <$exons_fh>) {
chomp $exon_ln;
my @exon_elems = split("\t", $exon_ln);
my $exon_id = $exon_elems[3];
if ($exon_id =~ /$query_id/) {
# push another exon into result hash table
my $exon_chr = $exon_elems[0];
my $exon_start = $exon_elems[1];
my $exon_stop = $exon_elems[2];
$result->{$query_id}->{chr} = $exon_chr;
push @{$result->{$query_id}->{starts}), $exon_start;
push @{$result->{$query_id}->{stops}), $exon_stop;
}
else {
# write out introns from gaps between previous exon stop and current exon start
if (defined $result->{$query_id}->{chr}) {
my $result_chr = $result->{$query_id}->{chr};
my @result_starts = @{$result->{$query_id}->{starts};
my @result_stops = @{$result->{$query_id}->{stops};
my $result_exon_count = scalar @result_starts;
for (my $exon_idx = 1; $exon_idx < $result_exon_count - 1; $exon_idx++) {
my $result_start = $result_stops[$exon_idx - 1]; # previous stop
my $result_stop = $result_starts[$exon_idx]; # current start
print STDOUT "$result_chr\t$result_start\t$result_stop\t$query_id\n";
}
}
$query_id = <$id_fh>; # grab next ID to query
}
}
close $id_fh;
close $exons_fh;
I think this approach should be faster, because instead of making lots of tiny files to do set operations on, this steps through both files in order only once, printing out introns as their bounds are found.
While exons.bed
is in sort-bed
order, it isn't guaranteed that the output from this Perl script is in sort-bed
order. So just pipe it in:
$ make_introns_from_exons_and_ids.pl | sort-bed - > introns.bed
You can get intronic coordinates for specific transcript using bedtools subtract (subtract exons coordinates from transcript coordinates).