Perl script: generate pseudo-CDSs from intergenic region
2
0
Entering edit mode
10.0 years ago
biolab ★ 1.4k

Hi everyone

Recently I read a paper (Genetica (2009) 137:159-164), in which the authors generate pseudo-CDSs (in other words, negative dataset) based on intergenic sequences. They used an in-house script to randomly exact sequences from intergenic region. These pseudo-CDSs have the same number of sequences and a similar length distribution as genuine CDSs.

I am wondering if someone can share this kind of perl script with me. I appreciate your kind helps. Any comments (e.g. available tools) will be also helpful. Thank you very much!

perl • 4.3k views
ADD COMMENT
3
Entering edit mode
10.0 years ago
PoGibas 5.1k

I use bedtools shuffle for this, input:

  • CDS.bed (bedtools shuffle will reposition each feature in the input BED file on a random chromosome at a random position. The size and strand of each feature are preserved).
  • Genome (chromosome sizes).
  • Regions to exclude (in your case this will be protein coding gene coordinates (as you want negative dataset). Also see my question: Genomic Regions To Exclude Before Shuffling Intervals).

I would also recommend to use -chrom option.

ADD COMMENT
0
Entering edit mode

Hi Pgibas, your answer is really helpful. Thanks a lot!

ADD REPLY
2
Entering edit mode
9.8 years ago
microbe77 ▴ 30

Hi there,

I wrote a script in Perl (hereunder). It's very simple. You need to copy and past it in a text file (use gedit or vim to make the file, gedit is very simple) and name the file like gbk2intergenic.pl

#AUTHOR: MAHMOUD AL-BASSAM 02/25/2015 UCSD
#Intergenic sequence extraction script from any bacterial GenBank file
#The script takes into account tRNA and rRNA genes
#Run script from Termial as follows:
#perl gbk2intergenic.pl  <Your .gbk file> <>file_name.fasta> (don't put "<>")
#THE FIRST and LAST GENES DON'T HAVE INTERGENIC SEQUENCE!

use strict;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in = Bio::SeqIO->new(-file=>"$file", -format=>"GenBank");
my $obj = $in->next_seq();

my @features = $obj->get_SeqFeatures();
my @two = shift @features;
        foreach my $fefe  (@features){
        my $pt = $fefe->primary_tag();
                if ($pt eq "CDS" or $pt eq "rRNA" or $pt eq "tRNA") {
                push @two, $fefe;
                my $endi = $two[0]->end();
                my $starti = $two[1]->start();
                my $end = $endi +1;
                my $start = $starti -1;
                my $subseq = $obj->subseq($end,$start) unless ($end>=$start);
                my $strand1 = $two[0]->strand();
                my $strand2 = $two[1]->strand();
                my $dir1= $strand1 == "1" ?   "for" :   "rev"; #ternary condition
                my $dir2=  $strand2 == "1" ?   "for" :   "rev";
                my ($locus1) = $two[0]->get_tag_values("locus_tag") if ($two[0]->primary_tag() eq "CDS"
                or $two[0]->primary_tag() eq "rRNA" or $two[0]->primary_tag() eq "tRNA");
                my ($locus2) = $two[1]->get_tag_values("locus_tag") if ($two[1]->primary_tag() eq "CDS"
                or $two[1]->primary_tag() eq "rRNA" or $two[1]->primary_tag() eq "tRNA");
                print ">$dir1$locus1$dir2$locus2\n$subseq\n"  unless ($end>=$start);
                shift @two;
                }
        }

the output should be like this

>forCLJU_c00180forCLJU_c00190
TATAATAATTATATTTAAACGTAGGGCAT
>forCLJU_c00190forCLJU_c00200
TTATACTTTTAATAATGGCTTAAATATAGTTACTTAAGGAAGTTATTCATTAAAATGGTTACTTCTTTTTTTATTTACTAGGATAGGTATATAAATTTTAACTTGATATAGTAAATAGCCTTGATTTT
AATTGGTATTTGTGTTAATATAACACTGTTAATTT
>forCLJU_c00200revCLJU_c00210
TAATCCATTCCAAACACTGACTTCCAGTGTTTTTTATTTTATCTAAAATCAGATAATTGTCCCCATTTTGTCCCCAAAGTTTTCAATGGGGCATT
>revCLJU_c00210revCLJU_c00220

for means the gene is on the forward strand

rev reverse strand

Good luck!

ADD COMMENT
1
Entering edit mode

I know that this was written long ago, but I was using part of your script and I believe there are some errors:

  1. It is considering the full genome as an intergene. A genome with size 1..3600 and some genes in between (one of which starts at position 300 for instance) is returning an intergene going from 381..3600, even though there are multiple genes between 381..3600.
  2. Because $end is always > $start, $subseq always returns undefined.
ADD REPLY
0
Entering edit mode

The error number 1 that I point out is related with the $two[0]->primary_tag() being "source" in the first feature search.

ADD REPLY
0
Entering edit mode

Thank you, Micro77 and r.b. for your inputs! This script is useful. I will read and use it with carefulness.

ADD REPLY
0
Entering edit mode

I think because you took part of the code, you run out into errors. Basically, $end is the end of the upstream gene and $start is the start of the downstream gene, so $start is always > $end (these values from different adjacent genes). I hope that answers your query.

ADD REPLY

Login before adding your answer.

Traffic: 2308 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6