Hi.. I have a seperate chromosome sequences and i wanted to parse some regions of chromosome based on start site and end site.. how can i achieve this?
its just an example i gave.. its like this.. i have chromosome sequence of approx 100000bp long and wanted to extract gene sequence based on start site and end site.. the thing is i dont want just the gene sequence, i am looking for extra 100bp head and tail for gene sequence.. and thats the reason i wanted to parse chromosome sequence file
Your comment says that what you really want is to extract features with flanking sequence. Assuming that you have some annotation, you can use EMBOSS extractfeat to do this.
The answer from neilfws is OK, the problem is that when you work with actual chromosomes (veeeery long sequences), and you want to extract many substring, that approach takes FOREVER
you need Bio::DB::Fasta
This is adapted from a script I wrote time ago
use Bio::DB::Fasta;
use Bio::SeqIO;
use Bio::Seq;
my $chrName = 'chr1';
my $start = 10;
my $end = 30;
my $fullGenomeFastaFile = 'genome.fa';
# index (only the first time) and connect to the fasta file
my $db = Bio::DB::Fasta->new( $fullGenomeFastaFile );
# prepare printing device (STOUT)
my $out = Bio::SeqIO->newFh(-format => 'Fasta' );
# prepare new name
my $newId = join('_', $chrName, $start, $end);
# extract target sequence
my $subSeq = $db->seq($chrName, $start => $end);
# create object with target
my $seqobj = Bio::Seq->new( -display_id => $newId, -seq => $subSeq);
# print it out
print $out $seqobj;
I'm not going to offer any code - that's been done. I just want to say that it is important to keep in mind what you wish to do with the results from a parsing of a chromosome sequence. Many genomes are not 100% complete and are so because it is difficult to sequence to high accuracy the telomeres (ends) and the centromere of the chromosome. Not all chromosomes from all organisms have telomeres and centromeres. Nonetheless, when they are there and when they are incompletely sequenced, the chromosome sequence itself is a bit contrived or artificial and may contain blocks of Ns to designate undefined sequence. The length of those blocks is often not known and thus your coordinates may be inaccurate. This could be irrelevant if you simply want the parsed sequenced, but if you want such a partial sequence to link with other information, you could run into problems.
Just something to keep in mind. Few chromosome sequences are absolutely complete.
Interesting reminder but I guess this should not be a problem if you use the genome build on which the annotation you use has been determined/computed. You should just be sure it is the case indeed.
library(Biostrings)
# yourdata.fa can have one or mutiple chromosomes included.
x <- read.DNAStringSet("/full_path_to/yourdata.fa")
x # names and length for the sequences in your fasta file
names(x) # get sequence names
start.pos <- 1 # here can be a vector of values for start position
end.pos <- 100 # here can be a vector of values for the corresponding end position
# a is a vector of string for the sequences you want
a <- getSeq(x, 'names(x)[1]', start=start.pos, end=end.pos) # a is a vector of string for the sequences you want
writeFASTA(a, file="a.fa") # write the seq to disk
# you could need to read the manu for some functions
?readFASTA
?writeFASTA
?getSeq
?read.DNAStringSet
is there specific reason why 2 - 10 should give you AATTCCAAA and vice versa?
its just an example i gave.. its like this.. i have chromosome sequence of approx 100000bp long and wanted to extract gene sequence based on start site and end site.. the thing is i dont want just the gene sequence, i am looking for extra 100bp head and tail for gene sequence.. and thats the reason i wanted to parse chromosome sequence file
cross posted here http://seqanswers.com/forums/showthread.php?t=10776 read both as some answer might overlap.