Parsing A Chromosome Sequence With Start And End Sites
6
1
Entering edit mode
13.8 years ago
Empyrean ▴ 170

Hi.. I have a seperate chromosome sequences and i wanted to parse some regions of chromosome based on start site and end site.. how can i achieve this?

For Example Chr 1 is in following format

>chr1
GAATTCCAAAGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTCCTATCATTCTTTCTG
ATGTTGAAAATGATATTAAG

I need regions from 2 - 10 should give me AATTCCAAA

and i a similar way 15- 25 should give me AAGATTGCAT

How can i do it either in perl or bioperl or awk or any other way?

bioperl perl parsing sequence • 6.9k views
ADD COMMENT
0
Entering edit mode

is there specific reason why 2 - 10 should give you AATTCCAAA and vice versa?

ADD REPLY
0
Entering edit mode

its just an example i gave.. its like this.. i have chromosome sequence of approx 100000bp long and wanted to extract gene sequence based on start site and end site.. the thing is i dont want just the gene sequence, i am looking for extra 100bp head and tail for gene sequence.. and thats the reason i wanted to parse chromosome sequence file

ADD REPLY
0
Entering edit mode

cross posted here http://seqanswers.com/forums/showthread.php?t=10776 read both as some answer might overlap.

ADD REPLY
4
Entering edit mode
13.8 years ago

EMBOSS extractseq (and see this question)

keith@tao:~$ extractseq chr1.fa -filter -stdout -regions '2-10 15-25' -separate
>chr1_2_10
AATTCCAAA
>chr1_15_25
AAGATTGCATC

On the reverse strand, also use revseq:

keith@tao:~$ extractseq chr1.fa -filter -stdout -regions '2-10 15-25' -separate | revseq -filter

>chr1_2_10 Reversed:
TTTGGAATT
>chr1_15_25 Reversed:
GATGCAATCTT

Your comment says that what you really want is to extract features with flanking sequence. Assuming that you have some annotation, you can use EMBOSS extractfeat to do this.

ADD COMMENT
3
Entering edit mode
13.8 years ago
Neilfws 49k

Easy in Bioperl. This code assumes file is in fasta format and is named chr1.fa:

use strict;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "chr1.fa", -format => "fasta");
while(my $seq = $seqio->next_seq) {
  print $seq->subseq(2,10), "\n";
}

Take a look at the SeqIO how-to for more.

*Update: as Stefano points out, this will be slow for large chromosomes; see his answer for another solution.*

ADD COMMENT
0
Entering edit mode

This is so simple.. thank you so much.. i tried seqio but didnot know about the subseq function..

ADD REPLY
0
Entering edit mode

I am struck and how can i do this for negative strand given start and end sites?

ADD REPLY
1
Entering edit mode
13.8 years ago

Hi. Definetly use BioPerl.

The answer from neilfws is OK, the problem is that when you work with actual chromosomes (veeeery long sequences), and you want to extract many substring, that approach takes FOREVER

you need Bio::DB::Fasta This is adapted from a script I wrote time ago

use Bio::DB::Fasta;
use Bio::SeqIO;
use Bio::Seq;

my $chrName = 'chr1';
my $start = 10;
my $end = 30;
my $fullGenomeFastaFile = 'genome.fa';

# index (only the first time) and connect to the fasta file
my $db = Bio::DB::Fasta->new( $fullGenomeFastaFile );
# prepare printing device (STOUT)
my $out = Bio::SeqIO->newFh(-format => 'Fasta' );
# prepare new name
my $newId = join('_', $chrName, $start, $end);
# extract target sequence
my $subSeq = $db->seq($chrName, $start => $end);
# create object with target
my $seqobj = Bio::Seq->new( -display_id => $newId, -seq => $subSeq);
# print it out
print $out $seqobj;
ADD COMMENT
0
Entering edit mode

Thank you.. i am struck again.. i need to extract sequences for neg strand.. how can i do that??

ADD REPLY
1
Entering edit mode
13.8 years ago

I'm not going to offer any code - that's been done. I just want to say that it is important to keep in mind what you wish to do with the results from a parsing of a chromosome sequence. Many genomes are not 100% complete and are so because it is difficult to sequence to high accuracy the telomeres (ends) and the centromere of the chromosome. Not all chromosomes from all organisms have telomeres and centromeres. Nonetheless, when they are there and when they are incompletely sequenced, the chromosome sequence itself is a bit contrived or artificial and may contain blocks of Ns to designate undefined sequence. The length of those blocks is often not known and thus your coordinates may be inaccurate. This could be irrelevant if you simply want the parsed sequenced, but if you want such a partial sequence to link with other information, you could run into problems.

Just something to keep in mind. Few chromosome sequences are absolutely complete.

ADD COMMENT
0
Entering edit mode

Interesting reminder but I guess this should not be a problem if you use the genome build on which the annotation you use has been determined/computed. You should just be sure it is the case indeed.

ADD REPLY
0
Entering edit mode
13.8 years ago
Jianfengmao ▴ 320

It is also easy in R/bioconductor`(http://www.bioconductor.org/).

Here, the codes:

library(Biostrings)

# yourdata.fa can have one or mutiple chromosomes included.
x <- read.DNAStringSet("/full_path_to/yourdata.fa")

x # names and length for the sequences in your fasta file
names(x) # get sequence names
start.pos <- 1 # here can be a vector of values for start position
end.pos <- 100 # here can be a vector of values for the corresponding end position

# a is a vector of string for the sequences you want   
a <- getSeq(x, 'names(x)[1]', start=start.pos, end=end.pos) # a is a vector of string  for the sequences you want

writeFASTA(a, file="a.fa") # write the seq to disk

# you could need to read the manu for some functions
?readFASTA
?writeFASTA
?getSeq
?read.DNAStringSet
ADD COMMENT
0
Entering edit mode
13.8 years ago

with one command line:

 > grep -v '^>' < hg18/chr1.fa | tr -d "\n" | cut -c 2-10
 aaccctaac

of course, it only works if there's only one sequence in the fasta file and if you don't need have to extract too many substrings.

ADD COMMENT
0
Entering edit mode

This works great too.. thank you

ADD REPLY

Login before adding your answer.

Traffic: 3265 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6