Question

Parsing A Chromosome Sequence With Start And End Sites

1

Entering edit mode

14.3 years ago

Empyrean ▴ 170

Hi.. I have a seperate chromosome sequences and i wanted to parse some regions of chromosome based on start site and end site.. how can i achieve this?

For Example Chr 1 is in following format

>chr1
GAATTCCAAAGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTCCTATCATTCTTTCTG
ATGTTGAAAATGATATTAAG

I need regions from 2 - 10 should give me AATTCCAAA

and i a similar way 15- 25 should give me AAGATTGCAT

How can i do it either in perl or bioperl or awk or any other way?

bioperl perl parsing sequence • 7.7k views

ADD COMMENT • link updated 14.3 years ago by biobot 0.0.77.a.1099 6.2k • written 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

is there specific reason why 2 - 10 should give you AATTCCAAA and vice versa?

ADD REPLY • link 14.3 years ago by Thaman ★ 3.3k

0

Entering edit mode

its just an example i gave.. its like this.. i have chromosome sequence of approx 100000bp long and wanted to extract gene sequence based on start site and end site.. the thing is i dont want just the gene sequence, i am looking for extra 100bp head and tail for gene sequence.. and thats the reason i wanted to parse chromosome sequence file

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

cross posted here http://seqanswers.com/forums/showthread.php?t=10776 read both as some answer might overlap.

ADD REPLY • link 14.3 years ago by Stefano Berri 4.4k

Ram · Answer 1 · 2011-04-15

EMBOSS extractseq (and see this question)

keith@tao:~$ extractseq chr1.fa -filter -stdout -regions '2-10 15-25' -separate
>chr1_2_10
AATTCCAAA
>chr1_15_25
AAGATTGCATC

On the reverse strand, also use revseq:

keith@tao:~$ extractseq chr1.fa -filter -stdout -regions '2-10 15-25' -separate | revseq -filter

>chr1_2_10 Reversed:
TTTGGAATT
>chr1_15_25 Reversed:
GATGCAATCTT

Your comment says that what you really want is to extract features with flanking sequence. Assuming that you have some annotation, you can use EMBOSS extractfeat to do this.

score 3 · Answer 2 · 2011-04-15

3

Entering edit mode

14.3 years ago

Neilfws 49k

Easy in Bioperl. This code assumes file is in fasta format and is named chr1.fa:

use strict;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "chr1.fa", -format => "fasta");
while(my $seq = $seqio->next_seq) {
  print $seq->subseq(2,10), "\n";
}

Take a look at the SeqIO how-to for more.

*Update: as Stefano points out, this will be slow for large chromosomes; see his answer for another solution.*

ADD COMMENT • link 14.3 years ago by Neilfws 49k

0

Entering edit mode

This is so simple.. thank you so much.. i tried seqio but didnot know about the subseq function..

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

0

Entering edit mode

I am struck and how can i do this for negative strand given start and end sites?

ADD REPLY • link 14.3 years ago by Empyrean ▴ 170

score 1 · Answer 3 · 2011-04-15

Hi. Definetly use BioPerl.

The answer from neilfws is OK, the problem is that when you work with actual chromosomes (veeeery long sequences), and you want to extract many substring, that approach takes FOREVER

you need Bio::DB::Fasta This is adapted from a script I wrote time ago

use Bio::DB::Fasta;
use Bio::SeqIO;
use Bio::Seq;

my $chrName = 'chr1';
my $start = 10;
my $end = 30;
my $fullGenomeFastaFile = 'genome.fa';

# index (only the first time) and connect to the fasta file
my $db = Bio::DB::Fasta->new( $fullGenomeFastaFile );
# prepare printing device (STOUT)
my $out = Bio::SeqIO->newFh(-format => 'Fasta' );
# prepare new name
my $newId = join('_', $chrName, $start, $end);
# extract target sequence
my $subSeq = $db->seq($chrName, $start => $end);
# create object with target
my $seqobj = Bio::Seq->new( -display_id => $newId, -seq => $subSeq);
# print it out
print $out $seqobj;

score 1 · Answer 4 · 2011-04-15

I'm not going to offer any code - that's been done. I just want to say that it is important to keep in mind what you wish to do with the results from a parsing of a chromosome sequence. Many genomes are not 100% complete and are so because it is difficult to sequence to high accuracy the telomeres (ends) and the centromere of the chromosome. Not all chromosomes from all organisms have telomeres and centromeres. Nonetheless, when they are there and when they are incompletely sequenced, the chromosome sequence itself is a bit contrived or artificial and may contain blocks of Ns to designate undefined sequence. The length of those blocks is often not known and thus your coordinates may be inaccurate. This could be irrelevant if you simply want the parsed sequenced, but if you want such a partial sequence to link with other information, you could run into problems.

Just something to keep in mind. Few chromosome sequences are absolutely complete.

Neilfws · Answer 5 · 2011-04-15

It is also easy in R/bioconductor`(http://www.bioconductor.org/).

Here, the codes:

library(Biostrings)

# yourdata.fa can have one or mutiple chromosomes included.
x <- read.DNAStringSet("/full_path_to/yourdata.fa")

x # names and length for the sequences in your fasta file
names(x) # get sequence names
start.pos <- 1 # here can be a vector of values for start position
end.pos <- 100 # here can be a vector of values for the corresponding end position

# a is a vector of string for the sequences you want   
a <- getSeq(x, 'names(x)[1]', start=start.pos, end=end.pos) # a is a vector of string  for the sequences you want

writeFASTA(a, file="a.fa") # write the seq to disk

# you could need to read the manu for some functions
?readFASTA
?writeFASTA
?getSeq
?read.DNAStringSet

score 0 · Answer 6 · 2011-04-15

0

Entering edit mode

14.3 years ago

Pierre Lindenbaum 166k

with one command line:

 > grep -v '^>' < hg18/chr1.fa | tr -d "\n" | cut -c 2-10
 aaccctaac

of course, it only works if there's only one sequence in the fasta file and if you don't need have to extract too many substrings.