Question

Fasta Format Of Chromosomes And Biopython

0

13.1 years ago

Ma ▴ 140

Hi, I am new to bioinformatics and need to extract a portion of a sequence in FASTA format. I downloaded the FASTA file for chromosome 22, hs_alt_HuRef_chr22.fa.gz, from ftp.ncbi.nih.gov/genomes/, and I need to extract approximately nucleotides 23522552 to 23660200 from it. How can I do that using BioPython? Also, what do these headers inside the FASTA file refer to?

gi|157812454|ref|NW_001838719.2| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN_1103279049253, whole genome shotgun sequence

gi|157697908|ref|NW_001838720.1| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN_1103279105977, whole genome shotgun sequence

In particular, do those numbers refer to the nucleotide position reached at that point in the file?

Thanks for your help

chromosome biopython • 6.2k views

ADD COMMENT • link updated 13.1 years ago by Damian Kao 16k • written 13.1 years ago by Ma ▴ 140

0

StackExchange works better with one question at a time - you only get to "accept" one person's answer.

ADD REPLY • link 13.1 years ago by Peter 6.0k

score 4 · Answer 1 · 2011-12-03

4

13.1 years ago

Damian Kao 16k

You have the human assembly from the J. Craig Venter Institute (HuRef). It is an alternate assembly, distinct from the reference assembly that NCBI distributes. Unless you specifically need the HuRef version, you might as well stick with the reference version.

To get a region from the fasta file with BioPython:

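from Bio import SeqIO

# Placeholders: fill in the record ID and 1-based, inclusive coordinates.
target_id = "sequence id you want to extract from"
start, end = 1, 100  # example values

with open("path to your fasta file") as in_file:
    for record in SeqIO.parse(in_file, "fasta"):
        if record.id == target_id:
            # Python slices are 0-based and end-exclusive, so the 1-based
            # range start..end maps to [start - 1:end].
            print(record.seq[start - 1:end])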

So let's say I want to extract nucleotides 100 to 200 from one of the sequences you listed: gi|157697908|ref|NW_001838720.1| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN_1103279105977, whole genome shotgun sequence

I would:

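from Bio import SeqIO

with open("hs_alt_HuRef_chr22.fa") as in_file:
    for record in SeqIO.parse(in_file, "fasta"):
        # record.id is the header text up to the first whitespace.
        if record.id == "gi|157697908|ref|NW_001838720.1|":
            # Nucleotides 100 to 200 (1-based, inclusive) are slice [99:200].
            print(record.seq[99:200])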

ADD COMMENT • link 13.1 years ago by Damian Kao 16k
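
The file named in the question is gzip-compressed (.fa.gz), while the example above opens an uncompressed .fa file. One option is to decompress it first. Another, sketched here, is to open the compressed file in text mode with the standard gzip module and pass that handle to SeqIO.parse in place of the plain file handle. The helper name open_fasta is hypothetical and not part of Biopython.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file for reading as text, gzip-compressed or not.

    Hypothetical helper: it decides by file extension only.
    """
    if path.endswith(".gz"):
        # "rt" makes gzip decode bytes to str, which text parsers expect.
        return gzip.open(path, "rt")
    return open(path, "r")
```

With this helper, the loop would start with open_fasta("hs_alt_HuRef_chr22.fa.gz") instead of open(...), and the SeqIO.parse(in_file, "fasta") call stays the same.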

1

You might find the Bio.SeqIO.index(...) or index_db(...) functions more concise than that loop approach.

ADD REPLY • link 13.1 years ago by Peter 6.0k
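
A minimal sketch of the index-based approach mentioned in the comment above, assuming Biopython is installed. SeqIO.index builds a dictionary-like lookup keyed by record ID, so a single record can be fetched without looping over the whole file. The helper names to_slice and fetch_region are hypothetical, and the coordinates are assumed to be 1-based and inclusive.

```python
def to_slice(start, end):
    """Convert 1-based inclusive coordinates to a Python slice."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return slice(start - 1, end)


def fetch_region(fasta_path, record_id, start, end):
    """Return bases start..end (1-based, inclusive) of one record as str."""
    # Imported here so to_slice stays usable without Biopython installed.
    from Bio import SeqIO

    index = SeqIO.index(fasta_path, "fasta")  # dict-like, keyed by record.id
    try:
        return str(index[record_id].seq[to_slice(start, end)])
    finally:
        index.close()
```

A call might look like fetch_region("hs_alt_HuRef_chr22.fa", "gi|157697908|ref|NW_001838720.1|", 100, 200). As far as I know, SeqIO.index needs an uncompressed (or BGZF-compressed) file, so a plain .gz download would have to be decompressed first.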

0

Thanks for your help @DK. By the way, one question: does record.seq take into account the lines starting with >? I ask because if a header line (a record.id) falls between the sequences I want to retrieve, the characters in that header line would be counted as well.
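
On this point: in Biopython, record.seq holds only the sequence letters of that one record. The header line (starting with >) and the line breaks are not part of it, and each > line starts a new record, so positions are counted separately within each record. The sketch below is a simplified pure-Python reader, not Biopython's implementation (read_fasta is a hypothetical name), that illustrates this behaviour on made-up data.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA text handle.

    Simplified illustration: the ID is the header text up to the first
    whitespace, and sequence lines are joined without line breaks.
    """
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # A new header starts a new record; it is never added to chunks.
            fields = line[1:].split()
            record_id, chunks = (fields[0] if fields else ""), []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


# Made-up example data, not taken from the HuRef file.
example = io.StringIO(">contig1 first\nACGT\nAC\n>contig2 second\nGGG\n")
records = dict(read_fasta(example))
print(records)
```

Here contig1 has length 6 and contig2 has length 3: neither header contributes any characters, and slicing one record never reaches into the next.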