Hi, I need to compare normal chromosomes with translocated ones. I have read that, for example, if a break occurs in the middle of the bcr region of chromosome 22, the broken end joins a part of chromosome 9 known as ABL.
I got the complete chromosome 22 from ftp.ncbi.nih.gov/genomes/ and the name of the file is hs_alt_HuRef_chr22.fa. I searched NCBI for the BCR breakpoint cluster region and it gives the following data: Location : 22q11; 22q11.23 Sequence : Chromosome: 22; NC_000022.10 (23522552..23660224)
So I suppose that the bcr region is on the q arm of chromosome 22, from nucleotide 23522552 to nucleotide 23660224. I want to extract the sequence from this region (the middle of the bcr, as the theory about this translocation says) to the end of chromosome 22. For this purpose I have the following program in BioPython:
    from Bio import SeqIO
    inFile = open('c:\\chromosomes\\chr22.fa','r')
    for record in SeqIO.parse(inFile,'fasta'):
        print str(record.seq)[23522552:]
The problem is that the program does not print anything at all. If somebody could explain the reason to me, that would be nice. Please also take into consideration that I am not a biologist, so this question may seem pretty obvious to a lot of people here.
Thanks
----edit---- When I make the following modification in the program:
    from Bio import SeqIO
    inFile = open('c:\\chromosomes\\chr22.fa','r')
    s=0
    for record in SeqIO.parse(inFile,'fasta'):
        print len(record.seq)
for checking purposes, it prints the following: 6192 5131 3591 3614 5140 10578 6350 6403 4083 4254 11445 3938 11265 12450 18128 7444 11947
So, as you can see, the length of each sequence is different, and I suppose that is why I cannot reach position 23522552. However, when I use a simple sum like this:
    for record in SeqIO.parse(inFile,'fasta'):
        s=s+len(record.seq)
    print s
the total length of the FASTA file, that is, the number of nucleotides it contains, is 34103195. The problem is that I really need to make the cut at position 23522552, so that I can get rid of the part from there to the end of the file. Thanks
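The blank output can be reproduced without any FASTA file. The following is a minimal sketch in Python 3 syntax (the code in this thread is Python 2), using made-up toy sequences as stand-ins for the short records: slicing a string past its end returns an empty string instead of raising an error, so every short record prints as a blank line.

```python
# Toy stand-ins for the short records in the multi-record FASTA file
# (hypothetical sequences, not real data).
contigs = ["ACGTACGT", "GGCC", "TTAGGCAT"]

start = 23522552  # the offset used in the original slice

# Slicing past the end of a string does not raise an error; it returns "".
tails = [seq[start:] for seq in contigs]
print(tails)

# The offset only makes sense against one continuous sequence,
# not against each short record separately.
total = sum(len(seq) for seq in contigs)
print(total)
```

Summing the lengths gives a total, but that alone does not say the records are stored in chromosome order, so the total is not a safe basis for a cut position.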
The coordinates you posted were: Location : 22q11; 22q11.23 Sequence : Chromosome: 22; NC_000022.10. 'NC_000022.10' is the accession number of the sequence, so the coordinates refer to the sequence found here: http://www.ncbi.nlm.nih.gov/nuccore/NC_000022.10.
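As a rough sketch of how such coordinates map onto a Python slice: NCBI positions like 23522552..23660224 are 1-based and inclusive, while Python slices are 0-based and end-exclusive. The helper name `slice_1based` and the toy sequence below are hypothetical, and the sketch assumes the chromosome is held as one continuous string (Python 3 syntax).

```python
def slice_1based(sequence, start, end=None):
    """Return the part of `sequence` from `start` to `end`.

    Both positions are 1-based and inclusive, as in NCBI coordinates.
    If `end` is None, the slice runs to the end of the sequence.
    """
    if start < 1:
        raise ValueError("start must be >= 1")
    # 1-based inclusive start -> 0-based index start - 1.
    # 1-based inclusive end   -> 0-based exclusive index end (unchanged).
    return sequence[start - 1:end]


# Toy sequence (hypothetical), positions 1..12.
chrom = "AACCGGTTAACC"

region = slice_1based(chrom, 5, 8)  # positions 5 through 8
tail = slice_1based(chrom, 9)       # position 9 to the end
print(region)
print(tail)
```

With the real reference sequence loaded as a single record, the same helper would be called with 23522552 as `start`.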
There is a private human genome project started by Craig Venter and a public human genome project. HuRef is from the private project. Since the two projects used different assemblers and different sequence data, the results are going to be somewhat different. Most people just stick with the NCBI version. I think the HuRef genome browser can map the HuRef assembly to the NCBI assembly: http://huref.jcvi.org/. But going down that route seems like a hassle.
Are there no error messages? Is what you pasted exactly how it looks? Python requires correct spacing to indicate code blocks.
Hi @DK, the program only prints blank lines. Please see the edit that I have made to my original question so you can see what I have been trying so far. Thanks
Are you sure the sequences in that alternate assembly are in order? Also, the NCBI coordinates you found are for the NCBI assembly of the genome, not the HuRef (Craig Venter) version that you have.
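One way to see what the file actually contains is to list the header and length of every record. This is only a sketch in plain Python 3: the function name `fasta_lengths` is hypothetical, the parser is a simple hand-written one rather than Biopython, and the input below is a made-up two-record example.

```python
import io


def fasta_lengths(handle):
    """Yield (header, length) for each record in a FASTA-formatted handle."""
    header, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif line:
            length += len(line)
    # Emit the last record.
    if header is not None:
        yield header, length


# Made-up input; a real file would be opened with open(path) instead.
example = io.StringIO(">contig_a\nACGT\nAC\n>contig_b\nGGG\n")
summary = list(fasta_lengths(example))
print(summary)
```

Many short records with their own headers suggest a set of separate contigs rather than one continuous chromosome sequence.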
Sorry @DK, but I am a little bit confused. I have seen that in ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_22/ there are, for example, 2 FASTA files: hs_alt_HuRef_chr22.fa.gz and hs_ref_GRCh37.p5_chr22.fa.gz. Which one should I work with, and why? Thanks for your reply
Oh, I get it! Thanks @DK. Another question: what are the differences between the two files that appear in ftp.ncbi.nih.gov for each chromosome, I mean between hs_alt_HuRef and hs_ref_GRCh37.p5 for the same chromosome? If I want to make, for example, a comparison between chromosomes, would it be fine to use only those that start with hs_alt_HuRef? Thanks and sorry for the basic questions
Thanks again @DK, your comments are great! One last general question: I have seen that the FASTA file hs_alt_HuRef_chr22.fa contains, in certain parts, the following data:
I would really suggest that you read up on the genome assembly process, from library preparation to assembly. All genome projects sequence the genome in parts. The DNA is first broken up into small pieces, which are sequenced individually. It is then the assembler's job to piece it all together, joining contiguous regions into large super contigs. Essentially you end up with a pool of large sequences without any order. To assign these contigs to chromosomes, you can do a fluorescent hybridization to the chromosomes and visually observe where on the chromosome each contig lands, which lets you order the contigs.
By the way @DK, do you know how I can save my results to a FASTA file instead of just displaying them? Thanks
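A minimal sketch of writing a sequence to a FASTA file in plain Python 3 follows. The function name `write_fasta`, the header text, the output file name and the toy sequence are all hypothetical. With Biopython, `SeqIO.write(records, handle, "fasta")` does the same job for `SeqRecord` objects.

```python
import os
import tempfile


def write_fasta(path, header, sequence, width=60):
    """Write one sequence to `path` in FASTA format, wrapped at `width` columns."""
    with open(path, "w") as handle:
        handle.write(">" + header + "\n")
        # Break the sequence into fixed-width lines, as FASTA files usually do.
        for i in range(0, len(sequence), width):
            handle.write(sequence[i:i + width] + "\n")


# Toy example: a 160-base made-up sequence written to a temporary location.
out_path = os.path.join(tempfile.gettempdir(), "chr22_tail_example.fa")
write_fasta(out_path, "chr22_tail_example", "ACGT" * 40)
```

The result is a header line starting with `>` followed by the sequence split over several lines, which SeqIO.parse can read back with the 'fasta' format.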