I am working with human_g1k_v37.fasta, which is available from the 1000 Genomes FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
I have split it into one file per chromosome.
The header for chromosome 1 looks like this:
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
So the record ID is 1, and the rest of the line ends up in the description.
My understanding is that the header breaks down as follows:
coord_system_name = chromosome
coord_system_version = GRCh37
seq_region.name = 1
seq_region.start = 1
seq_region.length = 249250621
seq_region.strand = 1
My question is: can Biopython read these values for me? I am only telling it that the file is of type "fasta". Do I have to parse the description manually by splitting on the colons, or does Biopython already have functions that do this?
Here is an example of the code I use to read in this file:
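from Bio import SeqIO

def read_fasta_file(filename, file_format="fasta"):
    # Print the ID, length, and full header line of every record in the file.
    # The "rU" open mode is gone in recent Python versions, so plain open() is used.
    with open(filename) as handle:
        for record in SeqIO.parse(handle, file_format):
            print("ID %s" % record.id)
            print("Sequence length %i" % len(record))
            print("Sequence desc %s" % record.description)
            # record.seq.alphabet was removed in Biopython 1.78, so it is not printed here.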
My main goal is to confirm that the file is GRCh37 and to determine which chromosome it contains. I have seen several file types store "annotations", something like key/value pairs, so I figured Biopython might have tools to get this information from any FASTA file.
You'll have to just parse the description with a regex. If you read through the README file that describes what you downloaded, you'll see that it's mostly GRCh37, with the MT sequence changed.
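As a rough sketch of the regex approach, something like the following could work for the colon-delimited header shown in the question. The function name parse_ensembl_description and the field names are made up for illustration, and treating the fifth field as an end coordinate (equal to the length when the start is 1) is an assumption about the header layout, not something Biopython defines.

```python
import re

# Matches the trailing "coord_system:version:name:start:end:strand" block of a
# header such as "1 dna:chromosome chromosome:GRCh37:1:1:249250621:1".
# The field names are illustrative assumptions, not a Biopython API.
HEADER_RE = re.compile(
    r"(?P<coord_system>[^:\s]+):(?P<version>[^:\s]+):(?P<name>[^:\s]+)"
    r":(?P<start>\d+):(?P<end>\d+):(?P<strand>-?\d+)\s*$"
)

def parse_ensembl_description(description):
    """Return the header fields as a dict, or None if the header does not match."""
    match = HEADER_RE.search(description)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("start", "end", "strand"):
        fields[key] = int(fields[key])
    return fields
```

With Biopython, the string to pass in would be record.description. Headers that do not follow this layout return None, so the caller can fall back to using the record ID alone.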
Thanks. I want my program to be able to read a FASTA file and identify which chromosome(s) it contains, so that it can work against the correct reference chromosome(s). My reference obviously has this colon-delimited header. Does a typical FASTA header, if there is such a thing, have a fairly standard way of identifying the chromosome? I suppose I could assume that everything my program handles is human, read only the ID, and ignore the rest of the header. Does that sound right?
The chromosome name is what follows the ">", so chromosome 1 in your case. The remainder of what you showed is typically not present. Don't expect anything other than a chromosome name. There is no general way to tell from a FASTA file what organism it came from or what version it is.
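To illustrate relying on the name alone, here is a minimal sketch that takes the first whitespace-delimited token after each ">" as the chromosome name. The helper chromosome_names is hypothetical; with Biopython the same value should be available as record.id.

```python
import io

def chromosome_names(handle):
    """Yield the first whitespace-delimited token of each FASTA header line."""
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            # An empty header (a bare ">") yields an empty name.
            yield tokens[0] if tokens else ""

# Small in-memory example; a real file opened with open() works the same way.
example = io.StringIO(
    ">1 dna:chromosome chromosome:GRCh37:1:1:249250621:1\nNNNN\n>MT\nGATC\n"
)
print(list(chromosome_names(example)))  # prints ['1', 'MT']
```

Naming conventions still differ between sources (for example "1" versus "chr1"), so any matching against a reference would probably need to normalize the names first.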
Devon, if you post your reply as an answer, I will accept it. Thank you for your help.
I just moved this stack of comments to an answer.