EDIT: Since the data I'm looking for isn't available, my new question is whether it's possible to concatenate the sequence pieces from a fasta file that lists pieces of the sequence. How do I interpret what each part of the query template name means in the fasta file? I assume one of the numbers at the end refers to the chromosome and the others refer to start/end positions relative to the entire genome. If I know the start/end positions, I can put the pieces in order, noting the gaps in between. For instance, for individual Sid1253, this is a query template name and the sequence associated with it:
>M_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301
TGCTCAGGTGGAGTGAGGGGAAAATGTTTTCAGGTTGTATTAGTCAAAACAAAATA
OLD POST: I'm looking to download several (3 to 6) Neanderthal genomes which have been mapped to a human reference genome. The file format should be fasta. I've checked the Neanderthal Genome Project and found several bam files, which I converted to fasta. These are the links to them: ftp://ftp.ebi.ac.uk/pub/databases/ensembl/neandertal/BAM_files/ and http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/bam/
However, the fasta files list each individual's genome as snippets (from my interpretation; I've only begun to work with fasta formats). I think that those pieces can be concatenated together to give entire chromosome sequences, but I'm not sure how to do that. So I'm looking for the entire, long genome. More specifically, I'm looking for the chromosome-level sequence for each Neanderthal individual, where regions that haven’t been sequenced are masked as N's.
My questions are: 1. Where can I find this data? 2. If this data isn't available in the desired format, is it possible to concatenate together the sequence pieces from those links? How do I interpret what each part of the query template name means in the fasta file?
Thanks.
How did you convert them to fasta (by generating a consensus from the BAM)? Someone else who knows more will comment but I doubt you are going to find chromosome-size fasta files for Neanderthal genomes.
I used samtools then seqtk
This previous Biostars post will take you to some files available on ENA (Study: ERP000119) with the draft sequence of the Neandertal genome (over 3 billion nucleotides) from three individuals. Would it be of any use for what you are after?
Thanks, I'll take a look at it
The simple answer to the new question you posted after the edit is no.
I am reasonably certain that the example sequence you posted is a fasta format version of a standard Illumina fastq read (header). The original fastq headers have a specific meaning: they identify the position of the cluster on the flowcell (a specific lane and tile, at an x,y location) where the sequence originated, not a position in the genome.
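To illustrate the point about the header, here is a minimal Python sketch that splits an old-style Illumina read name into its fields. It assumes the older `instrument:lane:tile:x:y` layout (the one used before CASAVA 1.8); that this particular read follows it exactly is an assumption, and the function and field names are made up for the example.

```python
from typing import NamedTuple


class ClusterId(NamedTuple):
    """Fields of an old-style Illumina read name (hypothetical container)."""
    instrument: str  # instrument / run identifier
    lane: int        # flowcell lane
    tile: int        # tile within the lane
    x: int           # x coordinate of the cluster within the tile
    y: int           # y coordinate of the cluster within the tile


def parse_old_illumina_header(header: str) -> ClusterId:
    """Split 'instrument:lane:tile:x:y' into its parts.

    Assumes the older Illumina naming scheme; newer headers have more fields.
    """
    # Drop the leading '>' (fasta) or '@' (fastq) and any trailing description.
    name = header.lstrip(">@").split()[0]
    # Drop an optional '#index' or '/1', '/2' read-pair suffix.
    name = name.split("#")[0].split("/")[0]
    # Split from the right so colons in the instrument name would survive.
    instrument, lane, tile, x, y = name.rsplit(":", 4)
    return ClusterId(instrument, int(lane), int(tile), int(x), int(y))


print(parse_old_illumina_header(">M_SOLEXA-GA02_JK_PE_SL49:2:91:375:1301"))
# ClusterId(instrument='M_SOLEXA-GA02_JK_PE_SL49', lane=2, tile=91, x=375, y=1301)
```

None of these fields is a chromosome or a genomic coordinate, which is why the read names cannot be used to put the pieces in order.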
You may be able to get a consensus sequence from the BAM files (see: Generate consensus from BAM file), an approach you will have seen suggested everywhere, though that may not be the correct thing to do (otherwise the people who generated the sequences would have done it).
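As a rough illustration of what a consensus with masked gaps means, the toy Python sketch below places already-aligned reads on a reference-length string and writes N wherever no read covers a position. The start positions would have to come from the alignment records in the BAM (reference name and position), not from the read names. Everything here is hypothetical: the function name, the `(start, sequence)` input format, and the simple majority vote. It ignores indels, base qualities, and ancient-DNA damage, so it is not a substitute for a real consensus caller.

```python
from collections import Counter


def consensus_with_gaps(length, aligned_reads):
    """Majority-vote consensus over ungapped reads; uncovered positions become 'N'.

    length        -- length of the reference sequence (e.g. one chromosome)
    aligned_reads -- iterable of (start, seq) pairs, start being 0-based
    """
    # One counter of observed bases per reference position.
    counts = [Counter() for _ in range(length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            # Skip bases that fall outside the reference or are uncalled.
            if 0 <= pos < length and base != "N":
                counts[pos][base] += 1
    # Most frequent base where there is coverage, 'N' where there is none.
    return "".join(c.most_common(1)[0][0] if c else "N" for c in counts)


# Three made-up reads on a 16-base reference; two of them overlap.
reads = [(2, "ACGT"), (4, "GTTA"), (12, "CC")]
print(consensus_with_gaps(16, reads))
# NNACGTTANNNNCCNN
```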