Question

Where can I find the complete genome of any organism?

0

7.8 years ago

saranpons3 ▴ 70

Hello all, could anybody let me know where I can find a large complete genome of any organism? I got the complete genome of Enterobacteria phage lambda from https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/lambda_virus.fa. I would like to get the complete genomes of some more organisms in the same way.

Thanks in advance.

genome complete • 2.3k views

ADD COMMENT • link updated 7.7 years ago by Biostar 20 • written 7.8 years ago by saranpons3 ▴ 70

score 3 · Answer 1 · 2017-01-27

3

7.8 years ago

Sej Modha 5.3k

Any virus RefSeq genome can be downloaded from the NCBI FTP site. If you're interested in a virus genome for which a RefSeq genome does not exist, visit NCBI, search for the organism of interest, and download the genome sequence from the NCBI browsing page.

For more info, visit: https://www.ncbi.nlm.nih.gov/guide/howto/dwn-records/

ADD COMMENT • link 7.8 years ago by Sej Modha 5.3k

0

Thanks.

I wanted to download the complete genome of Drosophila melanogaster (fruit fly). While searching, I ended up at ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/ and there are a lot of files there. Which one should I download?

ADD REPLY • link 7.8 years ago by saranpons3 ▴ 70

0

This file (GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz) has the genome sequence. If you need the protein sequences, download the file with faa in its name.

The README.txt file at the link has information about the files in that directory.

ADD REPLY • link 7.8 years ago by GenoMax 147k

0

Thanks for pointing out which file to download.

When I downloaded and opened the GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna file (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz), I could see many lines starting with the character '>'.

I downloaded the complete genome of Enterobacteria phage lambda from https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/lambda_virus.fa. In that file, only one line starts with the character '>', so I expected GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna to have only one such line as well. I could also see that many A, T, C, G are in lower case, and that there are many 'N's. Why do many lines start with the character '>', why are many A, T, C, G in lower case, and why are there many N's in the file ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz?

Actually, I am trying to take the complete genome of any organism and give it as input to my Perl script, which will randomly generate reads of the same length from the genome. Once the reads are generated, I will give them as input to the assembler I designed and try to assemble them back into the original complete genome. In my assembler, I'm not implementing de Bruijn graph simplification and so on.
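The read-generation step described here can be sketched in a few lines. This is only an illustration in Python (the poster's Perl script is not shown in the thread), assuming error-free reads, uniformly random start positions, and a single sequence string already held in memory. The function name `simulate_reads` and its parameters are hypothetical.

```python
import random

def simulate_reads(genome, read_len, n_reads, seed=None):
    """Return n_reads substrings of length read_len drawn uniformly at random."""
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # a fixed seed makes a run reproducible
    last_start = len(genome) - read_len  # last valid 0-based start position
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(genome[start:start + read_len])
    return reads

if __name__ == "__main__":
    toy_genome = "ACGTTGCAACGTAGCTAGGCTA"  # made-up sequence, not real data
    for read in simulate_reads(toy_genome, read_len=5, n_reads=4, seed=1):
        print(read)
```

Every read is an exact substring of the input, so whatever the input contains (including lower-case letters or N's) will also appear in the reads.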

ADD REPLY • link 7.8 years ago by saranpons3 ▴ 70

1

"I could see many lines starting with the character '>'."

This is called a multi-FASTA file. It is used to represent more than one FASTA sequence in a single file (e.g. the multiple chromosomes, scaffolds, contigs etc. that together may represent a genome).
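To make the format concrete, here is a minimal Python sketch that splits such a file into one record per '>' line. It assumes plain FASTA text (a header line starting with '>' followed by one or more sequence lines); the function name `read_fasta` is hypothetical, and established libraries offer more robust parsers.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record

if __name__ == "__main__":
    toy = [">seq1 first toy record", "ACGT", "ACGT", ">seq2", "GGCC"]
    for name, seq in read_fasta(toy):
        print(name, len(seq))
```

For a gzipped download, the lines could be supplied by opening the file with `gzip.open(path, "rt")` from the standard library and passing the file object to `read_fasta`.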

"Also, I could see many 'N's"

These generally represent regions where the sequence is unknown, incomplete, or difficult to sequence accurately (e.g. centromeric and telomeric regions). They are used to represent parts of the genome that are expected to be present but are missing.
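To see where such gaps sit in a sequence, one option is to list each run of N's with its coordinates. The Python sketch below assumes the sequence is already loaded as a string; the function name `n_runs` is hypothetical.

```python
import re

def n_runs(seq):
    """Return (start, end) half-open, 0-based intervals for each run of N/n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]

if __name__ == "__main__":
    toy = "ACGTNNNNACGTnnAC"  # made-up sequence with two gaps
    print(n_runs(toy))  # prints [(4, 8), (12, 14)]
```

Summing `end - start` over the intervals gives the total number of unknown bases in that sequence.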

ADD REPLY • link 7.8 years ago by GenoMax 147k

0

Thanks for the answer.

ADD REPLY • link 7.8 years ago by saranpons3 ▴ 70

0

In the complete genome file, I could see that A, T, C, G are in lower case in many places. The reason given in README.txt is that repetitive sequences in eukaryotes are masked to lower case. If I want to generate random/simulated reads from this complete genome, should I convert the lowercase letters to uppercase and then simulate?
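For reference, the conversion asked about is a one-line operation, since lower-case letters in a soft-masked file denote the same bases as their upper-case forms. The Python sketch below shows the conversion and a way to measure how much of a sequence is masked; it does not settle whether converting is the right choice for a given simulation. The function names are hypothetical.

```python
def soft_masked_fraction(seq):
    """Fraction of characters in seq that are lower case (soft-masked)."""
    if not seq:
        return 0.0
    return sum(1 for c in seq if c.islower()) / len(seq)

def unmask(seq):
    """Return seq with soft-masked (lower-case) bases converted to upper case."""
    return seq.upper()

if __name__ == "__main__":
    toy = "ACGTacgtNNAC"  # made-up sequence
    print(soft_masked_fraction(toy))
    print(unmask(toy))
```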