Which files on the NCBI Genomes FTP site are the most reliable?
7.5 years ago
Jacob ▴ 10

My situation: I am trying to download some files for a specific organism (Anolis carolinensis in this case) from the NCBI Genomes FTP site, ftp://ftp.ncbi.nih.gov/genomes/. I want to run a BLAST search against this organism to find out whether there is a significant match for a specific gene. When I search refseq_genomic restricted with the gilist for Anolis carolinensis, BLAST returns no significant match, even with the smallest possible word size. I've been told this is unlikely to be correct, so I would like to run the search against other databases.

What I would like to know: I want to expand my search beyond refseq_genomic with my gilist, but I do not know which data are reliable or which are most likely to return results. I do not know whether I should search against the Gnomon/ref_AnCar2.0_top_level.gff3.gz file, the Gff/ref_AnCar2.0_top_level.gff3.gz file, the RNA/* files, etc. The README file mentions that some of these are scaffold assemblies and that some sequences are whole genome shotgun sequences. Are these reliable? I would also like to know whether other/pseudo_without_products.fa.gz is reliable. I know I can google what a 'scaffold assembly' is, but that doesn't tell me whether I should be searching against it.

Thank you.

genome sequence ncbi blast • 1.7k views

The gff3 files you list contain only annotation, not sequence. Take a look at this README to get an idea of what the different sequence file types are.


I feel like I do not have a good enough grasp on a lot of the things said in that document to interpret it correctly. I wouldn't have known to look up what annotation means if you hadn't mentioned it here. Are there a few words or phrases I should try to cherry pick within this README document?

Also, the GFF isn't RefSeq? What does the "ref" in the name mean, then?


You can easily limit your BLAST searches on NCBI's site by entering the name/taxid of your organism (Anolis carolinensis (taxid:28377)) in the "Choose Search Set --> Organism" field.

Do you have a single gene sequence that you are searching with? Is it DNA/protein?

GFF is not RefSeq. GFF is a file format for annotations.
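
To make the distinction concrete, here is a minimal Python sketch of what a GFF3 record holds: nine tab-separated columns describing where a feature sits on a sequence, with no nucleotide or protein sequence in them. The example record is made up for illustration and is not taken from the real AnCar2.0 annotation; `GffFeature` and `parse_gff3_line` are hypothetical names.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str        # scaffold/chromosome the feature sits on
    source: str       # program or database that produced the annotation
    type: str         # e.g. gene, mRNA, exon, CDS
    start: int        # 1-based start coordinate
    end: int          # 1-based inclusive end coordinate
    strand: str       # "+", "-" or "."
    attributes: dict  # key=value pairs from column 9


def parse_gff3_line(line: str) -> Optional[GffFeature]:
    """Return a GffFeature, or None for comments, directives and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    return GffFeature(seqid, source, ftype, int(start), int(end), strand, attributes)


# Made-up record for illustration only.
example = "scaffold_1\tGnomon\tgene\t1000\t5000\t.\t+\t.\tID=gene0;Name=hypothetical_gene"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["Name"])
```

Because a record like this carries only coordinates and labels, BLAST has nothing to align against in a .gff3 file; the search needs the matching FASTA sequence file.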

You may want to use the current RefSeq version of the Anolis genome for your blast searches. It can be found in this directory. DNA sequence is here and protein is here.


I have to perform a large number of searches, so I can't use the site for much of what I'm doing. What made you pick that file instead of the other files within GCF_000090745.1_AnoCar2.0? What is different about the genomic.gbff.gz file?


Since you need a FASTA-formatted sequence file to create BLAST indexes, I selected the .fna file. The gbff file is a GenBank-formatted version of the same genome. Read about what the different files are here.
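
For running many searches locally, a rough Python sketch of that workflow might look like the following. It assumes the BLAST+ command-line tools (`makeblastdb`, `blastn`, `tblastn`) are installed and on your PATH, and that the genome was downloaded under the usual NCBI file name `GCF_000090745.1_AnoCar2.0_genomic.fna.gz`; that file name, the database name `anocar2`, the query file `my_gene.fa` and the helper functions are all assumptions for illustration.

```python
import gzip
import shutil
import subprocess
from pathlib import Path
from typing import List


def decompress(gz_path: Path) -> Path:
    """Write foo.fna next to foo.fna.gz and return the new path."""
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def makeblastdb_cmd(fasta: Path, db_name: str) -> List[str]:
    """Command that builds a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "nucl", "-out", db_name]


def search_cmd(query: Path, db_name: str, out: Path,
               program: str = "blastn", evalue: float = 1e-5) -> List[str]:
    """Command for one search; use program="tblastn" for a protein query."""
    return [program, "-query", str(query), "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", str(out)]


if __name__ == "__main__":
    genome_gz = Path("GCF_000090745.1_AnoCar2.0_genomic.fna.gz")  # assumed name
    query = Path("my_gene.fa")                                    # hypothetical query
    have_tools = shutil.which("makeblastdb") and shutil.which("blastn")
    if genome_gz.exists() and query.exists() and have_tools:
        fasta = decompress(genome_gz)
        subprocess.run(makeblastdb_cmd(fasta, "anocar2"), check=True)
        subprocess.run(search_cmd(query, "anocar2", Path("hits.tsv")), check=True)
    else:
        # Nothing to run here, so just show the commands that would be used.
        print(" ".join(makeblastdb_cmd(Path("genome.fna"), "anocar2")))
        print(" ".join(search_cmd(query, "anocar2", Path("hits.tsv"))))
```

The database only has to be built once; after that, each additional query is one more `search_cmd` call in a loop. `-outfmt 6` writes tab-separated hits, which are easy to filter when you have many result files.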
