My situation:
I am trying to download some files from the NCBI Genomes FTP site, ftp://ftp.ncbi.nih.gov/genomes/, for a specific organism (Anolis carolinensis in this case). Right now I am running a BLAST search against this organism to find out whether there is a significant match for a specific gene. When I run the search on refseq_genomic with the gilist for Anolis carolinensis (even with the smallest word size possible), BLAST does not return a significant match. I've been told this should probably not be the case, so I would like to run the search against other databases.
What I would like to know: I want to expand my search beyond refseq_genomic with my gilist, but I do not know which data is reliable or unreliable, or which data is most likely to show results. I do not know whether I should search against the Gnomon/ref_AnCar2.0_top_level.gff3.gz file, the Gff/ref_AnCar2.0_top_level.gff3.gz file, the RNA/* files, etc. The README file mentions that some of these are scaffold assemblies and that some sequences are whole genome shotgun sequences; are these reliable? I would also like to know whether other/pseudo_without_products.fa.gz is reliable. I know I can just google what a 'scaffold assembly' is, but that doesn't really tell me whether I should be searching against it.
Thank you
The gff3 files you are listing are just annotation. Take a look at this README to get an idea of what the different sequence file types are.
I feel like I do not have a good enough grasp on a lot of the things said in that document to interpret it correctly. I wouldn't have known to look up what annotation means if you hadn't mentioned it here. Are there a few words or phrases I should try to cherry pick within this README document?
Also, the gff isn't RefSeq? What does the "ref" in the name mean, then?
You can easily limit your BLAST searches at NCBI's site by entering the name/taxid of your organism (Anolis carolinensis (taxid:28377)) in the "choose search set --> Organism" field.
Do you have a single gene sequence that you are searching with? Is it DNA/protein?
GFF is not RefSeq. GFF is a file format for annotations.
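To make the "annotation, not sequence" point concrete, here is a minimal Python sketch of what a GFF3 feature line holds. The nine tab-separated columns are the standard GFF3 layout; the example line itself is made up for illustration and is not taken from the Anolis files.

```python
# The nine tab-separated columns of a GFF3 feature line.
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The attributes column is a semicolon-separated list of key=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if pair
    )
    return record


# Hypothetical feature line, invented for this example.
example = "scaffold_1\tGnomon\tgene\t1000\t2500\t.\t+\t.\tID=gene0;Name=hypothetical_gene"
record = parse_gff3_line(example)
print(record["type"], record["start"], record["end"], record["attributes"]["ID"])
```

The line only says where a feature sits on a sequence; the bases themselves live in the FASTA file, so a feature line like this gives BLAST nothing to search against.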
You may want to use the current RefSeq version of the Anolis genome for your blast searches. It can be found in this directory. DNA sequence is here and protein is here.
I have to perform a large number of searches, so I can't use the site for a lot of what I'm doing. What made you pick that file instead of some of the other files within GCF_000090745.1_AnoCar2.0? What is different about the genomic.gbff.gz file?
Since you need a FASTA-formatted sequence file for creating BLAST indexes, I selected the .fna file. The .gbff file is a GenBank-formatted version of the same genome. Read about what the different files are, here.
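For running many searches locally, a rough Python sketch of the workflow could look like the following. It assumes the NCBI BLAST+ command-line tools (makeblastdb, blastn) are installed and on your PATH, that your queries are nucleotide sequences, and that the file and directory names are placeholders of my own, not the real NCBI file names. The helper functions are hypothetical names for this example.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def decompress(gz_path, out_path):
    """Unpack a .gz download (e.g. the genomic .fna.gz) into a plain FASTA file."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def makeblastdb_cmd(fasta, db_name):
    """Build the makeblastdb command for a nucleotide database."""
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "nucl", "-out", str(db_name)]


def blastn_cmd(query, db_name, out_path, evalue=1e-5):
    """Build one blastn command with tabular output (-outfmt 6)."""
    return [
        "blastn", "-query", str(query), "-db", str(db_name),
        "-evalue", str(evalue), "-outfmt", "6", "-out", str(out_path),
    ]


def run_batch(query_dir, db_name, result_dir):
    """Run blastn once per *.fa file in query_dir (requires BLAST+ on PATH)."""
    result_dir = Path(result_dir)
    result_dir.mkdir(parents=True, exist_ok=True)
    for query in sorted(Path(query_dir).glob("*.fa")):
        out_path = result_dir / (query.stem + ".tsv")
        subprocess.run(blastn_cmd(query, db_name, out_path), check=True)
```

Usage would be roughly: call decompress("anocar_genomic.fna.gz", "anocar_genomic.fna"), run subprocess.run(makeblastdb_cmd("anocar_genomic.fna", "anocar_db"), check=True) once, then run_batch("queries", "anocar_db", "results"). If your queries are protein rather than DNA, tblastn would be the program to use against the same nucleotide database instead of blastn.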