You might consider how Google approached building a search engine using the HTTP protocol.
HTTP has almost no notion of metadata, outside of perhaps the MIME type, the stream size, and a last-modified tag. Even then, the MIME type is specified entirely at the discretion of the server or developer, and the rest of the metadata is mostly useless for scientific work. It's safe to assume that whatever you retrieve will have the wrong MIME type; you might be pleasantly surprised if it does match and you can actually interpret what you get back.
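To make that concrete, here is a minimal sketch (Python, with a placeholder URL rather than any specific database endpoint) of roughly everything a plain HTTP HEAD request will tell you about a file:

```python
# Minimal sketch: roughly all the "metadata" a plain HTTP HEAD request
# exposes. The URL is a placeholder; swap in any database download link.
from urllib.request import Request, urlopen

req = Request("https://example.org/", method="HEAD")
with urlopen(req) as resp:
    # Content-Type (the MIME type) is whatever the server chooses to send,
    # and Content-Length / Last-Modified may be missing entirely.
    for header in ("Content-Type", "Content-Length", "Last-Modified"):
        print(header, "->", resp.headers.get(header))
```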
Maybe you'll be lucky and some developer adds domain-specific custom x-*
headers to HTTP responses. I will bet that practically no one does this for public services, because there is virtually no agreed-upon standard for custom HTTP headers for biological databases shared via REST or other HTTP-trafficked services.
Shrug. The "shruggle" is real.
The way web search engines work is by downloading, processing, and indexing. There's almost nothing inherent in HTTP to help search engines with the kind of searches 99.9999% of all users make.
In your case, perhaps, you might supplement FTP gets with HTTP GETs, making use of the content offered by databases and the content in cited publications (including the citations themselves, and where they point) as a way to tag or index resources for searching, as well as drawing a weighted graph of interconnected or related resources in order to rank search results.
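As a rough illustration of the weighted-graph part (the resource names and edge weights below are invented, and PageRank is only one of several possible ranking schemes you could run over such a graph):

```python
# Minimal sketch of ranking resources via a weighted link/citation graph.
# The resource names and edge weights below are invented for illustration.
import networkx as nx

g = nx.DiGraph()
# An edge (a, b, w) could mean "material associated with resource a points
# to (or cites) resource b with weight w".
g.add_weighted_edges_from([
    ("genome_db_A", "annotation_db_B", 12),
    ("annotation_db_B", "genome_db_A", 3),
    ("expression_db_C", "genome_db_A", 7),
    ("expression_db_C", "annotation_db_B", 2),
])

# One crude relevance score per resource; a real engine would combine this
# with text/metadata matching against the user's query.
scores = nx.pagerank(g, weight="weight")
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}\t{score:.3f}")
```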
You're effectively rebuilding a web search engine at this point, but as you have domain knowledge (i.e., a background in the biological sciences) that most (nearly all, in fact) Google engineers do not, maybe you can build a better biology-focused search engine informed by your knowledge and what you know would be useful to other biologists.
Build a better mm10 trap!
It looks like you're trying to build a registry or catalog of biological databases. This has been done several times before with mostly limited success.
Airport.bio works well but is still in its early stages.
It's not that the previous attempts didn't work. It's mostly that they didn't get many users, and many eventually died because they were not maintained: biological databases are continuously being produced, change URLs as labs move, and tend to have a short life on the web, so any catalog has to be continually updated or it loses relevance. If you want your resource to be useful, you have to offer a service that's better than, e.g., Google as a baseline, both in terms of usability and the information provided.
This is why we added a "Suggest new server" button. You can always add new databases as they come out (e.g., with new publications) or change the URL of existing databases as needed (if they move). Google's crawlers don't crawl this deep, and Google doesn't index FTP the way it indexes HTTP.
This is precisely what nobody did in previous attempts.
FTP as in file transfer protocol?
Yes, that's correct.
Serious question: what is the benefit of this tool? Would you keep checking to make sure those FTP links are not stale? Would there be a free form search interface that would suggest multiple sites?
What do you mean by "Would there be a free form search interface that would suggest multiple sites?"
Someone would type "rat genes" and you would show them available databases. Or am I missing the scope of the tool?
My ultimate goal is to allow users to type any query into the search bar and have it search the metadata of all the files in all the selected biological databases. However, I'm seeking advice on how to integrate metadata into the current FTP search engine. FTP is a great protocol for connecting to multiple databases at once and searching them in parallel. However, FTP does not include any metadata describing the files beyond the file paths themselves (i.e., the full paths of the files). Certain directories include a README file describing their contents, but that's pretty much it. Nothing like the level of metadata description you would expect from tools like metaSRA or GEO.
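For what it's worth, here is a minimal sketch of the parallel-FTP part, and of how little comes back: a bare listing of names, plus a README if one happens to exist. The hosts and paths are examples only, not necessarily what the site actually indexes.

```python
# Minimal sketch: list one directory on several FTP servers in parallel and
# grab a README if one exists (roughly the only "metadata" FTP offers).
# The hosts and paths are examples, not necessarily what the site indexes.
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP, error_perm
from io import BytesIO

TARGETS = [
    ("ftp.ensembl.org", "/pub/current_gtf"),
    ("ftp.ncbi.nlm.nih.gov", "/genomes"),
]

def list_dir(target):
    host, path = target
    with FTP(host, timeout=30) as ftp:
        ftp.login()                      # anonymous login
        names = ftp.nlst(path)           # bare names only, no descriptions
        readme = None
        for candidate in ("README", "README.txt"):
            buf = BytesIO()
            try:
                ftp.retrbinary(f"RETR {path}/{candidate}", buf.write)
                readme = buf.getvalue().decode("utf-8", errors="replace")
                break
            except error_perm:           # no README under this name
                pass
        return host, names, readme

with ThreadPoolExecutor(max_workers=len(TARGETS)) as pool:
    for host, names, readme in pool.map(list_dir, TARGETS):
        print(host, len(names), "entries,", "README found" if readme else "no README")
```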
In what sense is FTP searchable? What do you mean by "searching via FTP"? AFAIK, FTP has no support for searching.
It searches through a database of already traversed paths and connects to the respective path via FTP.
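Concretely, something like the sketch below (the SQLite schema and sample paths are purely illustrative): the query itself never touches FTP; it runs against paths crawled earlier, and FTP is only used to fetch whatever matches.

```python
# Minimal sketch: the "search" runs over a local index of already-traversed
# paths; FTP is only used afterwards to fetch the matches. The schema and
# sample rows below are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE paths (host TEXT, path TEXT)")
db.executemany("INSERT INTO paths VALUES (?, ?)", [
    ("ftp.example.org", "/pub/rat/rn6/genes.gtf.gz"),
    ("ftp.example.org", "/pub/mouse/mm10/genes.gtf.gz"),
    ("ftp.other.org",   "/genomes/rat/assembly.fa.gz"),
])

query = "rat"
hits = db.execute(
    "SELECT host, path FROM paths WHERE path LIKE ?", (f"%{query}%",)
).fetchall()
for host, path in hits:
    # The client would connect here via ftplib and retrieve the file.
    print(f"ftp://{host}{path}")
```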
Your descriptions in general, and the presentation of the service itself, suffer greatly from the wrong kind of specificity. There is little information about your service from the point of view of what an end user needs.
This has nothing to do with it being an "early" attempt or not.
There is a lack of clarity in explaining, in simple terms, what the service does. I tried it a few times, read through this thread, and still do not understand the very basic idea behind it: what does this do?
When you say:
This explanation is a good example of a recurring pattern of writing answers that have little actionable information: