Hi, I am trying to use the METLIN small molecule dataset for ML, which provides the small molecules as SDF files. However, my downstream model requires a sequence as input. Is there any way to go from SDF to sequence files (e.g., something like a FASTA file)? Thanks for the help!
SDF files contain molecular structures, not sequences.
What you could do is convert the SDF into PDB format using any of the many available converters (maybe via Open Babel?), then chuck those PDB files into Foldseek.
Then pull out the sequences of the top hits.
Edit:
Looking at the data in the SDF file more closely, I don't think what you're trying to do is possible. The SDF file in your figshare link contains PUBCHEM_COMPOUND_IDs, and you can look up the molecules from those. Here's the link for the first compound: https://pubchem.ncbi.nlm.nih.gov/compound/5139
That's neither a protein nor a gene; it's C3H8N2S, a small molecule, so there is no residue sequence to extract.
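In case it helps, here's a rough sketch of how you could pull those compound IDs out of the SDF yourself and build the PubChem links, using only the Python standard library. The tag name `PUBCHEM_COMPOUND_CID` in the usage comment is my assumption about how the field is spelled in the file (check the `> <...>` lines in yours), and `iter_sdf_field`, `pubchem_url`, and `compounds.sdf` are just hypothetical names for illustration.

```python
import re
from typing import Iterable, Iterator


def iter_sdf_field(lines: Iterable[str], field: str) -> Iterator[str]:
    """Yield the first value line of `field` for each matching SDF data item."""
    # SDF data items start with a header line such as "> <FIELD_NAME>";
    # the value sits on the line(s) that follow it.
    header = re.compile(r"^>.*<" + re.escape(field) + r">")
    take_next = False
    for line in lines:
        if take_next:
            yield line.strip()
            take_next = False
        elif header.match(line):
            take_next = True


def pubchem_url(cid: str) -> str:
    """Build the PubChem compound page URL for a compound ID."""
    return f"https://pubchem.ncbi.nlm.nih.gov/compound/{cid}"


# Usage (hypothetical filename; field name is an assumption):
# with open("compounds.sdf") as handle:
#     for cid in iter_sdf_field(handle, "PUBCHEM_COMPOUND_CID"):
#         print(pubchem_url(cid))
```

This only reads the data fields and ignores the atom/bond block entirely, so it won't validate the structures; it's just a quick way to see what each record actually is.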