Hello all,
I'm sure everyone here has heard about the FASTA file format - today I discovered that this format is actually older than I am, having been "defined" in 1987 or earlier.
From what I can glean from the original paper entitled "Improved tools for biological sequence comparison", FASTA was an alignment program that could work on both FASTP (protein) and FASTN (nucleic acid) sequences; thus, "FAST All". It would read and write sequence data in the FASTA format we all know and love.
So - I'm curious to know from some of the more established members of the community how popular the FASTA program was back in the day. Was it the program's popularity that pushed the usage of the FASTA file format, or was the FASTA program not particularly exciting, but the file format was good enough to hold its own against the other standards of the time?
Thank you for your time, it's much appreciated :)
The FASTA similarity searching program was as popular as, or perhaps more popular than, BLAST until BLAST was used for many of the annotations of the Drosophila and human genomes in 2001. Even 10 years ago, I believe there were more FASTA searches done at the EBI than BLAST searches. And the fact that BLAST used the "FASTA format" for query sequences helped the format persist, even though the program itself is less well known. The major benefit that BLAST had initially was explicit (though not so accurate) expectation values. Later, it gained market share because the NCBI only used BLAST.
BLAST is certainly faster, and marginally more sensitive with default parameters for proteins, but less sensitive for DNA. FASTA uses more sophisticated methods for incorporating frame-shifts in protein vs translated DNA alignments. Today, a major advantage of the FASTA programs is the ability to use a wider range of scoring matrices, and facilities to merge annotations into alignments (reference).
The FASTA format has persisted because it is incredibly simple and flexible. All the other sequence formats available at the time were, in some sense, "punch-card" (fixed fields) based.
Since you are the right person to clarify this .. would you mind confirming if you invented/defined the FASTA file format? Was there a formal specification that was proposed/published for FASTA format?
Thank you for adding your perspective!
Yes, David Lipman and I invented "FASTA" format when we wrote "FASTP".
When David Lipman and I wrote the FASTP program (in the fall of 1983) for protein sequence similarity searching, there were only two protein sequence datasets: the PIR protein database (originated by Margaret Dayhoff) and Russ Doolittle's "NEWAT" protein database. (GenBank existed for DNA, but there was no equivalent protein database.) Because the PIR was in Washington DC and we had some connections to it, we got a copy of the PIR database, which was formatted like this:
The problem with this format was that people kept forgetting the "description" line before the sequence, so when they searched their databases, they lost the first line of sequence. We decided to simplify things by putting the description on the same line as the '>'.
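To illustrate the simplified layout described above, here is a minimal Python sketch of a reader for the format as it is commonly used today: the description follows the '>' on the same line, and every following line up to the next '>' is treated as sequence. The function name `read_fasta` and the two example records are made up for illustration; this is not code from the FASTP or FASTA programs.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    description = None  # description of the record currently being read
    chunks = []         # sequence lines collected for that record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record, so emit the previous one first.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Any other line belongs to the current record's sequence.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# Hypothetical records, invented for this example.
example = """>seq1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
>seq2 another made-up record
ACGTACGT
"""

records = list(read_fasta(example.splitlines()))
for description, sequence in records:
    print(description, len(sequence))
```

Because the description sits on the '>' line itself, a reader never has to guess whether the first line after the marker is a description or sequence, which is the mistake the older layout invited.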
Was the format named "FASTA" because it was the default input format for the FASTA suite of programs?
Yes. FASTA (as a program) was more popular than FASTP, because it worked with both DNA and protein sequences, and did primitive translated alignment (and there were many more sequences in 1988 compared with 1985). So FASTA stuck. I suppose that, between 1985 and 1988, FASTA format might have been called FASTP format (since there was no FASTA). To support the many sequence database distributions of the time, both FASTP and FASTA supported (and still support) several different sequence library formats.
I have so many questions, Professor -- it's not every day you get the opportunity to talk to someone who shaped the entire field! At the risk of sounding like a door-to-door evangelist - would you have a few moments to talk about Bioinformatics? :)
I will try to contain my excitement and just stick to 2 questions:
Your education, if I'm not mistaken, was Chemistry -> Biochemistry -> Molecular Biology. To this day, a typical PhD student in any of these fields would be unlikely to know how to program, let alone be able to write software as influential as what you and Dr. Lipman wrote nearly 30 years ago. It is therefore a frequent topic of discussion among Bio/Chem students whether it is good to "diversify" and learn programming - running the risk of becoming the jack of all trades but the master of none - or to continue to specialize in their main field and essentially leave the "computer stuff" to the computer scientists/bioinformaticians. This is perhaps not a proper question, but I would be very interested to hear your thoughts on which direction young researchers and educators should move toward.
As a sort of follow-up to question 1, it is very hard for some of us to reflect on the field as a whole, as many have not been here for all that long. The probability of staying in Academia for some young PhD students is perhaps not as high as many would like, yet PhD/MSc students make up a significant proportion of the work force. It is therefore difficult, when you do not have the experience, to get an idea of what problems are perhaps the most pressing for the field as a whole. So my question is, in very broad terms, what do you think the field of Bioinformatics could be doing better? What areas would you suggest young Bioinformaticians take a second look at, to see if they can build something there?
(1) I took a programming course as an undergraduate, and had a summer job writing a "future simulation" game on the PLATO teaching computer, a machine with very little memory, so I learned how to pack small integers into a larger word (something that helped when writing FASTP and FASTA). As a graduate student, I took a "minor" in Computer Science (with some applied math) which introduced me to algorithms and structured programming. When I was a molecular biology graduate student (and I do not think this has changed much), graduate school was largely an apprenticeship at the bench to learn how to purify things and analyze their properties. Graduate students learned at the bench, not in the classroom, but my minor greatly supplemented my bench learning. As a graduate student, I believed that I could pick up virtually any bench technique; the problem was to pick an interesting problem and figure out a novel way to explore it.
I do not think that scientific research is a set of skills, so that one should worry about being a "master," or not. The challenge of science is critical thinking - identifying phenomena or theories that "don't make sense" because they are poorly understood or incomplete (or just wrong). I think of my success as more the result of a useful flavor of informed skepticism (or naive intuition), rather than a particular skill set or knowledge base. I think biologists should learn some programming because it allows them to ask their own questions about large datasets - to look for oddities (most of which will be artifacts) and filter them to sometimes find new knowledge. Here, I strongly agree with Sean Eddy (Sequencing for Neuroscience) when he states that "Biologists need to do their own data analysis" and "Scripting is a lab skill, like pipetting." Experimental biologists learn a kind of skepticism, because their biological system is constantly "tricking" them, that is very difficult to acquire in other disciplines.
(2) I did not appreciate until well into graduate school how egotistical one must be to be a scientist. I just liked doing experiments. Scientists believe that they can discover things that the other smart people in the field missed, or thought were unimportant. Science is not engineering - you cannot know whether the problem you study can actually be "solved" (perhaps this is true for engineering as well). I ended up doing the things I did because I really enjoyed the work, and I could (sometimes) see how it might make a contribution. I cannot overemphasize how important it is to find a problem (or approach) that grabs you emotionally -- one that you really enjoy working on (because you will do an enormous amount of work, most of which does not get published). I think many scientists first find something they like to do, and then figure out a way to make what they are doing relevant to a larger problem. No doubt it is better to choose an important problem that lets you do something you like, but I have been fortunate to be able to do several things that I really liked, and then find uses for them. So I'm not comfortable suggesting problems, but I think there are a lot of results in bioinformatics that do not make complete sense, so there are lots of opportunities. But pick one that "grabs" you.
This is a bit off-topic (though related) but I've been trying to figure out what the FAST in FASTA, FASTQ, etc. stands for and can't find any answer. Can you shed some light on this, Bill?
FASTP and FASTA (and I assume FASTQ, but I did not name it) are NOT acronyms. We picked "FAST" because our method was much faster than previous methods (at the time, we reduced search time from 24hr on a VAX750 to about 5 min), the P was for "protein", later A was for "All" (there was also a FASTN for DNA).
See this guide for the FASTA programs from Bill Pearson himself. The first page lists advantages of the FASTA programs over BLAST.
A look back at sequence alignment from The Scientist.