Hey everyone!
I launched a new chatbot today that is bioinformatics focused!
It's trained on bioinformatics content and should help debug / ideate much faster for you than vanilla ChatGPT. Check it out here: https://ai.tinybio.cloud/chat
Thanks!
It's very eager to provide code examples with Biopython, but the examples veer into the nonsensical at times:
It also shows a weird obsession with bringing up something called the "150 Tomato Genome ReSequencing project" without prompting, to the extent that it hallucinates it into URLs and citations. But, maybe that's just an artifact of my prodding it with obscure edge cases. (I'd be curious if anybody else had that come up!)
On that note it is evidently capable of providing references that actually exist (looking at you, vanilla ChatGPT) but gets cagey when you point out that the references don't actually support the claim:
In the interest of full disclosure: I'm very skeptical of getting trustworthy output from generative large language models no matter the approach to training, though I'm trying to stay open-minded for potential use cases.
This is awesome prodding. We will definitely try to address this use case, especially making it more explicit with citations back to papers. Also, it's very obvious that we need to add copy-and-paste functionality for everyone so that we don't need screenshots. Thanks for trying it out.
Thanks for sharing! I asked about score calculation and the answer is quite good, though the last example does not seem fully correct.
Maybe you could consider giving a talk about this work for the "sciBerloga" community - see previous talks: https://www.youtube.com/@SciBerloga , announcements: https://t.me/sberlogabig , chat: https://t.me/sberlogabio
Looks like the URL got garbled in the link somehow (unless it's just me?) -- should be: https://www.nature.com/articles/d41586-023-01833-0
Hey everyone - we're in the process of adding personalization features to the tool. Here is what we're thinking about right now:
General feedback and ideas are appreciated as well. We want to build the things that folks would actually find useful. We started a feature base board here (https://tinybio.featurebase.app/) for people to submit ideas to or reply to this answer.
I asked the same questions to tinyBio and chatGPT and chatGPT gave me better answers. This could be because chatGPT has been out for a while and learned more. This is a highly domain specific query by the way:
Question-1, tinyBio:
Same question, Chat GPT:
What tools can I use to preprocess raw reads that come from multiple species?
When dealing with raw reads from multiple species, there are several tools and approaches you can use for preprocessing. Here are some commonly used tools in bioinformatics:
Trimmomatic: Trimmomatic is a popular tool for trimming adapter sequences, removing low-quality bases, and filtering reads based on quality scores. It supports multiple sequencing platforms and can handle reads from different species.
Cutadapt: Cutadapt is another tool for adapter trimming and removing low-quality bases. It provides a flexible and customizable approach for handling reads from multiple species.
Fastp: Fastp is a fast and efficient tool for preprocessing next-generation sequencing data. It performs adapter trimming, quality filtering, and provides options for handling data from different species.
Seqtk: Seqtk is a toolkit for processing FASTQ files. It can perform various operations such as subsampling, trimming, and filtering reads based on quality scores. Seqtk is versatile and can handle reads from multiple species.
BBTools: BBTools is a suite of bioinformatics tools that includes several utilities for read preprocessing. It provides options for adapter trimming, read filtering, and quality control. BBTools is compatible with reads from different species.
Trimmomatic, Cutadapt, Fastp, Seqtk, and BBTools are command-line tools commonly used in bioinformatics. If you prefer a graphical user interface (GUI), you can consider tools like FastQC, which provides a comprehensive quality control analysis of sequencing data. FastQC is species-agnostic and can be used for analyzing reads from different organisms.
Remember that the choice of preprocessing tools may also depend on the specific requirements of your data, sequencing platform, and downstream analysis. It's always a good idea to explore documentation, tutorials, and user forums specific to the tools you choose to get the best results for your multi-species data.
BBTools is a relevant answer to my question.
Then, I got more specific:
Question-2, tinyBio:
Same question to ChatGPT:
What tool can I use to preprocess reads from xenograft samples?
When working with xenograft samples, which involve grafting tissues or cells from one species into another, you may encounter some challenges during read preprocessing. One common issue is the presence of reads from both the host species and the grafted species. To address this, you can utilize tools that can differentiate between the two species and selectively preprocess the reads. Here are a few tools commonly used for preprocessing xenograft samples:
Xenome: Xenome is a specialized tool designed specifically for processing xenograft sequencing data. It uses a two-step approach to classify reads into host and graft species based on alignment against reference genomes. Xenome allows you to separate the two species and perform subsequent preprocessing steps independently.
HISAT2 + StringTie: HISAT2 is a fast and accurate aligner that can align reads to both host and graft genomes simultaneously. By using HISAT2 followed by StringTie, you can identify and quantify gene expression from both species separately. This approach can be helpful when analyzing xenograft RNA-seq data.
Bowtie2 + Samtools: Bowtie2 is a popular aligner that can handle reads from multiple species. By aligning the reads to a combined reference genome that includes both host and graft sequences, you can separate the reads based on their alignment positions. Samtools can then be used to extract reads mapped to specific regions, allowing you to preprocess reads from the desired species.
Subread: Subread is a versatile read aligner that can handle reads from multiple species. It supports unique alignment modes and allows you to specify the desired reference genome for each species. Subread can be used to preprocess reads from xenograft samples by aligning them separately to the host and graft genomes.
These tools provide specific functionalities to address the unique challenges posed by xenograft samples during read preprocessing. Depending on your specific requirements, such as the sequencing technology used and the type of analysis you wish to perform, you can choose the tool that best suits your needs.
Xenome is the exact tool I use.
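As an aside on the "Bowtie2 + Samtools" idea in the ChatGPT answer above (align to a combined reference, then separate reads by where they map), here is a minimal sketch of the separation step. It assumes the combined reference was built with contig names prefixed per species; the prefixes `mouse_` and `human_` and the function names are hypothetical, chosen only for illustration. It parses plain SAM text, tallies primary alignments by species, and is not a substitute for a dedicated classifier such as Xenome.

```python
from collections import Counter

# Hypothetical prefixes: assumes the combined reference was built with
# contigs renamed like "mouse_chr1" (host) and "human_chr1" (graft).
HOST_PREFIX = "mouse_"
GRAFT_PREFIX = "human_"


def classify_sam_line(line, host_prefix=HOST_PREFIX, graft_prefix=GRAFT_PREFIX):
    """Classify one SAM alignment line as host, graft, unmapped, or other."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    rname = fields[2]       # column 3: reference sequence name
    if flag & 0x4 or rname == "*":
        return "unmapped"
    if rname.startswith(host_prefix):
        return "host"
    if rname.startswith(graft_prefix):
        return "graft"
    return "other"


def count_by_species(sam_lines, **prefixes):
    """Tally primary alignments per category, skipping header lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or SAM header
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) alignment
        counts[classify_sam_line(line, **prefixes)] += 1
    return counts


example = [
    "@SQ\tSN:human_chr1\tLN:1000",
    "r1\t0\thuman_chr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tmouse_chr2\t200\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tmouse_chr1\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
species_counts = count_by_species(example)
```

In practice you would likely also filter on mapping quality (SAM column 5), since reads from regions conserved between the two species can map ambiguously.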
I tested it again today, on a bit of a meta level. I asked it for its opinion of biostars. It assumed I had provided some random context, went off on a detour about that context, and then came back with a summary statement about biostars.
What do you think of biostars
Biostars is a great platform for bioinformaticians and researchers to ask questions, share knowledge, and discuss various topics related to bioinformatics. It is a valuable resource for both beginners and experienced professionals in the field.
Regarding the context you provided, it seems to cover a wide range of tools and resources in bioinformatics, such as HugeSeq, GCAT, CloudBioLinux, bcbio, and JBrowse. These tools are essential for various tasks in bioinformatics, such as genome comparison, data analysis, and visualization.
For example, if you are working with HugeSeq, a tool for identifying genomic variants, you might use it in combination with other tools like bcbio for processing and analyzing NGS data. Here's a simple example of how you might use HugeSeq:
Download and install HugeSeq
git clone https://github.com/StanfordBioinformatics/HugeSeq.git
cd HugeSeq
Run HugeSeq on your input data
./hugeSeq.sh -i input_data.bam -o output_directory
Similarly, you can use JBrowse for genome visualization and exploration. Here's a basic example of setting up JBrowse:
Download and install JBrowse
git clone https://github.com/GMOD/jbrowse.git
cd jbrowse
./setup.sh
Prepare your data for JBrowse
bin/prepare-refseqs.pl --fasta your_genome.fasta
Add a track for your annotations
bin/flatfile-to-json.pl --gff your_annotations.gff --trackLabel "My Annotations"
Open JBrowse in your web browser
firefox index.html
In conclusion, Biostars is a helpful platform for discussing and learning about various bioinformatics tools and resources. The context you provided covers a wide range of essential tools in the field, which can be beneficial for researchers and bioinformaticians.
I opened the URL and asked the question. I have no idea where it assumed some random context from.
Update- June 28th, 2023 - We've added sources and chat history for our chat service!
Now you should be able to see the documentation that was used to inform the answer that you are seeing. This should help cut down on the amount of hallucination and unverifiable information produced.
Thank you for this work! I tested it with a question about somatic variant detection and the proposed tools were not very relevant (GATK SomaticIndelDetector is deprecated). ChatGPT still has the upper hand for this one.
I have tried it out for a few days now, mainly by asking conceptual questions about biology and bioinformatics. I believe the content generated by the model is reliable and largely overlaps with the scientific literature out there. Not sure if someone has mentioned this before, but I did notice one issue that I have repeatedly come across. Sometimes when I ask a second question that is completely different from the first, the LLM repeats its answer to my first question, which is not relevant to the second. That being said, it's not a big issue, because re-entering the second question solves it.
Also, in what situations does the LLM provide references for its answers? Sometimes there is a reference and other times there isn't.
Update- January 10th, 2024 - We've added the ability to chat with figures and graphs!
Now you should be able to copy and paste a screenshot of a figure you are trying to understand or tweak and get results back based on what you need.
Update: March 12th, 2024 - we've added the ability to execute code via chat!
You can now debug, create new graphs, and analyze data all through chat. Here is a differential expression analysis for which tinybio wrote and ran the code:
You can try it here: ai.tinybio.cloud/chat
Hey guys -
I wanted to announce some pretty large updates for chat:
Check it out here: ai.tinybio.cloud
Thanks!
Wait, tinybio seems more self-aware than I thought:
You: Why did you give me results pertaining to sarek when I never mentioned that tool?
tinybio: I apologize for the confusion. It appears there was a misunderstanding in my attempt to provide you with information. I should have conducted a search specifically for the documentation of "Xenome" rather than providing information about an unrelated tool. Let me correct that by searching for the documentation on "Xenome" in the Github document store. Please give me a moment to perform the search. I apologize for the oversight. The search results returned documents related to "Sarek" instead of "Xenome." This could be due to a limitation in the search function or an absence of "Xenome" related documents in the Github document store that I have access to.
To provide you with an accurate explanation of how "Xenome" works, I will rely on general knowledge about the tool, as specific documentation from the Github document store is not available at this time.
..
..
Still gives me the sarek links though.
How is it trained on PubMed/papers? One major drawback of ChatGPT is that it wildly makes up papers that do not even exist but with great confidence.
Hoping to add functionality around this. Stay tuned. Big problem for our space that needs to be resolved.
this is not a problem with the algorithm.
this is, by design, the nature of how the algorithm is intended to work.
the mistake is on the part of the user, who uses it as such. this point must be stressed over and over until everyone understands.
Can you elaborate on how you trained the bot specifically on bioinformatics content? Where is the content coming from?
Hi GenoMax - it's primed on Github repos, some personality tuning, and some prompt assumptions based on the context of the site being bioinformatics/research focused. Being vertical-focused makes some things easier like knowing that when you're talking about Picard, you're not talking about StarTrek.
Since you have kept the bot openly accessible (at least for now) this is great. People can try it out and hopefully provide feedback. I tried a recent biostars question out and the answer your bot generated seemed reasonable and included code.
What's your business model?
hi Jesse - we're hoping to build custom lab specific private experiences for folks. You can check it out here and reach out if you're interested in setting something up! https://www.tinybio.cloud/paid-offerings/bioinformatics-trained-large-language-models
Great initiative :D
I have tried to use the chat tool on a couple of occasions via the web interface... and I get a client-side exception, without fail. It worked for me fairly well in the Discord as a bot, I wish you'd make that available again.
Maybe something on your network or browser is interfering with the tool?
Hi sorry to cut to the quick, I signed in for a trial, but how much is it after that? I can’t see that info. Thanks