Hello! I'm a new bioinformatics scientist at a yeast genetics company. They recently proposed a project for me to work on, and I was hoping to get some help from this community. I'm the only one in the company with any bioinformatics experience, so any guidance, tips, or tools you can share would be appreciated!
My first task is to create a database of yeast genomes and identify variation in the data, specifically SNPs in particular genes.
I started out by writing code to download all publicly available yeast genomes from NCBI and building a local BLAST database from those sequences. Next, I hope to add FastQC, Trimmomatic, and BWA to the pipeline to check the quality of the downloaded reads, trim adapters and low-quality bases, and align the reads. From there, I think I'd like to use the GATK pipeline to identify potential variants.
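To make the plan concrete, here is a rough Python sketch of the per-sample steps I have in mind. It only builds the command lines and does not run anything. The function name, file names, output layout, and trimming thresholds are placeholders I made up for illustration, and I'm assuming `fastqc`, `trimmomatic` (the wrapper script that conda installs), `bwa`, `samtools`, and `gatk` are all on the PATH.

```python
import shlex


def build_short_read_commands(sample, ref, r1, r2, outdir="results", threads=4):
    """Return command lines (argv lists) for one paired-end Illumina sample.

    Nothing is executed here. All paths and thresholds are placeholders.
    """
    trimmed = {tag: f"{outdir}/{sample}_{tag}.fastq.gz" for tag in ("1P", "1U", "2P", "2U")}
    sam = f"{outdir}/{sample}.sam"
    bam = f"{outdir}/{sample}.sorted.bam"
    vcf = f"{outdir}/{sample}.vcf.gz"
    # GATK expects a read group on every read; bwa expands the literal "\t".
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    return [
        # Reference preparation: only needed once per reference, not per sample.
        ["bwa", "index", ref],
        ["samtools", "faidx", ref],
        ["gatk", "CreateSequenceDictionary", "-R", ref],
        # Quality report on the raw reads.
        ["fastqc", "-o", outdir, r1, r2],
        # Quality trimming (adapter clipping would also need an adapter FASTA).
        ["trimmomatic", "PE", "-threads", str(threads), r1, r2,
         trimmed["1P"], trimmed["1U"], trimmed["2P"], trimmed["2U"],
         "SLIDINGWINDOW:4:20", "MINLEN:36"],
        # Align the surviving pairs, then sort and index the alignments.
        ["bwa", "mem", "-t", str(threads), "-R", read_group, "-o", sam,
         ref, trimmed["1P"], trimmed["2P"]],
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Variant calling; ploidy is left at the tool default here.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", vcf],
    ]


if __name__ == "__main__":
    for cmd in build_short_read_commands("s1", "ref.fa", "s1_1.fastq.gz", "s1_2.fastq.gz"):
        print(shlex.join(cmd))
```

Each list could later be handed to `subprocess.run(cmd, check=True)` once the output directory exists. This only covers the Illumina data.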
Most of the data I have from NCBI comes from Illumina sequencing, while the data we have sequenced in-house comes from PacBio. I don't think this is too much of an issue, but I wanted to make sure there isn't anything I should consider for one platform or the other. I also don't plan on assembling any of the reads, as I'm just looking to comb through them for variation.
This is a long way of asking whether this seems like a reasonable plan, and whether anyone sees anything glaringly obvious that I should avoid or that is missing from this setup. It would be nice to have some guidance, as I'm more or less on my own with this work right now. Thank you!
Thank you! I've started looking over the bcftools manual page and it seems far simpler than the GATK pipeline.
Eventually I want to be able to access all publicly available yeast genomes, identify SNPs in certain genes, and BLAST them against each other to work out which specific yeast strain each one came from. I figured setting up a BLAST database would help with that down the line. Would you instead recommend that I just download the reads, run quality control on them (e.g., FastQC), trim them, align them, run them through something like bcftools, and then do a separate BLAST on whatever SNPs I find?
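To show what I mean by the step after variant calling, here is a minimal sketch of a filter that keeps only the SNPs falling inside the genes I care about. It reads plain VCF text lines, so it doesn't depend on any particular caller. The function name, the gene name `GENE_A`, its coordinates, and the VCF records are all made-up toy values, not real yeast annotations.

```python
def snps_in_genes(vcf_lines, genes):
    """Yield (gene, chrom, pos, ref, alt) for SNP records that fall inside a gene.

    genes maps a gene name to (chrom, start, end), 1-based and inclusive,
    which matches the coordinate convention of the VCF POS column.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        pos = int(pos)
        for allele in alt.split(","):  # multi-allelic sites list several ALTs
            # A SNP here means one reference base replaced by one other base.
            if len(ref) != 1 or len(allele) != 1 or allele in {".", "*"}:
                continue
            for gene, (g_chrom, start, end) in genes.items():
                if chrom == g_chrom and start <= pos <= end:
                    yield gene, chrom, pos, ref, allele


# Toy input: hypothetical gene interval and hand-written VCF records.
genes = {"GENE_A": ("chrIV", 100, 200)}
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrIV\t150\t.\tA\tG\t50\t.\t.",    # SNP inside GENE_A
    "chrIV\t160\t.\tAT\tA\t50\t.\t.",   # deletion, not a SNP
    "chrIV\t300\t.\tC\tT\t50\t.\t.",    # SNP outside the gene
    "chrII\t150\t.\tG\tC,T\t50\t.\t.",  # SNP on a different chromosome
]
hits = list(snps_in_genes(vcf, genes))
print(hits)  # [('GENE_A', 'chrIV', 150, 'A', 'G')]
```

For real data the gene intervals would come from an annotation file, and the file would need to be decompressed or read with a VCF library first.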
Basically, the reason people are a bit puzzled by the choice of BLAST is that it is not the appropriate tool to use at the start to identify variation. BLAST is a local aligner, after all.
Later you may find a use for BLAST for other needs, and that is fine. We are just trying to answer the main question here.
In other words: use BLAST for comparing assemblies (i.e. your contigs), not for read alignment or comparison.
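If you do get to the point of comparing contigs, a minimal Python sketch of that use could look like the following. It builds the two BLAST+ command lines (a nucleotide database from contigs, then a `blastn` search with tabular output) and parses the default 12-column `-outfmt 6` result to pick the best-scoring subject per query. The file names, function names, and the two result lines are hypothetical and only there for illustration.

```python
# Column order of BLAST+ tabular output (-outfmt 6) with default fields.
BLAST6_FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def build_blast_commands(contigs_fasta, query_fasta, db_name, out_tsv):
    """Return argv lists for makeblastdb and blastn; nothing is executed."""
    return [
        ["makeblastdb", "-in", contigs_fasta, "-dbtype", "nucl", "-out", db_name],
        ["blastn", "-query", query_fasta, "-db", db_name,
         "-outfmt", "6", "-evalue", "1e-10", "-out", out_tsv],
    ]


def best_hits(tabular_lines):
    """Map each query ID to its highest-bitscore hit (as a dict of fields)."""
    best = {}
    for line in tabular_lines:
        if not line.strip():
            continue
        hit = dict(zip(BLAST6_FIELDS, line.rstrip("\n").split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Hypothetical result lines, hand-written to show the parsing only.
example = [
    "geneA\tstrain1_contig7\t99.5\t1200\t6\t0\t1\t1200\t5000\t6199\t0.0\t2200",
    "geneA\tstrain2_contig3\t97.0\t1200\t36\t0\t1\t1200\t810\t2009\t0.0\t1800",
]
top = best_hits(example)
print(top["geneA"]["sseqid"])  # strain1_contig7
```

The best hit by bitscore is only a first pass at telling strains apart; closely related strains may need a look at the individual mismatches.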