Question

SNP Pipeline Help

0

Entering edit mode

2.3 years ago

rimo ▴ 10

Hello! I'm a new Bioinformatic scientist working for a yeast genetics company and they recently proposed a project they would like me to work on and I was hoping I could get some help from this community. I'm the only one in the company that has any bioinformatic experience so any guidance, tips, tools, etc. that you can lend will be appreciated!

My first task is to create a database of yeast genomes and identify possible variations/mutations in the data; specifically looking to identify possible SNPs in specific genes.

I started out by writing code to download all publicly available yeast genomes from NCBI and creating a local BLAST database with those sequences. In my pipeline I then hope to incorporate FastQC, Trimmomatic, and BWA to check the quality of the downloaded reads, trim off any unnecessary information, and to align them. From there, I think I'd like to use the GATK pipeline to identify potential variations.

Most of the data I have from NCBI comes from Illumina sequencing while the data we have sequenced has come from PacBio. I don't think this is too much of an issue - but I wanted to make sure there wasn't anything else I should consider with one or the other. I also don't plan on assembling any of the reads I have as I'm just looking to comb through to find variations.

This is a long way of asking if this seems like a reasonable plan and if anyone out there sees anything glaringly obvious that I should avoid or am missing from this potential setup. It would be nice to have some guidance as I'm kind of alone in this work now but I'm hoping this helps that! Thank you!

python SNP pipeline yeast genetics • 2.1k views

ADD COMMENT • link 2.2 years ago by rimo ▴ 10

score 1 · Answer 1 · 2022-08-31

1

Entering edit mode

2.3 years ago

Istvan Albert 102k

I would recommend a simpler to run variant caller, take bcftools for example, it is easier to run. Much of GATKs features were developed for studying human genetic variation.

There is also no need to start with generating a BLAST database - even in your proposed plan blast does not seem to be used.

The main challenge is always filtering the variants and deciding what to trust within the VCF files. So get to the multi-VCF files then learn about filtering those files, again bcftools, vcftools, vcflib, and others.

ADD COMMENT • link 2.3 years ago by Istvan Albert 102k

0

Entering edit mode

Thank you! I've started looking over the bcftools manual page and it seems way simpler than GATKs pipeline.

Eventually I want to be able to access all publicly available yeast genomes, identify SNPs in certain genes, and BLAST them against each other to identify what specific yeast strain it was. I figured setting up a BLAST database would help with that down the line. Would you instead recommend that I just download the reads, run a quality control on them (i.e., FastQC), trim them, align them, run them through something like bcftools, and then do a separate BLAST on whatever SNPs I find?

ADD REPLY • link 2.3 years ago by rimo ▴ 10

1

Entering edit mode

basically, the reason people are a bit puzzled by the choice of BLAST is that it would not be the appropriate tool to employ at the start to identify variation. BLAST is a local aligner, after all.

Later you may find a use for BLAST for other needs, and that is fine. We are just trying to answer the main question here.

ADD REPLY • link 2.3 years ago by Istvan Albert 102k

1

Entering edit mode

In other words - use blast for comparing assemblies (i.e. your contigs), not for read alignment or comparison.

ADD REPLY • link 2.2 years ago by colindaven 7.0k

score 1 · Answer 2 · 2022-09-01

1

Entering edit mode

2.3 years ago

colindaven 7.0k

I recommend freebayes as a good, stable variant caller. It needs post calling VCF quality filtering, but is a lot simpler and faster than GATK.

Add multiple bams to the freebayes command lines (eg all the 20+ you downloaded from NCBI) and you'll get a big VCF with genotypes of each sample (bam) across each position.

Remember to add read group info to each bam when aligning though to make this possible.

Blast isn't a good pipeline tool, but is great for your colleagues to do ad hoc checks. Check out https://github.com/wurmlab/sequenceserver for this.

You may need a different variant caller for Pacbio reads. Pacbio is generally excellent for assembly, but at least some time ago was less good for SNV calling.

ADD COMMENT • link 2.3 years ago by colindaven 7.0k

0

Entering edit mode

Thank you for your response! I started out looking at freebayes but I couldn't really figure it out for my data.

What makes BLAST a bad pipeline tool? I'm just curious - I've used BLAST commands in my pipelines in college but if there is a better/more efficient tool to use that would be great.

What is the main difference between PacBio and Illumina? I know Illumina results in shorter contigs and PacBio produces long reads after sequencing but is one better than the other? Especially for this sort of application -- mainly variant calling?

ADD REPLY • link 2.3 years ago by rimo ▴ 10

2

Entering edit mode

In the last years PacBio has improved and is able to produce HiFi reads (high fidelity) reads. If your sequencing is done with HiFi reads then your PacBio reads will be both longer and more accurate than Illumina reads.

You don't even need a traditional variant caller, a single read alignment can give you directly the differences.

When using HiFi reads the only potential downside of PacBio would be when sequencing pooled populations, the much lower coverage would limit what can be detected.

But if your data is the old-style PacBio - those are long and error-prone reads, then you'd have a hard time calling any variants from that data. In that case, Illumina data will be far superior.

ADD REPLY • link 2.3 years ago by Istvan Albert 102k

0

Entering edit mode

We were able to use the HiFi reads for PacBio! So in that case I can just align them with whatever tool I decide to implement (i.e., minimap2, Bowtie2, BWA, etc.) and be able to directly see where the differences lie? In that case what would be the best way to visualize that data? I have a subscription to GeneiousPrime (which I know also has a SNP variant workflow but I would like to use free software in the pipeline I design) so I know I can visually see those alignments with that software but are there any others that would work just as well?

ADD REPLY • link 2.2 years ago by rimo ▴ 10

1

Entering edit mode

Freebayes is probably the easiest SNP caller out there - from their github page:

#single sample
freebayes -f ref.fa aln.bam >var.vcf

#multisample for a nice big comparative "table" vcf with 3 genotypes
freebayes -f ref.fa first.bam second.bam third.bam >var.vcf

ADD REPLY • link 2.2 years ago by colindaven 7.0k

1

Entering edit mode

Oh wow that is super simple thank you! I'm going to look into that again.

I'll probably end up using either freebayes or bcftools for SNP calling with the Illumina data then.

Do you know of any visualization tools that would help with analyzing the final result? I've seen a lot of people recommend IGV to analyze SNP differences but I wasn't sure if there was something better out there.

ADD REPLY • link 2.2 years ago by rimo ▴ 10