Question

Regarding using a "pseudo reference genome" for aligning short reads

0


21 months ago

pixie@bioinfo ★ 1.5k

Hello, my lab works on variety identification in plants. We usually have over 200 varieties, which we barcode for a genotyping-by-sequencing run. This is essentially a RADseq method, i.e. a reduced representation of the genome rather than whole-genome sequencing. For a plant species that is quite rare and has no reference genome, we tried the program Stacks with a de novo approach. What we saw is that RADseq with a de novo method was a total no-go for us. Is there a "hybrid approach" where we do WGS or long-read sequencing on just one variety, use that as a "pseudo-reference" to align the stacks from the rest of the varieties, and then call SNPs between the stacks, independent of the pseudo-reference?

Thanks

genomics • 1.4k views

18 months ago by pixie@bioinfo ★ 1.5k

0


It sounds reasonable to me to do WGS on one variety and make a full assembly, then call variants on the rest by aligning them to that assembly (which at that point I would call a "reference" rather than a "pseudo-reference"). Plant assembly can be difficult, particularly with high ploidy, so don't expect a great assembly... but I would consider that the best approach.
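One way to read the "independent of the reference" requirement from the question: after calling variants against the assembly, keep only the sites where the varieties differ from each other, and drop sites where every variety carries the same non-reference allele. The sketch below is a toy illustration of that filter, not a replacement for a real variant caller. The function name, the site keys, the sample names, and the one-allele-per-sample representation are all assumptions made for the example; real calls would be diploid or polyploid genotypes read from a VCF.

```python
def between_variety_sites(genotypes):
    """Keep sites where at least two distinct alleles occur among the samples.

    genotypes: dict mapping a site key, here (contig, position), to a dict of
    sample name -> called allele, with None for a missing call.
    (Hypothetical, simplified layout: one allele per sample.)
    """
    informative = {}
    for site, calls in genotypes.items():
        # Ignore missing calls; the reference allele itself plays no role here.
        observed = {allele for allele in calls.values() if allele is not None}
        if len(observed) >= 2:
            informative[site] = calls
    return informative


# Toy data: sample and contig names are made up for illustration.
genotypes = {
    # All varieties agree (even if they all differ from the assembly): dropped.
    ("contig_1", 120): {"var_A": "G", "var_B": "G", "var_C": "G"},
    # Two alleles among the called samples: kept.
    ("contig_1", 342): {"var_A": "C", "var_B": "T", "var_C": None},
    # One variety differs from the other two: kept.
    ("contig_7", 58): {"var_A": "A", "var_B": "A", "var_C": "T"},
}

for site in between_variety_sites(genotypes):
    print(site)
```

The point of the filter is that the sequenced variety only provides coordinates; whether a site is interesting is decided by comparing the varieties with one another.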

Once you have an assembly, you can also make it more general by creating a consensus using all the alignments from all libraries. This will have the advantage of reducing the size of your VCF files since minor alleles in your assembly will be replaced by major alleles (in the sequenced areas).
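The consensus idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production method: it handles single-base substitutions only (no indels), it takes the pooled base observations per position as a plain dictionary rather than parsing real alignments, and the function name and the `min_depth` threshold are hypothetical choices for the example. In practice an established tool that works from the alignments or a VCF would be used.

```python
from collections import Counter


def major_allele_consensus(reference, pileup, min_depth=5):
    """Replace assembly bases with the most frequently observed base.

    reference: assembly sequence as a string.
    pileup: dict mapping 0-based position -> string of bases observed at that
            position, pooled across all libraries (hypothetical input format).
    min_depth: positions with fewer observations keep the assembly base.
    """
    consensus = list(reference)
    for pos, bases in pileup.items():
        counts = Counter(b for b in bases.upper() if b in "ACGT")
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            continue  # unsequenced or weakly covered: keep the assembly base
        # On an exact tie, Counter returns the base seen first in the input.
        major_base, _ = counts.most_common(1)[0]
        consensus[pos] = major_base
    return "".join(consensus)


reference = "ACGTACGT"
pileup = {
    1: "CCCCCC",  # agrees with the assembly
    3: "GGGGGT",  # assembly carries the minor allele T; G is the major allele
    6: "AA",      # below min_depth, so the assembly base is kept
}
print(major_allele_consensus(reference, pileup))  # ACGGACGT
```

Positions outside the sequenced (RAD) regions have no observations and are left exactly as assembled, which matches the "in the sequenced areas" caveat above.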

21 months ago by Brian Bushnell 20k

score 2 · Accepted Answer · 2023-11-22

2


21 months ago

colindaven 7.8k

Yes, I agree with Brian above that a full long-read assembly is the way to go.

Smaller plant assemblies are not all that challenging these days with great tools like Shasta and Flye (ONT) or hifiasm (PacBio), especially if the genome is <1 Gb in size.

I think higher-ploidy and large (>3 Gb) assemblies are mostly doable with Hi-C as well. For your use case you'll probably be happy with recent ONT R10.4.1 or PacBio Revio data, which should get you a high-quality contig assembly.

PacBio is the easiest route to a high-quality assembly if your lab doesn't have much assembly experience. However, ONT provides longer reads, which may help with contiguity if the genome has many repeats.


0


Thanks, both of you; I have conveyed this to my lab. We are now going to contact vendors for PacBio for our model plant, tomato. We do have an in-house MinION, but that's used for viral transcriptomics and not really for genome assembly. What kind of coverage should we target for mapping the short reads?

18 months ago by pixie@bioinfo ★ 1.5k