Question

genome finishing

6 months ago

trezini • 0

I'm new to bioinformatics, so I apologize in advance. I have a genome assembly pipeline whose output is a SPAdes de novo assembly file, contigs.fasta, as well as a variant file, vcf.fasta. I think I need to take the variant information from vcf.fasta and apply those changes to contigs.fasta. Is that correct? If so, what is the best way to do it? I need to somehow combine the information from vcf.fasta and contigs.fasta to get a consensus sequence.

finishing genome • 477 views

ADD COMMENT • link updated 11 days ago by sihan.bu • 0 • written 6 months ago by trezini • 0

I may be wrong here, so someone feel free to correct me, but I've never noticed a variants file from SPAdes, and have certainly never used one if it exists. The contigs.fasta should already contain the 'best' call for each position, so any consensus you build on top of it will be of lower overall quality.

Why do you think you need a consensus sequence?

ADD REPLY • link 6 months ago by Joe 21k

score 2 · Accepted Answer · 2024-05-15

What is the input? Do you have short reads (e.g. Illumina 100 bp paired reads)? As mentioned already, you already have the consensus, i.e. contigs.fasta, which is the final output (that's the one I use from SPAdes or other assemblers). I imagine the VCF refers to potentially heterozygous sites in your assembly.

Now, if you are interested in error-correcting the assembly further, I would suggest Pilon [https://github.com/broadinstitute/pilon]. I use Pilon when my de novo assembly is the product of short + long reads. I typically assemble the long reads first (without the short reads), map the short reads back to the long-read assembly, and then use the mapped BAM file as input to Pilon to error-correct the assembly. Finally, I run QUAST and BUSCO on the error-corrected assembly to assess its quality.
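The map-then-polish steps described above can be sketched roughly as follows. This is only a minimal sketch, not my exact pipeline: it assumes paired-end short reads, that `bwa`, `samtools` and a `pilon` wrapper script are on your PATH (with a plain jar you would call `java -jar pilon.jar` instead), and the file names and the `polishing_commands` helper are made up for illustration. It only prints the commands unless you turn off `dry_run`.

```python
import shlex
import subprocess


def polishing_commands(assembly, reads_1, reads_2, prefix="polished", threads=4):
    """Build the shell commands for one round of short-read polishing.

    Hypothetical helper: the argument names and file layout are assumptions.
    """
    q = shlex.quote
    bam = f"{prefix}.sorted.bam"
    return [
        # Index the assembly so bwa can map against it.
        f"bwa index {q(assembly)}",
        # Map the short reads back to the assembly and sort the alignments.
        f"bwa mem -t {threads} {q(assembly)} {q(reads_1)} {q(reads_2)}"
        f" | samtools sort -@ {threads} -o {q(bam)} -",
        # Pilon needs an indexed BAM.
        f"samtools index {q(bam)}",
        # Polish; writes <prefix>.fasta and, with --changes, <prefix>.changes.
        f"pilon --genome {q(assembly)} --frags {q(bam)} --output {q(prefix)} --changes",
    ]


def run(commands, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for cmd in commands:
        print(cmd)
        if not dry_run:
            # pipefail makes a failure in bwa mem abort the run,
            # not only a failure in samtools sort.
            subprocess.run(["bash", "-o", "pipefail", "-c", cmd], check=True)


# Dry run with placeholder file names: nothing is executed, only printed.
run(polishing_commands("contigs.fasta", "reads_R1.fastq.gz", "reads_R2.fastq.gz"))
```

If you do run it for real, have a look at the `.changes` file afterwards: if Pilon barely changed anything, a second round is probably not worth it.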

There are many steps you can perform if you are looking to produce the highest-quality assembly your data allows, but for me to suggest a better approach, let me know:

Your estimated genome size (prokaryote or eukaryote).
Your input data types (short reads, long reads, length, paired/single, number of reads and/or coverage).
Do you also need to estimate heterozygosity?
Do you know the estimated repeat content of your genome?
Are you suspecting contamination?
What is your end goal, i.e. what would you consider "good enough" and what subsequent analysis are you planning to do once you have your assembly?
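If you are not sure about the coverage asked for above, a rough estimate is simply total sequenced bases divided by genome size. A minimal sketch, with made-up numbers (the read count, read length and genome size below are placeholders, not a guess about your data):

```python
def estimated_coverage(n_reads, read_length, genome_size):
    """Rough average depth: total sequenced bases / genome size.

    n_reads counts every read, i.e. both mates of each pair.
    """
    return n_reads * read_length / genome_size


# Placeholder example: 4 million 100 bp reads on a 5 Mb genome.
print(estimated_coverage(4_000_000, 100, 5_000_000))  # 80.0
```

This ignores adapter/quality trimming and unmapped reads, so treat it as an upper bound on the depth you will actually see in the BAM.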

All the best.