genome finishing
6 months ago
trezini • 0

I'm new to bioinformatics, so I apologize in advance. I have a genome assembly pipeline whose output is a de novo SPAdes assembly, contigs.fasta, as well as a variant file, vcf.fasta. I think I need to take the variant information from vcf.fasta and apply those changes to contigs.fasta. Is that correct? If so, what is the best way to do it? I need to somehow combine the information from vcf.fasta and contigs.fasta to get a consensus sequence.

finishing genome • 477 views

I may be wrong here, so someone feel free to correct me, but I have never noticed a variants file among the SPAdes outputs, and have certainly never used one. In any case, contigs.fasta should already contain the 'best' call for each position, so any consensus you build from it will be of lower overall quality.

Why do you think you need a consensus sequence?

6 months ago
nd48 ▴ 30

What is the input? Do you have short reads (e.g. Illumina 100 bp paired-end reads)? As already mentioned, contigs.fasta is already the consensus: it is the final output, and it is the file I use from SPAdes or other assemblers. I imagine the VCF refers to potentially heterozygous sites in your assembly.

If you are interested in error-correcting the assembly further, I would suggest Pilon [https://github.com/broadinstitute/pilon]. I use Pilon when my de novo assembly is the product of short + long reads. I typically assemble the long reads first (without the short reads), map the short reads back to the long-read assembly, and then use the resulting BAM file as input to Pilon to error-correct the assembly. Finally, I run QUAST and BUSCO on the error-corrected assembly to assess its quality.
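To make the mapping and polishing steps concrete, here is a minimal sketch that only builds and prints the command lines; it does not run anything. It assumes bwa and samtools for the short-read mapping (one common choice, not necessarily what was used above) and starts from a long-read assembly that already exists. All file names, the `pilon.jar` location, and the thread count are placeholders.

```python
import shlex


def polishing_commands(assembly, reads_1, reads_2, prefix="polished",
                       threads=4, pilon_jar="pilon.jar"):
    """Return shell command lines for one round of short-read polishing.

    Every path here is a placeholder (assumption); adjust to your setup.
    """
    q = shlex.quote
    bam = f"{prefix}.sorted.bam"
    return [
        # Index the long-read assembly so bwa can align against it.
        f"bwa index {q(assembly)}",
        # Align the paired short reads and write a coordinate-sorted BAM.
        f"bwa mem -t {threads} {q(assembly)} {q(reads_1)} {q(reads_2)} "
        f"| samtools sort -@ {threads} -o {q(bam)} -",
        # Pilon expects the BAM to be indexed.
        f"samtools index {q(bam)}",
        # Polish the assembly; --changes writes a list of the edits made.
        f"java -jar {q(pilon_jar)} --genome {q(assembly)} --frags {q(bam)} "
        f"--output {q(prefix)} --changes",
    ]


# Print the commands for review before running them by hand.
for cmd in polishing_commands("long_read_assembly.fasta",
                              "reads_R1.fastq.gz", "reads_R2.fastq.gz"):
    print(cmd)
```

Printing the commands first makes it easy to check paths before committing compute time. For larger genomes Pilon usually needs more Java heap, which would be set on the `java` call.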

There are many steps you can perform if you want to produce the highest-quality assembly possible from the data you have, but for me to suggest a better approach, please let me know:

  1. Your estimated genome size (prokaryote or eukaryote).
  2. Your input data types (short reads, long reads, length, paired/single, number of reads and/or coverage).
  3. Do you also need to estimate heterozygosity?
  4. Do you know the estimated repeat content of your genome?
  5. Are you suspecting contamination?
  6. What is your end goal, i.e. what would you consider "good enough" and what subsequent analysis are you planning to do once you have your assembly?
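If you only know the read count and read length, the coverage asked about in point 2 can be roughly estimated as total sequenced bases divided by genome size. A minimal sketch, with purely hypothetical numbers (for paired-end data, count both mates as reads):

```python
def estimated_coverage(num_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size (all in bases)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical example: 2,000,000 reads of 100 bp on a 5 Mb genome.
depth = estimated_coverage(2_000_000, 100, 5_000_000)
print(f"{depth:.0f}x")  # prints "40x"
```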

All the best.


May I please know which software you used to "assemble the long reads first (without the short reads), map the short reads back to the long read assembly"?

Thank you for your help!
