Short Read Data Genome Assembly
14 months ago
Umer ▴ 130

Hello,

I have recently started working on the genome assembly of a fungus. I have Illumina paired-end short-read data (2x150 bp) taken from NCBI. Based on this data, I am trying to set up a pipeline for genome assembly, which can later be used for our upcoming sequencing data.

Going through multiple papers and tutorials, I made this workflow.

  1. FastQC data check and data trimming (if needed)
  2. De novo genome assembly using SPAdes (as no reference genome is available) -> contigs.fasta
  3. Quality check of contigs.fasta with QUAST and BUSCO
  4. RepeatMasking and RepeatModeling
  5. Annotation of assembly
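
Steps 1 to 3 of the workflow above could be scripted roughly as follows. This is only a minimal sketch: the file names, the output layout, the thread count and the `fungi_odb10` BUSCO lineage are assumptions chosen for illustration, and the helper `assembly_commands` is hypothetical. It only builds the command lines, so they can be inspected before anything is run.

```python
from pathlib import Path


def assembly_commands(r1, r2, outdir, threads=8, busco_lineage="fungi_odb10"):
    """Build command lines for FastQC, SPAdes, QUAST and BUSCO.

    All paths and the lineage name are illustrative assumptions.
    Nothing is executed here; the caller decides when to run each command.
    """
    out = Path(outdir)
    spades_dir = out / "spades"
    contigs = spades_dir / "contigs.fasta"  # written by SPAdes on success
    return [
        # Read quality check (FastQC expects the output directory to exist)
        ["fastqc", "-o", str(out / "fastqc"), r1, r2],
        # De novo assembly from one paired-end library
        ["spades.py", "-1", r1, "-2", r2, "-t", str(threads), "-o", str(spades_dir)],
        # Contiguity statistics (N50, number of contigs, total length)
        ["quast.py", str(contigs), "-o", str(out / "quast")],
        # Gene-space completeness against a lineage dataset
        ["busco", "-i", str(contigs), "-l", busco_lineage, "-m", "genome",
         "-o", "busco_contigs", "-c", str(threads)],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, or printed with `shlex.join(cmd)` to review the commands first.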

As every tutorial just ends at these steps, my queries are:

  1. SPAdes gave me a contigs.fasta file. Is there any method to make scaffolds from this (contigs.fasta) file? Can this be done based on just the Illumina short-read data?
  2. Is it necessary to turn contigs -> scaffolds if only short-read data is available, or can contigs.fasta be used for further processing?
  3. Are RepeatMasking and RepeatModeling two different steps or one?
  4. Is there any other analysis that should be done?

If you think these are naive questions, just know that I am new to genome assemblies. I am learning and trying to understand the steps which most of the tutorials/publications don't mention.

spades genome-assembly • 2.6k views
14 months ago
alex.zaccaron ▴ 470
  1. SPAdes also outputs a scaffolds.fasta, which has some contigs arranged into scaffolds. You can use this file for downstream analyses. In general, scaffolding with only short reads does not give big improvements.
  2. Not necessary.
  3. Not familiar with RepeatModeling, but they should refer to the same step of masking repeats in the genome. For novel species, you will need to identify repeats de novo. RepeatModeler is a good tool for this, but there are other options, like EarlGrey and EDTA.
  4. Depends on what you want to do. Usually, the next steps involve gene prediction and annotation.
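
For point 3, de novo repeat identification with RepeatModeler and the masking itself with RepeatMasker are usually run as separate commands: first a custom repeat library is built, then that library is used to mask the assembly. A minimal sketch, assuming the database name `fungus_db`, the output directory and the helper name `repeat_commands` (all hypothetical):

```python
def repeat_commands(genome, db_name="fungus_db", outdir="repeatmasker_out"):
    """Build command lines for de novo repeat modeling followed by masking.

    The database name and output directory are illustrative assumptions.
    """
    # RepeatModeler writes its consensus library as <db_name>-families.fa
    library = f"{db_name}-families.fa"
    return [
        # 1. Index the assembly for RepeatModeler
        ["BuildDatabase", "-name", db_name, genome],
        # 2. Discover repeat families de novo
        ["RepeatModeler", "-database", db_name],
        # 3. Soft-mask the assembly (lowercase repeats) with the custom library
        ["RepeatMasker", "-lib", library, "-xsmall", "-dir", outdir, genome],
    ]
```

Soft-masking (`-xsmall`) is shown here because gene predictors generally expect lowercase-masked rather than N-masked input; whether that fits a given annotation tool should be checked in its documentation.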

This should be the answer


Thank you. I did get a scaffolds.fasta file, but the number of scaffolds was just 2/3 lower than the number of contigs.
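
One quick way to see how much scaffolding changed is to compare the sequence count and N50 of the two files. A minimal sketch in plain Python (the function names are hypothetical, and QUAST reports the same numbers):

```python
def fasta_lengths(lines):
    """Return the length of every sequence in a FASTA given as an iterable of lines."""
    lengths = []
    current = None  # None until the first header is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

Usage would look like `with open("contigs.fasta") as fh: lengths = fasta_lengths(fh)`, then comparing `len(lengths)` and `n50(lengths)` against the same values for scaffolds.fasta.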

For the last point (4): my initial objective is to compare genome assemblies within species. What I'm planning to do is to create:

  1. Short read assembly (some Samples)
  2. Long Read Assembly (some Samples)
  3. Hybrid assembly (some Samples)
14 months ago
ccstaats ▴ 40

I would suggest using Funannotate. It is pretty straightforward and does all the work of predicting genes and annotating them. In an earlier step, the pipeline can also repeat-mask your genome assembly.

For scaffolding, SPAdes does in fact arrange some contigs into scaffolds. But if you have a reference genome or an assembly from a phylogenetically close organism, consider using RagTag, which is also very useful. Best, Charley


Yes, as I'm working with a fungal genome, I found that Funannotate is a good way to go for annotation. So far I am following this path:

contigs -> Clean (contigs >= 500bp) -> sort (big to small) -> Mask -> train -> Predict -> update -> Annotate
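
The first two stages of this path (drop contigs shorter than 500 bp, then order from largest to smallest) can be illustrated in plain Python. This is only a sketch of the idea, not Funannotate's own implementation, which provides `funannotate clean` and `funannotate sort` for this purpose; the function names below are hypothetical.

```python
def parse_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def clean_and_sort(records, min_len=500):
    """Keep sequences of at least min_len bases, ordered longest first."""
    kept = [(h, s) for h, s in records if len(s) >= min_len]
    return sorted(kept, key=lambda rec: len(rec[1]), reverse=True)
```

For example, `clean_and_sort(parse_fasta(open("contigs.fasta")))` would return the contigs of 500 bp or more, largest first.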

As far as I understand, the train step requires a transcript assembly created from RNA-seq data. Correct me if I am wrong here.


You don't need to train if you have a phylogenetically close organism in the Funannotate DB. Please take a look at this tutorial
