Question

Resources to learn genome assembly workflow for small genomes (like viruses)

0

4.3 years ago

rezaeir75 ▴ 40

I have sequencing data from a few samples of a DNA virus. I'd like to learn how to assemble the short reads de novo, build scaffolds from the contigs, and then estimate the abundance of each strain in the data. I have heard that SPAdes is a good choice for very small genomes like these, and that BBMap is useful for contig statistics.

I am a complete newbie in everything related to de novo assembly, so I wonder if you could recommend some good courses, tutorials, or papers for learning all the steps, including the units, specialized vocabulary, analysis steps, and everything else.

I have some experience analyzing RNA-seq and small RNA-seq data. I learned both using free online resources; however, it is very hard to find even one post or paper that covers a full genome assembly workflow.

Note: I know that I should read the SPAdes and BBMap documentation, and I will. However, I'd first like to get a good idea of what the steps are, test them on some sample data, and then learn the individual tools. It is like saying that I need to learn RNA-seq analysis, not read the documentation of one particular aligner.

Cross-posted on Bioinformatics Stack Exchange.

Assembly genome sequencing • 3.6k views

ADD COMMENT • link updated 4.3 years ago by KH ▴ 90 • written 4.3 years ago by rezaeir75 ▴ 40

0

ARTIC Network: https://github.com/nf-core/viralrecon
Methods and tools used by the Nextstrain project: https://nextstrain.org/docs/getting-started/introduction#open-source-tools-for-the-community
Resources from CDC: https://github.com/CDCgov/SARS-CoV-2_Sequencing#bioinformatics

ADD REPLY • link 4.3 years ago by GenoMax 147k

0

Thank you so much for the links. The first one is very interesting. I'll go through their GitHub repo and their paper, and maybe even try it on some sample data.

ADD REPLY • link 4.3 years ago by rezaeir75 ▴ 40

0

Please don't cross post. It's annoying.

ADD REPLY • link 4.3 years ago by Ram 44k

0

Sorry. It is much harder to get an answer on bioinformatics sites than on Stack Overflow! I won't do it again.

ADD REPLY • link 4.3 years ago by rezaeir75 ▴ 40

0

There are fewer bioinformaticians (and fewer still with the exact domain knowledge) than CS folks, so yeah, it takes longer. Please have patience and keep trying to find answers by yourself in the meantime.

ADD REPLY • link 4.3 years ago by Ram 44k

0

I'd like to add this very nice post myself.

ADD REPLY • link 4.3 years ago by rezaeir75 ▴ 40

score 1 · Answer 1 · 2020-08-16

1

4.3 years ago

KH ▴ 90

If you want some info on the basics of genome assembly, here are some resources you could use:

Coursera has a course on Algorithms for DNA Sequencing, and in the second half they cover some of the basic algorithms behind genome assembly. While the algorithms used in current tools are more complicated than this, it gives you a good idea of the theory behind genome assembly and the problems biologists run into (e.g., repeats cause problems when using short reads). LINK: https://www.coursera.org/learn/dna-sequencing
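To make the point about repeats concrete, here is a minimal Python sketch of a toy de Bruijn graph, one of the basic ideas behind short-read assemblers. It is only an illustration: the genome string, read length, and k below are made up for the example, and real assemblers add error correction, coverage handling, and much more.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one successor (ambiguous extensions)."""
    return {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}


# Hypothetical toy genome: the repeat "GATTACA" occurs twice with different flanks.
genome = "CCGATTACATTGGGATTACAGGT"
read_len = 8  # made-up read length, one error-free read per start position
reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

graph = de_bruijn_edges(reads, k=5)
ambiguous = branching_nodes(graph)
print(ambiguous)  # {'TACA': ['ACAG', 'ACAT']}
```

The walk through the graph is unambiguous until it reaches the end of the repeat, where the node `TACA` has two possible continuations. A k-mer longer than the repeat would avoid the branch, but k cannot exceed the read length, which is why repeats are a particular problem for short reads.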

The Bioinformatics DotCa YouTube channel is also great for learning the fundamentals of many bioinformatic analyses. Here is their video on genome assembly

For a more hands-on tutorial, it looks like you can go through how to use a genome assembler called Velvet within Galaxy (word of warning: I've only glanced at this and I'm not sure how helpful it will be for you): https://galaxyproject.github.io/training-material/topics/assembly/tutorials/general-introduction/tutorial.html#get-the-data

Hope this helps.

ADD COMMENT • link 4.3 years ago by KH ▴ 90

0

Thanks a lot for your answer. I'll check the Coursera link, but I've previously seen some of their bioinformatics content and it is very basic; I'd rather have a combination of theory and hands-on practice. I'll definitely check out the Bioinformatics DotCa channel. Regarding the Galaxy tutorial, I'm comfortable with the shell, R, and Python, and I really don't like using a graphical interface because I find it limiting and boring.

ADD REPLY • link 4.3 years ago by rezaeir75 ▴ 40

0

While there's no denying the power of the command line, the BIRCH system gives you the best of both worlds: An intuitive object-oriented set of GUI applications that run hundreds of common bioinformatics programs, which can also be run at the command line when that is more convenient.

Assembling a genome consists of many steps, which means running different programs at the command line, written by authors with different ideas about how to specify things like pairs of read files. At each step (trimming, error correction, assembly, quality evaluation) you have to read the documentation and experiment a bit just to get this part of the command right. Multiply that by the large number of programs needed to get an assembly, and you'll be spending weeks fighting a new battle every day.
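As a small illustration of the quality evaluation step, the sketch below computes N50, a contiguity statistic that assembly evaluation tools commonly report. The post does not name a specific metric, so N50 is my choice of example, and the contig lengths are hypothetical.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths in base pairs, for illustration only.
contig_lengths = [8000, 5000, 3000, 2000, 1000, 1000]
print(n50(contig_lengths))  # 5000
```

Here the two longest contigs (8000 + 5000 bp) already cover more than half of the 20000 assembled bases, so N50 is 5000. For a small viral genome, an N50 close to the expected genome length suggests the assembly came out in one or a few large pieces, although N50 alone says nothing about correctness.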

Because BIRCH comes with hundreds of bioinformatics programs pre-installed, you are spared the often painful process of installing (and maybe compiling) each program, again from different authors. The BioLegato family of applications lets you quickly experiment with each program in the pipeline, and provides you with numerous ways to evaluate the results of one step before moving on to the next.

If you're interested in learning about genome assembly, try out the BioLegato Genome Assembly tutorial. You can also see this in action in the BIRCH - Desktop Bioinformatics (long version) video on our YouTube channel.

ADD REPLY • link 4.3 years ago by brian.fristensky ▴ 460

0

This is very interesting. Is BIRCH both local and Open Source? If it is, I'll definitely learn how to use it. Also thanks for the tutorial link.

ADD REPLY • link 4.3 years ago by rezaeir75 ▴ 40

0

All open source, and local. There are a few exceptions to the "local" part: for example, you can run a BLAST+ search either using locally installed copies of NCBI databases, or send the search to be run at NCBI. In either case, output pops up on your screen.

As further enticement, BIRCH now sends you an email message when a long-running job (e.g., correcting sequencing reads, genome assembly) is done. (You would need to set up a mail server on your computer to do that.)

ADD REPLY • link 4.3 years ago by brian.fristensky ▴ 460

0

Thank you so much for introducing this tool. I also checked the BIRCH website and it seems very promising. I'll definitely try it, but because I use WSL2 instead of a full Linux installation, it may be harder to use GUI software.

ADD REPLY • link 4.3 years ago by rezaeir75 ▴ 40