Beginner problems: Aligning "genomes" that come in contigs
3.2 years ago
ABS • 0

tl;dr: how do you align genomes when each genome is made up of multiple contigs?

Hi everyone, I'm very new to genomic analysis and I am not even sure how to ask this question. Google is giving me little to go on. I appreciate any help.

I am working with ~20Mb protistan genomes. I want to compare them to each other via whole genome alignment. The genomes are either downloaded directly from NCBI or were generated/assembled by my collaborators.

Each "whole genome" is divided up into ~20 contigs. When I align two or three genomes with Mauve via Geneious, there are no issues: the contigs are concatenated and the genomes are aligned to each other. However, I will eventually need to align dozens of these genomes. My little laptop does not have enough RAM to load all the genomes into Mauve, and I am currently battling my way through Mauve on the command line using our institution's supercomputer.

I tried submitting the sequences to the online version of MAFFT. I end up with each contig aligned to other contigs, rather than "genome" aligned to genome.

I'm wondering if I can just take out the greater-than signs (>) from each scaffold's FASTA file and replace them with Ns?

Concatenating via Geneious requires me to know the order of the contigs in relation to each other. I'm not sure what the order is.

Help me with the vocab regarding this problem. I'm still learning.

alignment whole-genome scaffold • 2.4k views

Just to clarify, you submit a single MAFFT alignment job on an online tool (link?), where you give it all your genomes of interest and it only aligns within each genome and not across?

One quick fix would be to concatenate all contigs from all genomes of interest into a single file and submit that. This may be problematic if there is a lot of repetitive material that will align within genomes.
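A minimal sketch of that quick fix, assuming plain-text FASTA files and using only the Python standard library. The function name `merge_genomes`, the file names, and the `genome|contig` header scheme are all hypothetical choices made for illustration; the prefix is there only so each contig can still be traced back to its genome after merging.

```python
from pathlib import Path


def merge_genomes(fasta_paths, out_path):
    """Write every contig from every input FASTA into one combined FASTA.

    Each header is prefixed with the source file's stem (assumed here to be
    a usable genome label). Returns the number of contigs written.
    """
    n_contigs = 0
    with open(out_path, "w") as out:
        for path in map(Path, fasta_paths):
            genome = path.stem  # assumption: file name doubles as genome name
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip()
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # Tag the contig header with its genome of origin.
                        out.write(f">{genome}|{line[1:]}\n")
                        n_contigs += 1
                    else:
                        out.write(line + "\n")
    return n_contigs
```

For example, `merge_genomes(["strainA.fasta", "strainB.fasta"], "all_genomes.fasta")` would produce headers such as `>strainA|contig_1`. The contigs are still separate records, so an aligner that treats each record independently will still align contig to contig.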

Hi! I've actually tried two approaches to submitting to MAFFT: (1) a single FASTA containing all my genome drafts, and (2) individual FASTAs for each draft genome. In both situations, MAFFT aligns individual contigs with other contigs. To concatenate, can one simply delete the greater-than signs (>) that designate a new contig in a FASTA file? I am also looking into whether CAMSA is a good option for me: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1919-y

Again, I don't know the order to put contigs in so any software recommendations are welcome.

By concatenate, I meant just joining all the genomes into a single FASTA file. Your idea of joining the contigs together to form one large pseudo-contig should work (you are essentially scaffolding the contigs manually, and you should add some Ns between them), but if you cannot get MAFFT to do what you want, there are alternatives for multiple whole-genome alignment.
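Joining contigs into a pseudo-contig with N spacers can be sketched as follows, using only the Python standard library. The helper names (`read_fasta`, `join_contigs`), the default spacer of 100 Ns, and the 60-column line width are assumptions for illustration, not something prescribed in this thread. Note that deleting only the ">" character would leave the header text inside the sequence, so the whole header line has to go. Contigs are joined in file order and in their given orientation, so the result says nothing about their true order in the genome.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def join_contigs(in_path, out_path, name, spacer_len=100, width=60):
    """Join all contigs in in_path into one record separated by runs of N.

    spacer_len and width are arbitrary defaults (assumptions).
    Returns the length of the joined sequence.
    """
    spacer = "N" * spacer_len
    # File order is kept as-is; it is NOT a biologically meaningful order.
    sequence = spacer.join(seq for _, seq in read_fasta(in_path))
    with open(out_path, "w") as out:
        out.write(f">{name}\n")
        for i in range(0, len(sequence), width):
            out.write(sequence[i:i + width] + "\n")
    return len(sequence)
```

A call such as `join_contigs("strainA.fasta", "strainA_joined.fasta", "strainA")` (hypothetical file names) writes one record per genome. The N runs keep an aligner from stitching an alignment across an artificial contig junction.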

As noted below, it depends on what you want to know. Maybe you can try Progressive Cactus or progressiveMauve.

Good luck!

What do you mean when you say that you want to compare the genomes to each other via whole genome alignment? Do you really want to look at an alignment that is tens of millions of nucleotides long? Maybe you're instead interested in summary statistics, e.g. what percentage of the bases are homologous? Maybe Minimap2's assembly-to-assembly alignment would give you everything you need? You can also estimate genomic distances quickly with Mash.
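As a rough sketch of the "statistics instead of a giant alignment" idea: Minimap2 writes its alignments in the tab-separated PAF format, whose first twelve columns include the query name, query length, query start and end, the number of matching residues (column 10), and the alignment block length (column 11). The function below, `summarize_paf` (a hypothetical name), turns those columns into two approximate numbers. The command in the comment uses hypothetical file names, and the `asm5` preset is only an assumption; a more divergent pair may need `asm10` or `asm20`.

```python
# Assumed upstream step (file names are placeholders):
#   minimap2 -x asm5 genomeA.fasta genomeB.fasta > A_vs_B.paf


def summarize_paf(paf_lines):
    """Summarize PAF records as approximate identity and query coverage.

    identity      = sum of matching residues / sum of alignment block lengths
    query_covered = aligned query bases (overlaps merged) / total length of
                    query contigs that appear in the PAF. Contigs with no
                    alignment are absent from PAF, so this is an upper bound.
    """
    matches = 0
    aln_len = 0
    spans_by_query = {}
    query_lengths = {}
    for line in paf_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        qname, qlen = fields[0], int(fields[1])
        qstart, qend = int(fields[2]), int(fields[3])
        matches += int(fields[9])
        aln_len += int(fields[10])
        query_lengths[qname] = qlen
        spans_by_query.setdefault(qname, []).append((qstart, qend))

    covered = 0
    for spans in spans_by_query.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:  # overlapping or adjacent: extend the span
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start

    total = sum(query_lengths.values())
    return {
        "identity": matches / aln_len if aln_len else 0.0,
        "query_covered": covered / total if total else 0.0,
    }
```

Usage would look like `with open("A_vs_B.paf") as fh: print(summarize_paf(fh))`. Because this works per contig, it sidesteps the contig-ordering problem entirely: no concatenation is needed to get pairwise similarity numbers.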
