Comparing Assemblies
3
5
Entering edit mode
14.3 years ago

I'm working with two transcriptome assemblies. They're essentially different versions of the same transcriptome, one of them just includes more data and is therefore (presumably) better quality. A paper was published a couple of years ago with the old assembly, and we're now preparing a manuscript with the new assembly. We want to know which contigs from the old assembly correspond to which contigs from the new assembly.

I have the consensus sequence of all of the contigs from each assembly. I used the formatdb utility to create a BLAST database of the sequences from the old assembly, and then I BLASTed the sequences from the new assembly against this database. So now I just need to process the BLAST results and determine the relationship between contigs from one version to another.

I guess another way of putting it is that we want to determine the synteny between the two versions of the assembly.

Any suggestions about how to proceed?

assembly blast • 5.3k views
ADD COMMENT
0
Entering edit mode

Maybe you would receive a more specific answer if you could tell us why you want to link the old contigs to the new ones. Are you interested in every single one or in a subset?

ADD REPLY
0
Entering edit mode

Sorry I've been out of it for over a week with a recent move. To be more specific, what I would really like is to know which contig from assembly A corresponds to which contig from assembly B. This would be easy if it was a 1-to-1 correspondence, but that is highly unlikely. The newer assembly has more sequence data and fewer contigs, so I'm assuming it is better quality. I guess I can hope for a 1-to-many relationship between the assemblies, but I don't know how I would handle a many-to-many relationship. Does this make sense?

ADD REPLY
0
Entering edit mode

Dear Daniel, did you finally manage to compare your two assemblies?

ADD REPLY
9
Entering edit mode
14.3 years ago

First of all, your data is transcriptome assembly, not chromosomal or genome assembly, therefore synteny is not applicable here. Synteny can be used to check the differences between contiguous sequences - however, transcripts are assembled separately into non-contiguous 'islands'.

I would collect a few statistics to quantify the difference, for example:

  • How much of assembly version 1 is present in 2 (coverage)?
  • Are there nucleotide differences between contig sequences of the two assemblies?
  • Are any of the changes due to different genotypes used, sequencing technologies or assembly software?

I think any sequence alignment software will do, including BLAST.

ADD COMMENT
4
Entering edit mode
14.3 years ago
Neilfws 49k

The key part of your question is:

So now I just need to process the BLAST results

There's an easy option and a more involved option. The easy option is to run BLAST so that it generates tab-delimited output. In the latest version of BLAST, use "-outfmt 6"; older versions (where you run "blastall") use "-m 8". This makes the output easy to open and process using the tools of your choice (spreadsheets, database import, shell tools...). It should be easy to extract query name, hit name and the start/end coordinates for both.

The more involved option is to write a BLAST parser in the language of your choice - but don't reinvent the wheel. All of the Bio* projects (e.g. Bioperl, Biopython and Bioruby) include libraries to do this. Bioperl, for example, used the Bio::SearchIO module and here is an introduction to its usage.

ADD COMMENT
0
Entering edit mode

I agree that this would easily produce a list of 'most similar' contigs between the new and old assembly. For unambiguous mapping, one could define a contig pair as (1) new one most similar to old one and (2) vice versa. The question is how to resolve ambiguities when there is no 1:1 mapping.

ADD REPLY
0
Entering edit mode

Sorry, I've been out of it for over a week with a recent move. @neilfws, before writing the post, I used BLAST with the -m 8 option. So I guess when I say "I need to process the BLAST results" I mean I need to know what to do with them. I feel very comfortable with parsing and text processing, but once I've loaded the data into memory, what do I do then? That's the question I was trying to ask, sorry for the confusion.

ADD REPLY
0
Entering edit mode
12.6 years ago

Go for compressed XML output with BLAST. Tabular format used to change and is not supported or recommended by some parsers any more. XML contains all information in a well-defined format and can still be converted to any other format that BLAST can produce.

ADD COMMENT

Login before adding your answer.

Traffic: 1310 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6