I'm working with two transcriptome assemblies. They're essentially different versions of the same transcriptome, one of them just includes more data and is therefore (presumably) better quality. A paper was published a couple of years ago with the old assembly, and we're now preparing a manuscript with the new assembly. We want to know which contigs from the old assembly correspond to which contigs from the new assembly.
I have the consensus sequence of all of the contigs from each assembly. I used the formatdb utility to create a BLAST database of the sequences from the old assembly, and then I BLASTed the sequences from the new assembly against this database. So now I just need to process the BLAST results and determine the relationship between contigs from one version to another.
I guess another way of putting it is that we want to determine the synteny between the two versions of the assembly.
Maybe you would receive a more specific answer if you could tell us why you want to link the old contigs to the new ones. Are you interested in every single one or in a subset?
Sorry I've been out of it for over a week with a recent move. To be more specific, what I would really like is to know which contig from assembly A corresponds to which contig from assembly B. This would be easy if it was a 1-to-1 correspondence, but that is highly unlikely. The newer assembly has more sequence data and fewer contigs, so I'm assuming it is better quality. I guess I can hope for a 1-to-many relationship between the assemblies, but I don't know how I would handle a many-to-many relationship. Does this make sense?
First of all, your data is transcriptome assembly, not chromosomal or genome assembly, therefore synteny is not applicable here. Synteny can be used to check the differences between contiguous sequences - however, transcripts are assembled separately into non-contiguous 'islands'.
I would collect a few statistics to quantify the difference, for example:
How much of assembly version 1 is present in 2 (coverage)?
Are there nucleotide differences between contig sequences of the two assemblies?
Are any of the changes due to different genotypes used, sequencing technologies or assembly software?
I think any sequence alignment software will do, including BLAST.
There's an easy option and a more involved option. The easy option is to run BLAST so that it generates tab-delimited output. In the latest version of BLAST, use "-outfmt 6"; older versions (where you run "blastall") use "-m 8". This makes the output easy to open and process using the tools of your choice (spreadsheets, database import, shell tools...). It should be easy to extract query name, hit name and the start/end coordinates for both.
The more involved option is to write a BLAST parser in the language of your choice - but don't reinvent the wheel. All of the Bio* projects (e.g. Bioperl, Biopython and Bioruby) include libraries to do this. Bioperl, for example, used the Bio::SearchIO module and here is an introduction to its usage.
I agree that this would easily produce a list of 'most similar' contigs between the new and old assembly. For unambiguous mapping, one could define a contig pair as (1) new one most similar to old one and (2) vice versa. The question is how to resolve ambiguities when there is no 1:1 mapping.
Sorry, I've been out of it for over a week with a recent move. @neilfws, before writing the post, I used BLAST with the -m 8 option. So I guess when I say "I need to process the BLAST results" I mean I need to know what to do with them. I feel very comfortable with parsing and text processing, but once I've loaded the data into memory, what do I do then? That's the question I was trying to ask, sorry for the confusion.
Go for compressed XML output with BLAST. Tabular format used to change and is not supported or recommended by some parsers any more. XML contains all information in a well-defined format and can still be converted to any other format that BLAST can produce.
Maybe you would receive a more specific answer if you could tell us why you want to link the old contigs to the new ones. Are you interested in every single one or in a subset?
Sorry I've been out of it for over a week with a recent move. To be more specific, what I would really like is to know which contig from assembly A corresponds to which contig from assembly B. This would be easy if it was a 1-to-1 correspondence, but that is highly unlikely. The newer assembly has more sequence data and fewer contigs, so I'm assuming it is better quality. I guess I can hope for a 1-to-many relationship between the assemblies, but I don't know how I would handle a many-to-many relationship. Does this make sense?
Dear Daniel, did you finally manage to compare your two assemblies?