Question

Visualizing Differences Between A Lot Of Sequences

7


14.9 years ago

hadasa ★ 1.0k

I have two datasets of sequences that were derived using two different workflows, so some sequences differ in their residues. I would like to visualize neatly where such differences occur. Is there software or an approach that can do this? Initially I had thought of just doing a multiple alignment. However, given the large number of sequences in both datasets, that is not a very 'clean' way of visualizing the differences. Any ideas?

visualization sequence • 5.4k views

updated 14.9 years ago by Khader Shameer 18k • written 14.9 years ago by hadasa ★ 1.0k

1


Hi, I modified the title of this question because it wasn't clear that you were asking for a specific solution for the case of having a lot of different sequences. Please roll back if you preferred the previous title. I would recommend that you explain a bit more about your sequences: are they short? Are they mostly similar? How many differences do you expect to find? Roughly how many sequences do you have?

14.9 years ago by Giovanni M Dall'Olio 28k

0


I am happy with the edit, since the problem can be generalised. As for the sequences, the ones I have are long, and the need is to locate the positions at which they differ and which residues actually contribute to the difference. I had attempted to use an MD5 checksum, but although I could easily see the number of different sequences, it is difficult to visualize at which positions the differences occur.

14.9 years ago by hadasa ★ 1.0k

score 4 · Answer 1 · 2010-05-18

Not sure if it applies to your data set, but assuming you have some reference sequence(s), be it genome or transcriptome:

1. Map your sets A and B separately to the reference.
2. Convert the output to SAM format if needed.
3. Convert from SAM to BAM using samtools.
4. Index the BAM files with samtools.
5. Load the reference and the 2 BAMs into the IGV viewer.
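
The samtools steps above can be sketched as a small Python helper. This is only an illustration: the file names are hypothetical, the mapping step itself is left to whichever aligner you use, and a `samtools sort` step is added on the assumption that your aligner does not already emit coordinate-sorted output (indexing needs a sorted BAM).

```python
import subprocess
from pathlib import Path


def samtools_commands(sam_path):
    """Build the samtools calls that turn one SAM file into a sorted, indexed BAM.

    Returns a list of argument lists; nothing is executed here.
    """
    sam = Path(sam_path)
    stem = str(sam.with_suffix(""))          # "set_A.sam" -> "set_A"
    bam = stem + ".bam"
    sorted_bam = stem + ".sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, str(sam)],    # SAM -> BAM
        ["samtools", "sort", "-o", sorted_bam, bam],        # coordinate sort
        ["samtools", "index", sorted_bam],                  # writes the .bai index
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical usage, one call per dataset (requires samtools on PATH):
# for sam in ("set_A.sam", "set_B.sam"):
#     run_all(samtools_commands(sam))
```

The resulting `set_A.sorted.bam` and `set_B.sorted.bam` (with their index files alongside) are what you would load into IGV together with the reference.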

This works for millions of NGS Illumina reads, but obviously is meant only to give you some general idea about sequence differences. No way you can compare 20k transcripts by just looking at them and noticing the differences...

edit: improved points mangled during formatting

score 4 · Answer 2 · 2010-05-18

You could generate a PCA plot based on the normalized alignment score or similar scores derived from the alignments of two datasets to visualize the overall diversity between sequence datasets. I have used PCA plots for visualization of similarity/diversity in protein sequence families.
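
To make the idea concrete, here is a minimal sketch of the projection step, assuming you already have a score matrix in which each row is one sequence and each column is its normalized alignment score against some fixed set of sequences (how those scores are computed is up to you). It uses plain-Python power iteration only to stay self-contained; with real data you would more likely use a numerical library. The function name `pca_project` is made up for this example.

```python
import math


def pca_project(rows, k=2, iters=200):
    """Project each row onto the top-k principal components.

    Uses power iteration with deflation on the covariance matrix.
    Assumes at least two rows and rows of equal length.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[i] * x[j] for x in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    components = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(d)]      # fixed, deterministic start
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:                       # no variance left to explain
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        components.append(v)
        # Deflate: remove the direction just found before looking for the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    return [[sum(x[j] * c[j] for j in range(d)) for c in components]
            for x in centered]


# Toy score matrix: two sequences from workflow A, two from workflow B.
scores = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.9],
    [0.1, 0.2, 0.9, 1.0],
]
coords = pca_project(scores)   # one (PC1, PC2) pair per sequence
```

Plotting the two columns of `coords` as a scatter plot, coloured by dataset, gives the kind of overview described above: sequences from the two workflows that score alike land close together.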

Sample PCA plots are uploaded for SH2 and SH3 protein domain families based on normalized alignment scores.

Edit:

I recently heard of Circoletto for visualizing sequence similarity with Circos.

score 3 · Answer 3 · 2010-05-18

I would simply open them in software for visualizing alignments.

My favorite one is Jalview, but there are many others: many people use CLC Sequence Viewer, which is proprietary but free to use, or MEGA, which can also visualize sequences.

All of these programs can do the alignment themselves after loading the sequences. Alternatively, you can do the alignment with another program first, if you want to customize some parameters.

score 3 · Answer 4 · 2010-05-18

For each sequence, calculate its MD5 sum. This gives a fixed-length string that is determined entirely by that exact sequence of characters, and it is very unlikely to be shared by any other sequence.

Group the sequences by their MD5 sum. This lets you recognize all the sequences that are identical, so you can concentrate on the ones that differ. You can then use a sequence viewer on what is left: with the MD5 sums you can keep only one sequence when there are multiple identical ones.
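
A minimal sketch of the grouping, assuming each dataset is available as a list of `(name, sequence)` pairs (the helper names and the toy records are made up for illustration). Sequences are upper-cased before hashing so that case differences alone do not count as differences; drop that if case matters in your data.

```python
import hashlib
from collections import defaultdict


def group_by_md5(records):
    """Map MD5 digest -> names of all records with exactly that sequence."""
    groups = defaultdict(list)
    for name, seq in records:
        digest = hashlib.md5(seq.upper().encode("ascii")).hexdigest()
        groups[digest].append(name)
    return dict(groups)


def unique_to_each(set_a, set_b):
    """Return the groups whose sequence occurs in only one of the two datasets."""
    groups_a = group_by_md5(set_a)
    groups_b = group_by_md5(set_b)
    only_a = [names for digest, names in groups_a.items() if digest not in groups_b]
    only_b = [names for digest, names in groups_b.items() if digest not in groups_a]
    return only_a, only_b


# Toy data standing in for the two workflows' outputs.
set_a = [("a1", "ACGT"), ("a2", "acgt"), ("a3", "ACGA")]
set_b = [("b1", "ACGT"), ("b2", "ACGC")]
only_a, only_b = unique_to_each(set_a, set_b)
```

Here `a1`, `a2` and `b1` collapse into one group because they are the same sequence, leaving only `a3` and `b2` to inspect further.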

If you still have a lot of distinct sequences after this step, you can train a Markov model on the whole alignment (or simply create a custom Markov matrix) and then assign a score to each sequence by applying that same model. You can also calculate other parameters, e.g. the GC content. Then you can cluster your sequences into bins of similar Markov scores, and you will be able to see the sequences that are very different from the average.
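
As a rough sketch of what such scoring could look like, the code below trains a first-order Markov model (transition probabilities between adjacent residues, with add-one smoothing) and scores each sequence by its average log-probability per transition. The model order, the smoothing, the DNA alphabet and the function names are all assumptions made for this example, and it expects ungapped upper-case sequences of length two or more that contain only alphabet characters.

```python
import math
from collections import Counter


def train_markov(seqs, alphabet="ACGT"):
    """Estimate log transition probabilities from all sequences (add-one smoothing)."""
    counts = Counter()
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a, b] += 1
    model = {}
    for a in alphabet:
        total = sum(counts[a, b] for b in alphabet) + len(alphabet)
        for b in alphabet:
            model[a, b] = math.log((counts[a, b] + 1) / total)
    return model


def markov_score(seq, model):
    """Average log-probability per transition; lower means less typical."""
    pairs = list(zip(seq, seq[1:]))
    return sum(model[p] for p in pairs) / len(pairs)


def gc_content(seq):
    """Fraction of G and C residues in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy data: three similar sequences and one outlier.
seqs = ["ACACACAC", "ACACACAC", "ACACACAC", "GGGGTTTT"]
model = train_markov(seqs)
scores = [markov_score(s, model) for s in seqs]
```

Dividing by the number of transitions keeps scores comparable between sequences of different lengths. Binning the values in `scores` (optionally alongside `gc_content`) then makes the atypical sequences stand out.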

score 3 · Answer 5 · 2010-05-18

Are you interested in where the residue differences occur, or the residue differences themselves, where they do occur?

I think that the Sequence Logo approach could serve both purposes if the sequences are short. If they're long, and the differences are sparse along the sequence, you could cut the MSA into smaller pieces for the sites where the residues differ and make individual sequence logos for each site.
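
To find the sites worth cutting out, one option is to scan the alignment column by column and keep only the columns where more than one residue occurs. This is a minimal sketch that assumes the sequences are already aligned to equal length; the function name is made up, and positions are reported 1-based.

```python
from collections import Counter


def variable_columns(aligned):
    """Return {position: residue counts} for alignment columns that are not constant."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = {}
    for pos, column in enumerate(zip(*aligned), start=1):
        counts = Counter(column)
        if len(counts) > 1:              # more than one residue seen at this site
            result[pos] = dict(counts)
    return result


# Toy alignment of three sequences.
msa = ["ACGT", "ACGA", "AGGT"]
sites = variable_columns(msa)
```

The keys of `sites` tell you where the differences occur, and the per-column counts tell you which residues are involved, which is the same information a sequence logo for that site would display.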

If you're looking to visualize and compare NGS genome assemblies, try Tablet or MagicViewer.