Hello all :)
I hope it is OK that I made a new thread for this, because the old thread lost focus somewhat and became a long discussion about existing k-mer tool performance (which is OK, I think that was a useful conversation to have also).
Anyway, the tool I started writing to do k-mer analysis for all lengths of k is almost finalized, and now we are looking at ways of using this information to build visualizations of DNA composition. I don't really want to talk about the tool or how it works just yet (although you can read about it on github.com/JohnLonginotto/ACGTrie/, and if you want to contribute I'd be very happy!), because the specifics might change in the future: a fellow open-source developer, Jim Bailey, is working on the format and improving it dramatically. For today I really just want to talk about ways to use and visualize such data.
So! Imagine there exists a file which you can query to very quickly get counts for all substrings from your sequenced data (BAM/FASTA/etc). So you ask how many "A"s there were, or how many "AT"s, or "ACTAGCCATGACAGTCTATCTAGTCTA"s, and immediately get back a number for how often that sequence appeared in all of your reads. How would you use such data?
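To make the question concrete, here is a minimal sketch of that kind of lookup. It is not how ACGTrie stores or answers queries; it is just a plain in-memory Python dictionary, and the names `count_substrings`, `reads` and `max_k` are made up for illustration.

```python
from collections import Counter

def count_substrings(reads, max_k):
    """Count every substring of length 1..max_k across all reads.

    Illustrative only: a real tool would use a far more compact structure.
    """
    counts = Counter()
    for read in reads:
        for start in range(len(read)):
            longest = min(max_k, len(read) - start)
            for k in range(1, longest + 1):
                counts[read[start:start + k]] += 1
    return counts

# Toy "sequencing run" of two reads (hypothetical data).
reads = ["ACGTAC", "ATAT"]
counts = count_substrings(reads, max_k=4)

print(counts["A"])     # how many "A"s
print(counts["AT"])    # how many "AT"s
print(counts["GGGG"])  # a sequence that never appears gives 0
```

A `Counter` returns 0 for sequences that were never seen, which matches the "ask anything, get a number back" behaviour described above.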
One thing that might be nice is looking at over-represented sequences, i.e. sequences which have more than the average number of counts for that length of sequence. This is probably better served as a table or bar chart. Another might be to just look at the composition of the DNA as a whole, to see how many reads were in repeat regions rather than regions of high complexity, or to answer some other more general questions. For this I see interactive visualizations being more appropriate.
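A rough sketch of the over-representation idea, under a couple of assumptions: "average" is taken to mean the total number of k-mer windows divided by the 4^k possible k-mers (so reads are assumed to contain only A/C/G/T), and the `min_fold` cut-off is an arbitrary illustrative parameter. The function names are hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer (overlapping windows) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def over_represented(reads, k, min_fold=2.0):
    """Return (kmer, count, fold) rows where count >= min_fold * average.

    Assumption: the average is total windows / 4**k, i.e. the count each
    k-mer would have if all k-mers were equally common.
    """
    counts = kmer_counts(reads, k)
    total = sum(counts.values())
    if total == 0:
        return []
    mean = total / 4 ** k
    rows = [(kmer, n, n / mean) for kmer, n in counts.items() if n / mean >= min_fold]
    # Highest fold first, ready for a table or bar chart.
    return sorted(rows, key=lambda row: row[2], reverse=True)

result = over_represented(["ATATATAT", "ACGT"], k=2)
for kmer, n, fold in result:
    print(f"{kmer}\t{n}\t{fold:.1f}x")
```

On this toy input only the repeat-derived 2-mers (AT and TA) clear the threshold, which is the kind of row you would expect at the top of the table.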
Working with another really fantastic open-source developer, Andrei Kashcha (you may have seen his talks if you are a JavaScript/graph nerd like myself), we are trying to find an interactive visualization that helps people immediately see the overall composition of the DNA. So far in the last... well, only 7 days... we have come up with 2 ways of looking at the data. The first was plotting it as a graph, where the position of the nodes is based on fractals. It didn't really work so well at the higher lengths of sequence because the nodes became too small and there were just too many to plot, but there is definitely an idea in there somewhere.
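A minimal sketch of one fractal-style layout, assuming a chaos-game-like scheme in which each base selects a quadrant of the current square. This is an assumption for illustration and not necessarily the layout used in either demo; `CORNERS` and `fractal_position` are hypothetical names.

```python
# Which quadrant each base selects (an arbitrary, illustrative assignment).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def fractal_position(kmer):
    """Return (x, y, side) for a k-mer inside the unit square.

    Each base halves the current cell and moves into one quadrant of it,
    so a k-mer always sits inside the cell of its own prefix.
    """
    x, y, side = 0.0, 0.0, 1.0
    for base in kmer:
        side /= 2
        cx, cy = CORNERS[base]
        x += cx * side
        y += cy * side
    # Report the centre of the final cell plus its side length.
    return x + side / 2, y + side / 2, side

for kmer in ["A", "AG", "AGT", "AGTCAGTC"]:
    print(kmer, fractal_position(kmer))
```

The cell side is 2^-k, so there are 4^k cells at length k. That is exactly the problem described above: by length 8 each node gets a cell less than half a percent of the plot width.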
Then a day later Andrei made this - kind of the same idea but with 2D fractals in a 3D space: http://anvaka.github.io/actg/ Use WASD and the arrow keys to move around (although I'd just stick to the left and right arrows); typing into the search box will draw lines between nodes showing the path it takes. It only has mers up to 5 in length because again it's a lot of dots and this is just a quick demo.
In this visualization it is really obvious that there are not nearly as many CpGs as there are other 2-mers, and the same goes for any mer with CpG in it (ACG/CGT/TGCGT/etc), which obviously reflects the biology and is very striking. However, Andrei and I both agree it is still far away from an optimal way to get a feel for the data.
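The CpG depletion can also be put into a single number. The sketch below uses the common observed/expected ratio, (number of CG) x (sequence length) / (number of C x number of G); this is a standard measure swapped in for illustration, not something the visualization itself computes. `cpg_obs_exp` is a hypothetical name, and with a substring-count file the three counts could simply be looked up instead of recomputed.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for one sequence.

    About 1.0 means CG occurs as often as the C and G content predicts;
    values well below 1.0 indicate CpG depletion.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cg * n / (c * g)

for s in ["CCGG", "ACGTACGT", "CATGCATG"]:
    print(s, cpg_obs_exp(s))
```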
At the end of the day, the optimum way of displaying data depends on what kind of information is most important to you - and that depends on what kinds of questions you want to ask...
I know what kind of data I would like to see, but there may have been obvious (and non-obvious) things I haven't thought of, which is why I'm asking here! What would you want to get out of a DNA composition visualization? How would you, in an ideal world, visualize the data?
Thank you so much for your time. This was a pretty long post and I appreciate you reading all of it :)
Oh, I like the idea of the colour intensity/alpha being the relative abundance for that sequence, while the size is the absolute abundance. I will give a treemap a go :)
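As a starting point, here is a small sketch of how the tile attributes for such a treemap might be derived from a table of counts. It assumes "relative abundance" means a sequence's count divided by the highest count among sequences of the same length; that definition, and the name `treemap_tiles`, are illustrative assumptions. It only prepares the data and does not draw anything.

```python
def treemap_tiles(counts):
    """Turn {sequence: count} into tile descriptions for a treemap.

    size  = absolute abundance (the raw count)
    alpha = relative abundance, assumed here to be count / max count
            among sequences of the same length (a value in 0..1)
    """
    peak = {}
    for seq, n in counts.items():
        peak[len(seq)] = max(peak.get(len(seq), 0), n)
    return [
        {"seq": seq, "size": n, "alpha": n / peak[len(seq)]}
        for seq, n in sorted(counts.items())
        if n > 0  # sequences that never occur get no tile
    ]

tiles = treemap_tiles({"A": 4, "C": 1, "AT": 2, "CG": 0})
for tile in tiles:
    print(tile)
```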