If I made a fake FASTA with 10x the material of a book, randomly cut and spliced, and used it to align a book, could TopHat2 reconstruct the book? Or does it only work with ATCG letters?
Just a question that came to mind today.
You would need to re-encode the book as ACGT. For example, 1 ASCII character is 8 bits, corresponding to 4 nucleotides if you use the simplest possible encoding (rather than trying to pack into 7 or 6.5 bits, or whatever). Thus for an ASCII-formatted text file of the book, the encoded book would be 4x as long, but the mapping would work fine.
You MIGHT be able to map to the raw book using some protein aligners, as those allow more symbols.
As Istvan said, though, you'd need to use an assembler to reconstruct the book, not an aligner.
The encoding is reversible.
https://en.wikipedia.org/wiki/ASCII
For example, "Hi" -> 01001000 01101001 -> CAGA CGGC, where 00 -> A, 01 -> C, 10 -> G, 11 -> T.
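A minimal Python sketch of that 2-bit encoding, in case it helps to try it out. The function names `encode` and `decode` are made up for illustration, and it assumes the text is plain ASCII stored as one byte per character.

```python
# 2-bit code from the example: 00 -> A, 01 -> C, 10 -> G, 11 -> T
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(text: str) -> str:
    """Encode ASCII text as nucleotides, 4 bases per character."""
    # Each byte becomes 8 bits, which are then read off two at a time.
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> str:
    """Reverse of encode(): 4 bases back to 1 ASCII character."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("ascii")


print(encode("Hi"))          # CAGACGGC
print(decode("CAGACGGC"))    # Hi
```

The output is 4x as long as the input, as described above, and decoding only works if the sequence is read in the correct 4-base frame.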
In principle yes, in practice it might not work out that well. TopHat is built to recognize splicing dinucleotides that show the most likely splice locations. Then everything depends on the content and the length of the pieces.
But of course, to do that you would need a book to align against, so "reconstructing" the book does not make sense here: you already need to have the book to align with TopHat.
This is an interesting question. Instead of using a tool like TopHat, I would suggest trying Vmatch because it allows you to define any alphabet you like (not just DNA/RNA or protein). You would define the alphabet with the mkvtree program when you create the index (suffix tree) and then you could map your words or sentences to the book with vmatch. I imagine this approach would require less work than recoding your data or modifying an existing DNA aligner. It would be easy enough to try this, and I'm sure someone has, but I can't say I've done this myself.
This problem seems better suited to a de novo assembly program like Velvet, SGA, ALLPATHS-LG, ABySS, SOAPdenovo, etc. As Brian suggested, you may want to re-encode the book as ACGT.
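To make the aligner vs. assembler distinction concrete, here is a toy greedy overlap assembler in Python that works directly on text fragments. This is only a sketch: it is not how the assemblers listed above work internally (most of them build de Bruijn graphs, and SGA uses a string graph), and the names `overlap`, `greedy_assemble`, and `min_len` are made up for illustration. It also assumes error-free fragments and no repeated phrases, which a real book would not satisfy.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained in another fragment.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; return the separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


text = "Call me Ishmael. Some years ago"
# Overlapping 12-character "reads", deliberately given out of order.
starts = [12, 0, 19, 8, 16, 4]
reads = [text[s:s + 12] for s in starts]
print(greedy_assemble(reads))  # ['Call me Ishmael. Some years ago']
```

With repeated phrases (which books have plenty of), a greedy approach like this can join the wrong fragments, which is the same problem repeats cause in genome assembly.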