I have in hand the results of a 454 sequencing experiment, and I am trying to reassemble the transcriptome it represents. I would like to know what the general consensus is on whether repetitive elements within the reads should be masked prior to assembly.
My own tests seem to show that not masking the reads leads to erroneous constructs, but I am looking for better-informed opinions or literature that would help me wrap my head around the issue.
I see where you are coming from, but I don't think this is a necessary step.
Repeats foil all assemblies, but most assemblers (even Overlap-Layout-Consensus ones like Newbler) know not to pursue overlaps greedily in the face of ambiguity; they just break the assembly.
Let's say you had two transcripts:
A-R-B-R-C
A-R-D
If no read pair spanned a repeat, I think you would probably get the following contigs:
A-R
R-B-R
R-C
R-D
So now at least you have some, albeit ambiguous, overlaps that could assist scaffolding. If you mask the repeats then you lose those, too.
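To make that concrete, here is a toy sketch in Python. It treats each contig as a list of segment labels, not real sequence, and lists every pair of contigs where the end of one matches the start of another. The helper names (`overlap`, `candidate_joins`, `mask`) are made up for this illustration, and "masking" is modelled simply as deleting the repeat segment R, which is a simplification of what a real masker does.

```python
from itertools import permutations

def overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return []

def candidate_joins(contigs):
    """List (left, right, shared) for every ordered pair of contigs that overlap."""
    joins = []
    for left, right in permutations(contigs, 2):
        shared = overlap(left, right)
        if shared:
            joins.append(("-".join(left), "-".join(right), "-".join(shared)))
    return joins

def mask(contigs, repeat="R"):
    """Toy masking: drop the repeat segment, then discard contigs left empty."""
    masked = [[seg for seg in contig if seg != repeat] for contig in contigs]
    return [contig for contig in masked if contig]

# The four contigs from the example above, split into segment labels.
contigs = [c.split("-") for c in ["A-R", "R-B-R", "R-C", "R-D"]]

unmasked_joins = candidate_joins(contigs)
masked_joins = candidate_joins(mask(contigs))

for left, right, shared in unmasked_joins:
    print(f"{left} -> {right} (shared: {shared})")
print("joins left after masking:", len(masked_joins))
```

With the repeat kept, every join goes through R and is ambiguous, but the links are still there for a scaffolder to weigh. With R removed, the contigs A, B, C and D share nothing at all.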
That said, I have not tried masking repeats, so you might be onto something.
What assemblers have you tried for this task? I would say that if you are working on extremely repetitive sequences, and you know that the transcripts you are interested in are not in those regions, you could consider masking them out.
But there are some very good assemblers out there that do a pretty good job of assembling repeats (like MIRA), especially if you are working with long 454 reads (what is the average length?).
Could you explain in more detail the strange behaviour you observe in your transcript constructs? We might then be able to circumvent the problem.
I've been using CAP3 for assembly. I've considered trying alternate assemblers (MIRA, especially), but I've not yet had the time to do so. However, regardless of the assembler being used, I figured that "to mask or not to mask" will still be a valid question and that I might as well make up my mind about it right now.
The 454 reads aren't particularly long. After cleaning and removal of short reads, they average ~270 bp.
I might have spoken too fast when talking about "Erroneous constructs", given that I haven't come up with any objective quality metrics. The assemblies certainly are different, though.
Some transcripts end up being (correctly) longer when using masked data. Others end up broken into pieces, even though they do not contain repeated elements. Some of the transcripts with the most reads in the non-masked assembly are also apparent misassemblies: repeated regions A-B-C form a single transcript, whereas they are only found as A-C in reference sequences.
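One simple way to put numbers on "different" would be to compare basic summary statistics of the masked and non-masked assemblies. Below is a minimal Python sketch that reads contigs in FASTA format and reports contig count, total length, longest contig and N50. The function names and the two tiny in-memory "assemblies" are invented for illustration; real use would pass open file handles to the assembler's contig output. These numbers describe contiguity only, not correctness: a misjoined A-B-C contig would raise N50, so they would still need to be read alongside a comparison to reference sequences.

```python
import io

def fasta_lengths(handle):
    """Return the length of every sequence in a FASTA-formatted text stream."""
    lengths, current = [], None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

def summarize(handle):
    """Collect a few contiguity statistics for one assembly."""
    lengths = fasta_lengths(handle)
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }

# Made-up toy data standing in for the two assemblies' contig files.
unmasked = io.StringIO(">c1\nACGTACGTAC\n>c2\nACGT\n")
masked = io.StringIO(">c1\nACGTAC\n>c2\nACGT\n>c3\nACGT\n")

unmasked_stats = summarize(unmasked)
masked_stats = summarize(masked)
print("non-masked:", unmasked_stats)
print("masked:    ", masked_stats)
```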
Regarding read length, I actually wanted to know the length of the repeats compared to the read length. Are we talking about repeats much shorter than the reads, or the other way around?
On the "erroneous" part, have you looked at the methods in the papers for the reference sequences? By this I mean that if they all masked their repeats and observed A-C, this doesn't mean that A-C is the correct assembly, only that it is the assembly you obtain when repeat masking...
Side question: which assembler are you using? Have you tested several?
All repeats in transcriptome reads come from transcriptionally active retrotransposons, and you do not want them if your goal is to reconstruct the coding part of your organism of interest.
I can see the merit of keeping them in genomic assemblies, but they are just noise in transcriptome assemblies.
Again, if your goal is NOT to study repeats, I find no reason for including them in the reads that go to the assembler.
Do a thorough search of your reads against an established repeat database for the organism of interest and remove the reads that match repeats.
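The clean-up step could be sketched in Python roughly as follows. This assumes the search has already been run and produced a tab-separated hit table with the read ID in the first column (as in BLAST tabular output); the function names, the read IDs and the repeat name in the toy data are all hypothetical. It drops whole reads with any hit, which is one possible policy; trimming or masking only the matching region would need the hit coordinates as well.

```python
import io

def repeat_hit_ids(hit_table):
    """Collect read IDs from the first column of a tab-separated hit table."""
    ids = set()
    for line in hit_table:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.split("\t")[0])
    return ids

def read_id(header):
    """The read ID is taken to be the first whitespace-delimited word of the header."""
    fields = header.split()
    return fields[0] if fields else ""

def filter_fasta(fasta, exclude_ids):
    """Yield (header, sequence) for FASTA records whose read ID is not excluded."""
    header, chunks = None, []
    for line in fasta:
        line = line.strip()
        if line.startswith(">"):
            if header is not None and read_id(header) not in exclude_ids:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None and read_id(header) not in exclude_ids:
        yield header, "".join(chunks)

# Made-up toy inputs: three reads, one of which hit a repeat.
hits = io.StringIO("# hypothetical hit table\nread2\tAlu\t95.0\n")
reads = io.StringIO(">read1 sample\nACGT\nAC\n>read2\nGGGG\n>read3\nTTTT\n")

kept = list(filter_fasta(reads, repeat_hit_ids(hits)))
for header, seq in kept:
    print(f">{header}\n{seq}")
```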
The examples the previous contributors have presented (e.g. that repeats might help to bridge contigs) are valid for genomic assemblies and not for transcriptome.
I am very much interested in the answer to this question.