I need to identify repetitive elements of 5 genotypes from the same species (a plant species that is not yet sequenced). For that, I have different datasets:
1) Illumina paired-end reads from these 5 genotypes
2) BAC sequencing reads from only one genotype (I do not know the sequencing platform, but the reads are 300 nt long)
I am unsure how to approach the first step: assembly. My initial idea was to do a de novo assembly with SOAPdenovo, or to use an assembler designed for identifying repeats, such as SWA. However, I do not know how to handle the sequencing data from the BACs or how to combine the Illumina paired-end reads with the BAC reads.
Transposome was designed to identify repeats from unassembled paired-end reads, so this would be a good choice for dataset 1 above. Don't hesitate to email or message me if you have questions about this toolkit. If you are not experienced with command-line programs, you may want to try RepeatExplorer, but the results will vary between the programs. The reference set of repeats you use for annotation with either program is also important to consider, and having a set of repeats from a closely related species is ideal (though RepBase can be used with either program).
For the BAC-end reads, I would assemble those first (depending on what kind of coverage you have). Given the length of reads you have, I bet you have some 454 data that was generated a few years ago, or possibly some MiSeq data that was generated more recently. I recommend using Newbler (Roche's proprietary assembly software) or MIRA (freely available) for assembly. I have had good experience with both of these tools for assembling BACs.
If your BAC coverage is very low, then unfortunately this set of reads won't add anything to what you can get from the Illumina reads alone. In the best-case scenario, you would be able to assemble your BACs to identify full-length TEs and use the Illumina reads to get an unbiased sample of repeat properties in the genome as a whole.
written 10.2 years ago by SES • updated 3.0 years ago by Ram
I have extensive experience with command-line programs and bash scripting, so that does not worry me. If I had only Illumina reads, trying Transposome would be a must. However, don't you think that the best choice here is to 1) assemble the BAC sequences, 2) improve the assembly with the Illumina reads, and 3) run something like RepeatMasker over the resulting assembly in order to find the repeats? By the way, thanks for answering!
It really depends on how many BACs you have, but I would still take the approach I mentioned. I say that because with Transposome you can get an accurate estimate of repeat properties in a few minutes from just ~100k reads or so (though using 1 million reads would be better). You will never get a genome-wide picture of repeat properties from BACs unless they are randomly selected and you have a good number of them, and that is something reviewers will point out.
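As a rough illustration of drawing such a subset of reads, here is a minimal Python sketch that samples a fixed number of read pairs from two FASTQ files while keeping mates together. The function names, file paths, and the default of 100,000 pairs are assumptions for illustration only; this is not part of Transposome, and you should check its documentation for the input format it expects.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped, 4-line records)."""
    while True:
        # Normalise line endings so the last record always ends with a newline.
        record = tuple(line.rstrip("\n") + "\n" for line in islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_path, r2_path, out1_path, out2_path, n_pairs=100_000, seed=42):
    """Reservoir-sample n_pairs read pairs so that mates stay in sync."""
    rng = random.Random(seed)
    reservoir = []
    with open(r1_path) as r1, open(r2_path) as r2:
        for i, pair in enumerate(zip(read_fastq(r1), read_fastq(r2))):
            if i < n_pairs:
                reservoir.append(pair)
            else:
                # Replace an existing pair with probability n_pairs / (i + 1).
                j = rng.randint(0, i)
                if j < n_pairs:
                    reservoir[j] = pair
    with open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for rec1, rec2 in reservoir:
            o1.writelines(rec1)
            o2.writelines(rec2)
    return len(reservoir)
```

Reservoir sampling reads each file once and holds only the sampled pairs in memory, which matters when the full read set is large. Fixing the seed makes the subsample reproducible across the 5 genotypes.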
To add to this, I wouldn't use RepeatMasker at any point. This tool is not for identifying repeats, so it would be better to use specialized tools for transposon discovery in assembled sequences, or unassembled sequences, depending on the data.
RepeatMasker is for "masking" repeats. It tells you which genomic regions are similar to sequences in a reference database. So, the results will be reliable if you have a set of sequences from that species, but it's not a good approach for discovering transposable elements (TEs) in non-model species. For example, I can mask only 50% of the bases of sunflower TEs with RepBase. For annotation or evolutionary studies, it is best to take a more tedious but more accurate approach to identifying real TEs.
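To put a number on how much of a sequence set a reference library covers, one option is to measure the fraction of masked bases in the masked FASTA output. A minimal Python sketch, assuming the file is either soft-masked (lowercase, as produced by RepeatMasker's `-xsmall` option) or hard-masked with N or X characters; the function name and the file path are hypothetical:

```python
def masked_fraction(fasta_path):
    """Return the fraction of bases that are masked in a FASTA file.

    Counts lowercase letters (soft masking) and N/X (hard masking).
    Note: Ns already present in the unmasked input are counted as masked too.
    """
    total = 0
    masked = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip header lines
            seq = line.strip()
            total += len(seq)
            masked += sum(1 for base in seq if base.islower() or base in "NX")
    return masked / total if total else 0.0
```

For example, `masked_fraction("bac_contigs.fasta.masked")` (a hypothetical path) returns a value between 0 and 1. A low value for sequences known to be TEs suggests the reference library is missing much of what is in the genome.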
I have about 10-15 BACs sequenced so far (I don't remember the exact number). So, do you suggest using Transposome with the Illumina reads and ignoring the BAC sequences? If that is not what you meant, what should I do with the BAC sequences?
Maybe the best approach is to combine BAC assembly + RepeatMasker to identify TEs that are annotated in databases, while simultaneously using Transposome to identify novel, unannotated TEs. What do you think?
It depends on what your end goal is, but my approach would be to assemble the BACs and identify TEs in those using programs for ab initio or model-based TE discovery, not a similarity-based method like RepeatMasker (which is designed for masking repeats). In addition, you would include an analysis of the Illumina data to describe whole genome properties. Taken together, you would be able to describe fine-scale structural and demographic properties of TEs from the BACs, and global properties of repeats based on abundance distributions from the WGS data.
The goal is to compare the repetitive sequences among the 5 genotypes. Once I've assembled the BACs, how would you integrate the WGS data?
You would incorporate the WGS data through the use of Transposome, which will easily allow you to compare the genotypes. I would not even consider trying to incorporate the WGS data into the BAC assemblies. You would need very high genome coverage for there to be a chance of adding anything, and even then it would be very time-consuming and would likely add nothing to the assembly. Once you have full-length elements from the BACs, you could get variation in these elements from the WGS data as a supplement to the Transposome analysis.
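Once you have a per-genotype summary of repeat abundance, the comparison itself is simple bookkeeping. Here is a minimal Python sketch, assuming you have already collected, for each genotype, the genome percentage assigned to each repeat family (for example from each run's annotation summary). The input structure and the function name are hypothetical and are not a Transposome output format.

```python
def compare_families(abundance):
    """abundance: {genotype: {family: genome_percent}} -> rows sorted by spread.

    Each row is (family, {genotype: percent}, max - min); families missing
    from a genotype are treated as 0.0.
    """
    families = sorted({fam for per_geno in abundance.values() for fam in per_geno})
    rows = []
    for fam in families:
        values = {geno: per_geno.get(fam, 0.0) for geno, per_geno in abundance.items()}
        spread = max(values.values()) - min(values.values())
        rows.append((fam, values, spread))
    # Families that differ most among genotypes come first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

The families at the top of the returned list are the ones whose abundance differs most among genotypes, which makes them natural candidates for a closer look using the full-length elements from the BACs.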
ADD REPLY
• link
updated 3.0 years ago by
Ram
44k
•
written 10.2 years ago by
SES
8.6k
0
Entering edit mode
But Transposome would give me the repetitive elements found from the WGS data, right? How would I get the variation between elements from my BACs and elements from Transposome output, using BLAST, BLAT, or anything similar? Thank you for being so kind.
By either of those methods, or an aligner like BWA + samtools to get variants.
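A minimal sketch of the alignment-based route, written as a Python helper that only assembles the command lines and runs nothing by itself. It assumes `bwa`, `samtools`, and `bcftools` are installed, and it uses `bcftools` for the calling step. The function name and all file names are hypothetical.

```python
import shlex


def variant_calling_commands(ref_fa, reads_1, reads_2, prefix, threads=4):
    """Build shell commands for mapping WGS reads to TE sequences and calling variants."""
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf.gz"
    q = shlex.quote  # protect paths containing spaces or shell metacharacters
    return [
        # Index the reference (e.g. full-length elements extracted from the BACs).
        f"bwa index {q(ref_fa)}",
        # Align the paired-end reads and coordinate-sort the alignments.
        f"bwa mem -t {threads} {q(ref_fa)} {q(reads_1)} {q(reads_2)} "
        f"| samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
        # Pile up the reads and call SNPs and small indels.
        f"bcftools mpileup -f {q(ref_fa)} {q(bam)} "
        f"| bcftools call -mv -Oz -o {q(vcf)}",
    ]
```

Each returned string can be pasted into a terminal or run with `subprocess.run(cmd, shell=True, check=True)`. Calling the helper once per genotype gives one VCF per genotype against the same set of elements.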
In the case of BWA, we would use the WGS reads instead of the entire elements, wouldn't we? However, samtools will not detect large indels or CNVs. And in the case of the Transposome output, would it be BLAST/BLAT?
Identifying TEs in the BACs and then identifying variants in the genotypes from the WGS data would be one way to go. By the way, I think you are asking a lot at once and perhaps getting confused. Try to narrow your scope to the large-scale tasks, then focus on smaller issues when they arise (rather than getting caught up on hypothetical issues). We are happy to help, but consider voting and trying out the analysis first; that way you will develop more specific questions and we can better help.
Indeed, I'll upvote your answer for the great help provided. I had not done so yet because I still had questions. I'm going to start with what we have discussed. Thank you.
I had always thought that RepeatMasker is for annotating repeats from an assembly.
Oh yes, agreed. I had thought that the genome was already annotated and that RepBase has libraries.
Wouldn't a CNV tool be feasible?
Yes, once you have an assembly in FASTA format, run RepeatMasker.
Is there a paper on Transposome? I can't find it. I would like to see any sensitivity or specificity values and how reliable it is.
The paper is in review and hopefully I can send a citation very soon.
The paper is published online in the journal Bioinformatics and it can be downloaded for free at the moment (advance access).