Hi,
I have a RepeatMasker output file for a new organism (a crustacean), and I want to use the transposable elements in this species to analyse piRNAs, using my own short-read sequencing data.
The problem is that I would like one consensus sequence per transposable element in this species, rather than every position in the genome where a transposon occurs. If the same transposon exists in 100 copies in the genome, it shows up 100 times in the RepeatMasker output.
Ideally I would like to end up with a multi-FASTA file like the ones in Repbase, but I am a bit lost about how to use the RepeatMasker output to achieve this.
Any suggestions would be very helpful! Thanks
I think the easiest way would be to convert the coordinates to a BED file and then use bedtools to extract the sequences from the genome FASTA. Once you have the FASTA sequences, you can build a consensus.
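For the coordinate conversion, here is a minimal Python sketch. It assumes the standard whitespace-separated RepeatMasker `.out` layout (score, divergence columns, query sequence, begin, end, left, strand, repeat name, class/family, ...). The function name and the `name#class/family` label are my own choices for illustration, not something RepeatMasker or bedtools requires.

```python
def repeatmasker_out_to_bed(lines):
    """Yield BED6 rows (tuples) from the lines of a RepeatMasker .out file.

    Assumed column layout (0-based field index):
    0 score, 4 query sequence, 5 begin, 6 end, 8 strand ('+' or 'C'),
    9 repeat name, 10 repeat class/family.
    """
    for line in lines:
        fields = line.split()
        # Header and blank lines are skipped: data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom = fields[4]
        begin, end = int(fields[5]), int(fields[6])
        # RepeatMasker marks the complementary strand with 'C'.
        strand = "-" if fields[8] == "C" else "+"
        # Hypothetical label: repeat name plus class/family, joined with '#'.
        name = f"{fields[9]}#{fields[10]}"
        # RepeatMasker coordinates are 1-based inclusive; BED is 0-based half-open.
        yield (chrom, begin - 1, end, name, fields[0], strand)


def write_bed(out_path, bed_path):
    """Convert a RepeatMasker .out file to a tab-separated BED file."""
    with open(out_path) as src, open(bed_path, "w") as dst:
        for row in repeatmasker_out_to_bed(src):
            dst.write("\t".join(str(x) for x in row) + "\n")
```

The BED file can then go to something like `bedtools getfasta -fi genome.fa -bed repeats.bed -s -name -fo repeats.fa` (file names are placeholders). How the name ends up in the FASTA header differs between bedtools versions, so it is worth checking the output headers.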
Thanks for the comment. I have already extracted the FASTA sequences. I guess the way forward would be to do some sort of clustering on the sequences, but I am not sure about that.
You should have the name of each repeat in the RepeatMasker output. You can start with that: group the sequences by repeat name, then get a consensus for each group.
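A rough sketch of that grouping step in Python, under two assumptions that may not match your files: the FASTA headers start with the repeat name followed by `#` (for example `>MER7A#DNA/MER2_type::chr1:0-4(+)`), and each group has already been aligned to equal length with an external multiple-alignment tool before the consensus is called. Both function names are made up for illustration, and the majority-rule consensus is just one simple way of doing it.

```python
from collections import Counter, defaultdict


def group_by_repeat_name(fasta_lines):
    """Group sequences by the header text before the first '#'.

    If a header has no '#', the whole header is used as the group key.
    """
    groups = defaultdict(list)
    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                groups[name].append("".join(chunks))
            name = line[1:].split("#", 1)[0]
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        groups[name].append("".join(chunks))
    return dict(groups)


def majority_consensus(aligned, gap="-"):
    """Majority-rule consensus of equal-length, already aligned sequences.

    Columns where the gap character is the most common symbol are dropped.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        base, _count = Counter(b.upper() for b in column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)
```

Genomic copies of a transposon are often truncated or diverged, so the raw sequences in a group will usually differ in length. That is why the alignment step has to happen between the grouping and the consensus call.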