I'm working on the classification of transposable elements and want to retrieve the sequences of each sub-class into separate files. Is there any code or tool that can separate them by sub-family? The dataset contains thousands of sequence entries for different species.
I really appreciate any help or suggestion!
DATASET SOURCE: https://pgsb.helmholtz-muenchen.de/plant/recat/index.jsp
For example:
I want to write the RLC sequences to their own file, and likewise for other entries such as RLX and TXX:
>RLC_163294|LTR_Gr_chr_04_982|LTR/Copia|02.01.01.05|29730|Gossypium
tgttagagtagttagtaaagttgttagtagttaaaactgttgtacgttcagttaacagttgagctgttaaatagttgacctgttagttatgcattcatttgagtataaaactatgagaagtctgtacttaaagatatgagttttataatgaagaaattctaagtctttgtttttaagctgcttgtttagcttaacatggtatcag
>RLX_163369|LTR_Gr_chr_10_2326|LTR|02.01.01|29730|Gossypium
tgtcacgggcaaaagtgcaaagcccgtgaccatggcataagatgtgccccatggaggtctatcgattagacaaggaacatttagcccacgagaacttgcccgattcaaaaaactgttggagaagcctgtcagattgaagcctggttggcccgataatgaagacgtggcaacttaggccaattttggt
>TXX_174935|TXX_Gr_DX404975.1_8351|MobileElement|02|29730|Gossypium
atccgtgcccatgccatgtcccagacatggtcttatgggggactctcatctcggtgccaacgccatatcccagacatggtcttacatgggacctctcataatctcaattatgccaatgccatgtcccagacatggtcttacatgggatctctttacccaaatatcatgacatttgtatccattacattcccaatgtttcaacggggcttttatcactgattctctgtcatctcatacttgagttaacattagatattttcatgaaataaatacataattgctggaaaatagcagcattaa
With awk (not tested on a large dataset, so please back up your data first; this assumes each sequence is on a single line):
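A minimal awk sketch of this kind of split, not a tested production command. It assumes the subfamily code is the three uppercase letters before the first underscore in the header (as in RLC_163294), and it wraps the awk call in a hypothetical helper named split_by_subfamily. The input name test.fa and the output file names (RLC.fa and so on) are also assumptions made for illustration.

```shell
# split_by_subfamily INPUT.fa OUTDIR
# Hypothetical helper: writes each FASTA record to OUTDIR/<CODE>.fa, where <CODE>
# is the three uppercase letters before the first underscore in the header.
split_by_subfamily() {
    input=$1
    outdir=$2
    mkdir -p "$outdir"
    LC_ALL=C awk -v dir="$outdir" '
        /^>/ {
            # Header line: pick the output file from the three-letter prefix.
            if (match($0, /^>[A-Z][A-Z][A-Z]_/))
                out = dir "/" substr($0, 2, 3) ".fa"
            else
                out = dir "/unclassified.fa"   # headers that do not fit the pattern
        }
        # Write the header and every following sequence line to the current file.
        out != "" { print > out }
    ' "$input"
}

# Usage (assumed file and directory names):
# split_by_subfamily test.fa new_files
```

Because the output file only changes when a new header is seen, sequence lines that wrap over several lines stay with their record. Headers that do not match the three-letter pattern go to unclassified.fa instead of being dropped silently.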
With seqkit (works for multi-line sequences and writes each sequence on a single line):
Output files will be stored in a new directory "new_files" and are named test.id_RLC.fa, test.id_RLX.fa, and test.id_TXX.fa. To remove the test.id_ prefix from all the files, run rename -n 's/test\.id_//' *.fa in the new_files folder. With the Perl version of rename, -n only previews the changes; drop it to actually rename the files.
Please note that the regex is deliberately strict, to be on the safe side. If your subfamily IDs start with more than three letters, adjust it accordingly.