I have used Tandem Repeats Finder (TRF) to search for tandem repeats in my FASTA files.
The output looks like this:
Sequence: ENSG01
Parameters: 2 5 7 80 10 50 2000
1053 1139 4 22.2 4 67 2 62 28 4 48 18 1.68 GAGT GAGAGAGTGGGTTAGAGAGTGAGTGAGCCAGTGAATGAGTGAGTG
1069 1137 20 3.6 19 74 7 71 28 5 47 17 1.70 GAGTGAGTGAGCCAGGAGT GAGTGAGTGAGCCAGTGAATGAGTGAGTG
1619 1746 8 16.8 8 65 9 60 27 1 52 18 1.55 GAGT GAGTGAGTGAGTGAATGAGTGAATGGGAGT
Sequence: ENSG02
Parameters: 2 5 7 80 10 50 2000
Example explanation:
Sequence (ENSG01) - fasta name
Column 14 (GAGT) - repeat unit
Column 15 (GAGAGAGTGGG) - repeat sequence
Help I need:
How to process such a file:
- Remove sequences that don't have repeats (like ENSG02)? Combine the fasta name with the repeat data that follows it?
The desired output looks like this:
ENSG01 1053 1139 4 22.2 4 67 2 62 28 4 48 18 1.68 GAGT GAGAGAGTGGGTTAGAGAGTGAGTGAGCCAGTGAATGAGTGAGTG 1069 1137 20 3.6 19 74 7 71 28 5 47 17 1.70 GAGTGAGTGAGCCAGGAGT GAGTGAGTGAGCCAGTGAATGAGTGAGTG 1619 1746 8 16.8 8 65 9 60 27 1 52 18 1.55 GAGT GAGTGAGTGAGTGAATGAGTGAATGGGAGT
I guess it's possible to grep '^[0-9]', but I don't know how to join such grep output with the fasta name.
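For illustration, the joining step could be sketched in Python as follows. This is only a minimal sketch under assumptions taken from the example above: every record starts with a "Sequence:" line, repeat lines start with a digit, and everything else (such as "Parameters:") can be ignored. The function name collapse_trf and the embedded sample are hypothetical.

```python
def collapse_trf(lines):
    """Yield one 'name repeat1 repeat2 ...' line per sequence that has repeats.

    Assumes the layout shown in the example: a 'Sequence: <name>' header,
    then zero or more repeat lines that start with a digit.
    """
    name, repeats = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith("Sequence:"):
            # A new record begins: emit the previous one only if it had repeats.
            if name is not None and repeats:
                yield " ".join([name] + repeats)
            name = line[len("Sequence:"):].strip()
            repeats = []
        elif line and line[0].isdigit():
            # Repeat line (same idea as grep '^[0-9]').
            repeats.append(line)
    # Emit the last record, again only if it had repeats.
    if name is not None and repeats:
        yield " ".join([name] + repeats)


# Small made-up sample in the same layout (repeat sequences shortened).
sample = [
    "Sequence: ENSG01",
    "Parameters: 2 5 7 80 10 50 2000",
    "1053 1139 4 22.2 4 67 2 62 28 4 48 18 1.68 GAGT GAGAG",
    "1619 1746 8 16.8 8 65 9 60 27 1 52 18 1.55 GAGT GAGTG",
    "Sequence: ENSG02",
    "Parameters: 2 5 7 80 10 50 2000",
]

for record in collapse_trf(sample):
    print(record)
```

To run it on a real file, the list could be replaced by an open file handle, since the function only iterates over lines. Sequences without repeats, like ENSG02, are dropped because nothing is emitted for an empty record.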
My hypothesis is that tandem repeats have a DNA-motif-like structure. How can I use a repeat unit (like GAGT) to search for occurrences of such a motif genome-wide? At the moment my plan is:
- Every unit (e.g. GAGT) has its sequence of occurrences (e.g. GAGAGAGTGGGTTAGAGAGTGAGTGAGCCAGTGAATGAGTGAGTG)
- Submit GAGAGAGTGGGTTAGAGAGTGAGTGAGCCAGTGAATGAGTGAGTG to MEME and get a PSPM
- Scan the genome with the PSPM
I am using MEME because I don't know how to scan for a given unit (GAGT) while allowing mismatches.
What should I do with redundant repeat units, for example GAGT and GAGTGAGTGAGCCAGGAGT? Should I use only the shortest unit, like GAGT, or all units even if they overlap?
Thank you for your time.