Count repeat sequence
2
0
Entering edit mode
8.4 years ago
sacha ★ 2.4k

Hi,

I just wonder how can I quantify the amount of repeat sequence across a fasta file . For exemple : actgtcccgtaaaACTACTACTACTACTACTACTatcgtcgtcg
This sequence contains a repeat of ACT in more than 40% of the total sequence .

I m looking on fourrier transformation and entropy . Do you have other idea ?

Thanks

repeat dna fasta algorithm • 4.0k views
ADD COMMENT
0
Entering edit mode

If you only want the counts then sed 's/ACT/ACT\n/gi' file | grep -c 'ACT'

ADD REPLY
0
Entering edit mode

Or shorter using perl ;-)

perl -ne 'print $_=~y/ACT//'

or if you want the percentage:

perl -ne '$l=length($_);$m=$_=~y/ACT//; print $m/$l*100'
ADD REPLY
0
Entering edit mode

This is working for this case. What I want is to know for different case. For exemple, how much repeat region they are in the human genom

ADD REPLY
0
Entering edit mode

Hasn't that been done already (repeatmasker)?

ADD REPLY
0
Entering edit mode

yes ! But I want to it myself :D ... for the fun !

ADD REPLY
1
Entering edit mode
8.4 years ago

You can look at Shannon entropy quite quickly with BBMask, part of the BBMap package. e.g.

bbmask.sh in=genome.fa window=80 entropy=0.70 ke=5

This will tell you the fraction of the genome with entropy below 0.7 when using a window of 80 and 5-mers. The code for the entropy masking of a sequence is in /bbmap/current/jgi/BBMask.java at lines 1067-1144 and the entropy calculation over a window is very short, at 1158-1168.

ADD COMMENT
0
Entering edit mode
8.4 years ago
sacha ★ 2.4k

http://www.repeatmasker.org/ is the tool for doing this. I will probably find my answer there .

ADD COMMENT
1
Entering edit mode

I've never looked into how repeat masker works - i'm curious if it works by finding over-represented kmers in a genome, then applies what it found to the genome tracks (for repeat regions), or if it reads the DNA as a stream of information and then decides if something is a repeat as it goes.

For example, imagine "AACCGGTT" only turns up 10 in a whole genome, however it just so happens to occur in:

"gggggAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTAACCGGTTggggg"

Making it a repeat in this little snippit of DNA, but not a significant kmer for the X-billion bp genome, and thus maybe not called as a repeat in this snippit either.

ADD REPLY
1
Entering edit mode

I don't know about RepeatMasker, but BBMask will call it a repeat in that snippet if you use the maskrepeats flag (as opposed to, or in addition to, the maskentropy flag). Specifically:

maskrepeats=f       (mr) Mask areas covered by exact repeat kmers.
kr=5                Kmer size to use for repeat detection (1-15).  Use minkr and maxkr to sweep a range of kmers.
minlen=40           Minimum length of repeat area to mask.

So, by default, if there are at least 40 consecutive bases made of a repeating 5-mer, it would be masked. In your example, which I named "x.fa":

bbmask.sh in=x.fa mr=t me=f minlen=16 k=8
java -ea -Xmx89571m -Xms89571m -cp /usr/common/jgi/utilities/bbtools/prod-v36.09/lib/BBTools.jar jgi.BBMask in=x.fa mr=t me=f minlen=16 k=8
Executing jgi.BBMask [in=x.fa, mr=t, me=f, minlen=16, k=8]

Loading input
Loading Time:                   0.022 seconds.

Masking repeats (to disable, set 'mr=f')
Repeat Masking Time:            0.010 seconds.
Ref Bases:                        90    0.01m bases/sec
Repeat Bases Masked:              80

Converting masked bases to N
Done Masking
Conversion Time:                0.000 seconds.

Total Bases Masked:               80/90 88.889%
Total Time:                     0.067 seconds.

I changed K to 8 because the repeat is 8 bp long. The default of 5 does not mask anything in that case. But since you don't know the length of the repeat kmers ahead of time, you can define a kmer sweep, such as "k=4-9" and it will test all of them.

ADD REPLY
2
Entering edit mode

That's pretty awesome - and a much better way of doing it than the first method :) The BB suite really seems to cover all the bases (no pun intended!) :)

I'd consider this a good application for the ACGTrie method of counting kmers in a trie/DAG so you don't have to specify a kmer before hand, but it would make things a lot more complicated if you had to prune the trie for every snippit than a kmer table where the oldest kmers are dropped as new ones come in.

ADD REPLY
0
Entering edit mode

I generally try to avoid tries or linked lists because they have a lot of memory random-accesses. The method I use is brute-force and does not even use a kmer table; it's O(N*W/K) where N is the number of bases, W is the window size, and K is the kmer length. But it's very fast in practice because it does not do any memory random-access, just serial access of the bytes in the sequence. The entropy-masking does make use of a kmer table, though, dropping the oldest kmer as a new one comes in. It's O(N).

ADD REPLY
0
Entering edit mode

Yeah, trie's certainly aren't fast compared to the kmer table with all that random memory access, however they can be faster if you're doing a range of kmer sizes as you can add/check many kmers sizes in a single walk. But clearly the bruteforce method you've settled upon is the fastest which these days is probably the most important thing, since cores and memory per core is going up, and speed of the RAM access has stayed relatively the same. I feel like their could be a whole "how to do bioinformatics" tutorial series written using entirely the BB suite and a few other select tools. I think thats a really great thing :) Thank you for taking the time to write this software Brian.

ADD REPLY
0
Entering edit mode

You will also find repeatmasked versions of human genome at UCSC (and other places).

ADD REPLY

Login before adding your answer.

Traffic: 2192 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6