Hi, the G-quadruplex sequence as defined in quadparser is effectively a regular expression, namely this one: ([gG]{3,}\w{1,7}){3,}[gG]{3,}
Which means: look for a run of 3 or more G followed by 1 to 7 of any base (note that \w also matches G), repeat this whole unit 3 or more times, and end with a run of 3 or more G.
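To make the pattern concrete, here is a minimal sketch that applies the same regular expression with Python's standard `re` module. The function name `find_g4` and the toy sequence are made up for illustration; this is not code taken from quadparser or fastaRegexFinder.py.

```python
import re

# The quadparser-style G-quadruplex pattern quoted above.
G4_REGEX = re.compile(r"([gG]{3,}\w{1,7}){3,}[gG]{3,}")

def find_g4(seq):
    """Return (start, end, matched_text) for each non-overlapping match.

    Coordinates are 0-based and half-open, as in a bed file.
    """
    return [(m.start(), m.end(), m.group(0)) for m in G4_REGEX.finditer(seq)]

# Hypothetical toy sequence: four G-runs separated by short loops,
# flanked by ATAT on both sides.
toy = "ATATGGGAGGGTTGGGCAGGGATAT"
hits = find_g4(toy)
print(hits)  # prints: [(4, 21, 'GGGAGGGTTGGGCAGGG')]
```

Note that `finditer` reports non-overlapping matches only, so nested or overlapping candidate quadruplexes inside a long G-rich stretch are reported as a single hit.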
I have written a simple program to match regular expressions in a fasta file: fastaRegexFinder.py. Its default regular expression is the G-quadruplex one above. Example:
fastaRegexFinder.py -f ecoli_CP001509.3.fa -r '([gG]{3,}\w{1,7}){3,}[gG]{3,}'
Sample output:
CP001509.3 57421 57449 CP001509.3_57421_57449_for 28 + GGGGAGTTGGGGGAATAAGGGCGGAGGG
CP001509.3 92396 92428 CP001509.3_92396_92428_for 32 + GGGAAATTGTGGGGCAAAGTGGGAATAAGGGG
CP001509.3 101287 101310 CP001509.3_101287_101310_for 23 + GGGCTGGGTGATGGGCTCGCGGG
CP001509.3 114116 114145 CP001509.3_114116_114145_for 29 + GGGAAGGGGAGCCGTGGGGTAAAGAAGGG
CP001509.3 167410 167431 CP001509.3_167410_167431_for 21 + GGGAGAGGGCCGGGGTGAGGG
CP001509.3 167449 167470 CP001509.3_167449_167470_rev 21 - CCCTCACCCTAACCCTCTCCC
CP001509.3 217985 218009 CP001509.3_217985_218009_rev 24 - CCCGACGACCCACGCGGCCCACCC
...
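If you only need totals, the output above can be summarised with a few lines of Python. This is a rough sketch that assumes the seven whitespace-separated columns shown in the sample output; the function name `count_by_strand` is hypothetical and the three records are copied from the sample.

```python
from collections import Counter

def count_by_strand(lines):
    """Count 7-column records per strand (the 6th column holds '+' or '-')."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank lines, '...' placeholders, progress messages
        counts[fields[5]] += 1
    return counts

sample = """\
CP001509.3 57421 57449 CP001509.3_57421_57449_for 28 + GGGGAGTTGGGGGAATAAGGGCGGAGGG
CP001509.3 167410 167431 CP001509.3_167410_167431_for 21 + GGGAGAGGGCCGGGGTGAGGG
CP001509.3 167449 167470 CP001509.3_167449_167470_rev 21 - CCCTCACCCTAACCCTCTCCC
...
"""
counts = count_by_strand(sample.splitlines())
print(counts["+"], counts["-"], sum(counts.values()))  # prints: 2 1 3
```

In practice you would pass the lines of the saved output file instead of the inline `sample` string.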
Help:
fastaRegexFinder.py -h
usage: fastaRegexFinder.py [-h] --fasta FASTA [--regex REGEX] [--matchcase]
[--noreverse] [--maxstr MAXSTR]
[--seqnames SEQNAMES [SEQNAMES ...]] [--quiet]
[--version]
DESCRIPTION
Search a fasta file for matches to a regex and return a bed file with the
coordinates of the match and the matched sequence itself.
By default, fastaRegexFinder.py searches for putative G-quadruplexes on forward
and reverse strand using the quadruplex rule described at
http://en.wikipedia.org/wiki/G-quadruplex#Quadruplex_prediction_techniques.
The default regex is '([gG]{3,}\w{1,7}){3,}[gG]{3,}' and along with its
complement they produce the same output as in
http://www.quadruplex.org/?view=quadbaseDownload
Output bed file has columns:
1. Name of fasta sequence (e.g. chromosome)
2. Start of the match
3. End of the match
4. ID of the match
5. Length of the match
6. Strand
7. Matched sequence as it appears on the forward strand
For matches on the reverse strand it is reported the start and end position on the
forward strand and the matched string on the forward strand (so the G4 'GGGAGGGT'
present on the reverse strand is reported as ACCCTCCC).
Note: Fasta sequences (chroms) are read in memory one at a time along with the
matches for that chromosome.
The order of the output is: chroms as they are found in the input fasta, matches
sorted within chroms by positions.
EXAMPLE:
## Test data:
echo '>mychr' > /tmp/mychr.fa
echo 'ACTGnACTGnACTGnTGAC' >> /tmp/mychr.fa
fastaRegexFinder.py -f /tmp/mychr.fa -r 'ACTG'
mychr 0 4 mychr_0_4_for 4 + ACTG
mychr 5 9 mychr_5_9_for 4 + ACTG
mychr 10 14 mychr_10_14_for 4 + ACTG
fastaRegexFinder.py -f /tmp/mychr.fa -r 'ACTG' --maxstr 3
mychr 0 4 mychr_0_4_for 4 + ACT[3,4]
mychr 5 9 mychr_5_9_for 4 + ACT[3,4]
mychr 10 14 mychr_10_14_for 4 + ACT[3,4]
less /tmp/mychr.fa | fastaRegexFinder.py -f - -r 'A\w\wGn'
mychr 0 5 mychr_0_5_for 5 + ACTGn
mychr 5 10 mychr_5_10_for 5 + ACTGn
mychr 10 15 mychr_10_15_for 5 + ACTGn
DOWNLOAD
fastaRegexFinder.py is hosted at http://code.google.com/p/bioinformatics-misc/
optional arguments:
-h, --help show this help message and exit
--fasta FASTA, -f FASTA
Input fasta file to search. Use '-' to read the file from stdin.
--regex REGEX, -r REGEX
Regex to be searched in the fasta input.
Matches to the reverse complement will have - strand.
The default regex is '([gG]{3,}\w{1,7}){3,}[gG]{3,}' which searches
for G-quadruplexes.
--matchcase, -m Match case while searching for matches. Default is
to ignore case (I.e. 'ACTG' will match 'actg').
--noreverse Do not search the reverse complement of the input fasta.
Use this flag to search protein sequences.
--maxstr MAXSTR Maximum length of the match to report in the 7th column of the output.
Default is to report up to 10000nt.
Truncated matches are reported as <ACTG...ACTG>[<maxstr>,<tot length>]
--seqnames SEQNAMES [SEQNAMES ...], -s SEQNAMES [SEQNAMES ...]
List of fasta sequences in --fasta to
search. E.g. use --seqnames chr1 chr2 chrM to search only these chromosomes.
Default is to search all the sequences in input.
--quiet, -q Do not print progress report (i.e. sequence names as they are scanned).
--version, -v show program's version number and exit
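The reverse-strand reporting and the --maxstr truncation described in the help text can be sketched as follows. This is an illustration written from the help text and its examples only, not the actual implementation of fastaRegexFinder.py; the names `revcomp`, `truncate` and `search_both_strands` are assumptions, and the truncated format follows the `ACT[3,4]` example above.

```python
import re

# Complement table for DNA; other characters (e.g. 'n') pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def truncate(match, maxstr):
    """Shorten a long match to its first maxstr characters plus [maxstr,total]."""
    if len(match) <= maxstr:
        return match
    return f"{match[:maxstr]}[{maxstr},{len(match)}]"

def search_both_strands(name, seq, pattern, maxstr=10000):
    """Return bed-like 7-column tuples for matches on both strands.

    Reverse-strand hits are found on the reverse complement, then their
    coordinates are mapped back to the forward strand, and the sequence
    is reported as it appears on the forward strand.
    """
    regex = re.compile(pattern, re.IGNORECASE)  # case-insensitive by default
    n = len(seq)
    rows = []
    for m in regex.finditer(seq):
        rows.append((name, m.start(), m.end(), "for", "+"))
    for m in regex.finditer(revcomp(seq)):
        # Position i on the reverse complement is position n - 1 - i forward.
        rows.append((name, n - m.end(), n - m.start(), "rev", "-"))
    rows.sort(key=lambda r: (r[1], r[2]))  # sort matches by position
    return [
        (chrom, start, end, f"{chrom}_{start}_{end}_{tag}", end - start, strand,
         truncate(seq[start:end], maxstr))
        for chrom, start, end, tag, strand in rows
    ]

# 'GGGAGGGT' sits on the reverse strand of this toy sequence.
print(search_both_strands("mychr", "TTACCCTCCCTT", "GGGAGGGT"))
# prints: [('mychr', 2, 10, 'mychr_2_10_rev', 8, '-', 'ACCCTCCC')]
```

Searching the reverse complement and converting coordinates keeps a single regex for both strands; an alternative would be to write a second, complemented regex and run it on the forward sequence.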
pqsfinder is an R package that can analyze a batch of sequences. It allows for bulges and mismatches, and its latest default settings have been calibrated against published G4-seq data; see here and here.
Hi Qi Zhao and dariober, can you please help me with my intern project? I have to count the total number of occurrences of this regular expression:
([gG]{3,}\w{1,7}){3,}[gG]{3,}
Thank you.
Hi Isha, I think you need to be much more specific with your problem description. Also, an intern project should involve a solid attempt to understand and formulate the problem. With all due respect, it doesn't seem you have dedicated nearly enough effort to accomplish that yet. Sure, we are going to help you, but we are not going to do the whole job for you.