Quadruplex sequence batch prediction
9.6 years ago
Qi Zhao ▴ 50

Is there a prediction tool that can identify quadruplex sequences in a set of FASTA sequences? I have tried the links listed on the wiki, but the first and the last one cannot be accessed, while QGRS Mapper only works on a single sequence.

Quadruplex • 9.2k views

pqsfinder is an R package that can analyze a batch of sequences. It allows for bulges and mismatches, and the latest default settings have been calibrated against published G4-seq data; see here and here.

Example:

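library(Biostrings)  # provides readDNAStringSet()
library(pqsfinder)
dna <- readDNAStringSet(filepath="sequences.fa")  # all sequences in the FASTA file
pqs <- lapply(dna, pqsfinder)  # one set of predictions per sequence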

and possibly, to combine the results into a single data frame:

library(qdapTools)
pqsdf <- list_df2df(lapply(lapply(pqs,ranges),as.data.frame))

Hi Qi Zhao and dariober, can you please help me with my intern project? I have to count the total number of matches to this regular expression: ([gG]{3,}\w{1,7}){3,}[gG]{3,} Thank you


Hi Isha, I think you need to be much more specific with your problem description. Also, an intern project should involve a solid attempt to understand and formulate the problem. With all due respect, it doesn't seem you have dedicated nearly enough effort to accomplish that yet. Sure, we are going to help you, but we are not going to do the whole job for you.

9.6 years ago

Hi, the G-quadruplex sequence as defined in quadparser is effectively a regular expression, namely this one: ([gG]{3,}\w{1,7}){3,}[gG]{3,} This means: look for a run of 3 or more G followed by 1 to 7 bases of any kind (\w also matches G), this whole unit repeated 3 or more times, and ending with a run of 3 or more G.
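
To make the pattern concrete, here is a minimal Python sketch that applies the same regular expression to a short string. The toy sequence and the helper name find_g4 are made up for illustration; this is not how quadparser itself is implemented.

```python
import re

# The quadparser-style pattern quoted above.
G4_REGEX = re.compile(r'([gG]{3,}\w{1,7}){3,}[gG]{3,}')


def find_g4(seq):
    """Return (start, end, matched_text) for each non-overlapping match.

    Coordinates are 0-based and half-open, as in BED files.
    """
    return [(m.start(), m.end(), m.group(0)) for m in G4_REGEX.finditer(seq)]


# Hypothetical toy sequence: four G-runs separated by loops of 1-2 bases.
toy = "ATGGGAGGGTTGGGCAGGGAT"
for start, end, text in find_g4(toy):
    print(start, end, text)
```

The match starts at the first G-run and ends at the last one, so the flanking AT on either side is not reported.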

I have written a simple program to match regular expressions in a fasta file. It's here: fastaRegexFinder.py. The default regular expression is the G-quadruplex one above. Example:

fastaRegexFinder.py -f ecoli_CP001509.3.fa -r '([gG]{3,}\w{1,7}){3,}[gG]{3,}'

Sample output:

CP001509.3    57421    57449    CP001509.3_57421_57449_for    28    +    GGGGAGTTGGGGGAATAAGGGCGGAGGG
CP001509.3    92396    92428    CP001509.3_92396_92428_for    32    +    GGGAAATTGTGGGGCAAAGTGGGAATAAGGGG
CP001509.3    101287    101310    CP001509.3_101287_101310_for    23    +    GGGCTGGGTGATGGGCTCGCGGG
CP001509.3    114116    114145    CP001509.3_114116_114145_for    29    +    GGGAAGGGGAGCCGTGGGGTAAAGAAGGG
CP001509.3    167410    167431    CP001509.3_167410_167431_for    21    +    GGGAGAGGGCCGGGGTGAGGG
CP001509.3    167449    167470    CP001509.3_167449_167470_rev    21    -    CCCTCACCCTAACCCTCTCCC
CP001509.3    217985    218009    CP001509.3_217985_218009_rev    24    -    CCCGACGACCCACGCGGCCCACCC
...

Help:

fastaRegexFinder.py -h
usage: fastaRegexFinder.py [-h] --fasta FASTA [--regex REGEX] [--matchcase]
                           [--noreverse] [--maxstr MAXSTR]
                           [--seqnames SEQNAMES [SEQNAMES ...]] [--quiet]
                           [--version]

DESCRIPTION

    Search a fasta file for matches to a regex and return a bed file with the
    coordinates of the match and the matched sequence itself.

    By default, fastaRegexFinder.py searches for putative G-quadruplexes on forward
    and reverse strand using the quadruplex rule described at
    http://en.wikipedia.org/wiki/G-quadruplex#Quadruplex_prediction_techniques.

    The default regex is '([gG]{3,}\w{1,7}){3,}[gG]{3,}' and along with its
    complement they produce the same output as in
    http://www.quadruplex.org/?view=quadbaseDownload

    Output bed file has columns:
    1. Name of fasta sequence (e.g. chromosome)
    2. Start of the match
    3. End of the match
    4. ID of the match
    5. Length of the match
    6. Strand
    7. Matched sequence as it appears on the forward strand

    For matches on the reverse strand it is reported the start and end position on the
    forward strand and the matched string on the forward strand (so the G4 'GGGAGGGT'
    present on the reverse strand is reported as ACCCTCCC).

    Note: Fasta sequences (chroms) are read in memory one at a time along with the
    matches for that chromosome.
    The order of the output is: chroms as they are found in the inut fasta, matches
    sorted within chroms by positions.

EXAMPLE:
    ## Test data:
    echo '>mychr' > /tmp/mychr.fa
    echo 'ACTGnACTGnACTGnTGAC' >> /tmp/mychr.fa

    fastaRegexFinder.py -f /tmp/mychr.fa -r 'ACTG'
        mychr    0    4    mychr_0_4_for    4    +    ACTG
        mychr    5    9    mychr_5_9_for    4    +    ACTG
        mychr    10    14    mychr_10_14_for    4    +    ACTG

    fastaRegexFinder.py -f /tmp/mychr.fa -r 'ACTG' --maxstr 3
        mychr    0    4    mychr_0_4_for    4    +    ACT[3,4]
        mychr    5    9    mychr_5_9_for    4    +    ACT[3,4]
        mychr    10    14    mychr_10_14_for    4    +    ACT[3,4]

    less /tmp/mychr.fa | fastaRegexFinder.py -f - -r 'A\w\wGn'
        mychr    0    5    mychr_0_5_for    5    +    ACTGn
        mychr    5    10    mychr_5_10_for    5    +    ACTGn
        mychr    10    15    mychr_10_15_for    5    +    ACTGn

DOWNLOAD
    fastaRegexFinder.py is hosted at http://code.google.com/p/bioinformatics-misc/



optional arguments:
  -h, --help            show this help message and exit
  --fasta FASTA, -f FASTA
                        Input fasta file to search. Use '-' to read the file from stdin.


  --regex REGEX, -r REGEX
                        Regex to be searched in the fasta input.
                        Matches to the reverse complement will have - strand.
                        The default regex is '([gG]{3,}\w{1,7}){3,}[gG]{3,}' which searches
                        for G-quadruplexes.                                   

  --matchcase, -m       Match case while searching for matches. Default is
                        to ignore case (I.e. 'ACTG' will match 'actg').

  --noreverse           Do not search the reverse complement of the input fasta.
                        Use this flag to search protein sequences.                                   

  --maxstr MAXSTR       Maximum length of the match to report in the 7th column of the output.
                        Default is to report up to 10000nt.
                        Truncated matches are reported as <ACTG...ACTG>[<maxstr>,<tot length>]

  --seqnames SEQNAMES [SEQNAMES ...], -s SEQNAMES [SEQNAMES ...]
                        List of fasta sequences in --fasta to
                        search. E.g. use --seqnames chr1 chr2 chrM to search only these crhomosomes.
                        Default is to search all the sequences in input.

  --quiet, -q           Do not print progress report (i.e. sequence names as they are scanned).                                   

  --version, -v         show program's version number and exit
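
The reverse-strand reporting described in the help text (start, end and matched string all given on the forward strand) can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code of fastaRegexFinder.py: the function names are hypothetical, and the reverse strand is handled here by searching the reverse complement and converting coordinates back.

```python
import re

G4_REGEX = re.compile(r'([gG]{3,}\w{1,7}){3,}[gG]{3,}')
# Complement table for plain DNA letters; other characters (e.g. N) are left as-is.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def search_both_strands(name, seq, regex=G4_REGEX):
    """Return BED-like rows for matches on the forward and reverse strand.

    Hypothetical helper: rows mimic the 7 columns listed in the help text.
    """
    rows = []
    for m in regex.finditer(seq):
        s, e = m.start(), m.end()
        rows.append((name, s, e, f"{name}_{s}_{e}_for", e - s, "+", seq[s:e]))
    n = len(seq)
    for m in regex.finditer(revcomp(seq)):
        # Convert coordinates on the reverse complement to forward-strand coordinates.
        s, e = n - m.end(), n - m.start()
        rows.append((name, s, e, f"{name}_{s}_{e}_rev", e - s, "-", seq[s:e]))
    # Sort matches by position within the sequence.
    return sorted(rows, key=lambda r: (r[1], r[2]))


# Made-up C-rich sequence: the G4 motif sits on the reverse strand.
for row in search_both_strands("mychr", "TTCCCTCCCACCCTTCCCAA"):
    print("\t".join(str(x) for x in row))
```

For the toy input the only hit is on the minus strand, and the seventh column shows the C-rich forward-strand sequence, in the same way as the _rev rows in the sample output above.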

Thanks a lot, this is very useful, I will try it right now


Hi dariober, can you explain why the sequence of 3 or more G's followed by 1-7 of any other base can be repeated more than 3 times? That is, why is the regular expression ([gG]{3,}\w{1,7}){3,}[gG]{3,} instead of ([gG]{3,}\w{1,7}){3,3}[gG]{3,}? It seems like this was the behavior of quadparser, but I can't figure out the reasoning. Most of the references I have found only define a G-quadruplex sequence to include 4 G groups total.
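
For reference, the practical difference between the two quantifiers can be checked with a small Python sketch. The sequences are made up, and the sketch only shows how the two patterns behave; it says nothing about the reasoning behind quadparser's choice.

```python
import re

OPEN_ENDED = re.compile(r'([gG]{3,}\w{1,7}){3,}[gG]{3,}')     # 4 or more G-runs
EXACTLY_FOUR = re.compile(r'([gG]{3,}\w{1,7}){3,3}[gG]{3,}')  # nominally 4 G-runs


def spans(regex, seq):
    """Return the (start, end) span of every non-overlapping match."""
    return [m.span() for m in regex.finditer(seq)]


# Five G-runs separated by 5-base loops: a loop cannot absorb a G-run here,
# because that would need a loop of 5 + 3 + 5 = 13 bases (more than 7).
long_loops = "GGGAAAAAGGGAAAAAGGGAAAAAGGGAAAAAGGG"
print(spans(OPEN_ENDED, long_loops))    # one match covering all five runs
print(spans(EXACTLY_FOUR, long_loops))  # match stops after the fourth run

# Five G-runs separated by 1-base loops: since \w also matches G, a "loop"
# can swallow a whole G-run, so both patterns cover the full sequence.
short_loops = "GGGAGGGAGGGAGGGAGGG"
print(spans(OPEN_ENDED, short_loops))
print(spans(EXACTLY_FOUR, short_loops))
```

So {3,3} only restricts the match to four G-runs when the loops are too long to contain a G-run; with short loops both patterns can report the same span.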
