Question

Scan a genome for the ocurrence of a sequence motif in the form ...CC[CT]C[AC]GCCG[CT]CC...

2

Entering edit mode

9.7 years ago

qwerty ▴ 50

Hello,

I have as output from MEME several motifs of length ~30 bp, which have the form ...CC[CT]C[AC]GCCC[CT]C... (for example). I want to search the genome to find all occurrences of such a motif.

I previously tried finding the exact motif with Bowtie - this works in just several seconds, but I have to fix all nucleotides and allow just up to two mismatches using this strategy. What would be a better (still quick) solution?

Thanks!

ChIP-Seq alignment motif genome • 3.4k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by qwerty ▴ 50

Ram · Answer 1 · 2015-02-20

I wrote a simple script, fastaRegexFinder.py, to look for regular expressions in a fasta file, the format you have should be ready to go. However, it looks for one regex pattern at a time. You could run one motif at a time, serially or in parallel, or chain several motifs together, like 'AC[TA]G|CC[TA]G|ACG'. (I should edit it to take multiple regexes...)

This is from the script's help:

fastaRegexFinder.py -h
usage: fastaRegexFinder.py [-h] --fasta FASTA [--regex REGEX] [--matchcase]
                           [--noreverse] [--maxstr MAXSTR]
                           [--seqnames SEQNAMES [SEQNAMES ...]] [--quiet]
                           [--version]

DESCRIPTION

    Search a fasta file for matches to a regex and return a bed file with the
    coordinates of the match and the matched sequence itself.

    By default, fastaRegexFinder.py searches for putative G-quadruplexes on forward
    and reverse strand using the quadruplex rule described at
    http://en.wikipedia.org/wiki/G-quadruplex#Quadruplex_prediction_techniques.

    The default regex is '([gG]{3,}\w{1,7}){3,}[gG]{3,}' and along with its
    complement they produce the same output as in
    http://www.quadruplex.org/?view=quadbaseDownload

    Output bed file has columns:
    1. Name of fasta sequence (e.g. chromosome)
    2. Start of the match
    3. End of the match
    4. ID of the match
    5. Length of the match
    6. Strand
    7. Matched sequence as it appears on the forward strand

    For matches on the reverse strand it is reported the start and end position on the
    forward strand and the matched string on the forward strand (so the G4 'GGGAGGGT'
    present on the reverse strand is reported as ACCCTCCC).

    Note: Fasta sequences (chroms) are read in memory one at a time along with the
    matches for that chromosome.
    The order of the output is: chroms as they are found in the inut fasta, matches
    sorted within chroms by positions.

EXAMPLE:
    ## Test data:
    echo '>mychr' > /tmp/mychr.fa
    echo 'ACTGnACTGnACTGnTGAC' >> /tmp/mychr.fa

    fastaRegexFinder.py -f /tmp/mychr.fa -r 'ACTG'
        mychr    0    4    mychr_0_4_for    4    +    ACTG
        mychr    5    9    mychr_5_9_for    4    +    ACTG
        mychr    10    14    mychr_10_14_for    4    +    ACTG

    fastaRegexFinder.py -f /tmp/mychr.fa -r 'ACTG' --maxstr 3
        mychr    0    4    mychr_0_4_for    4    +    ACT[3,4]
        mychr    5    9    mychr_5_9_for    4    +    ACT[3,4]
        mychr    10    14    mychr_10_14_for    4    +    ACT[3,4]

    less /tmp/mychr.fa | fastaRegexFinder.py -f - -r 'A\w\wGn'
        mychr    0    5    mychr_0_5_for    5    +    ACTGn
        mychr    5    10    mychr_5_10_for    5    +    ACTGn
        mychr    10    15    mychr_10_15_for    5    +    ACTGn

DOWNLOAD
    fastaRegexFinder.py is hosted at http://code.google.com/p/bioinformatics-misc/



optional arguments:
  -h, --help            show this help message and exit
  --fasta FASTA, -f FASTA
                        Input fasta file to search. Use '-' to read the file from stdin.


  --regex REGEX, -r REGEX
                        Regex to be searched in the fasta input.
                        Matches to the reverse complement will have - strand.
                        The default regex is '([gG]{3,}\w{1,7}){3,}[gG]{3,}' which searches
                        for G-quadruplexes.                                   

  --matchcase, -m       Match case while searching for matches. Default is
                        to ignore case (I.e. 'ACTG' will match 'actg').

  --noreverse           Do not search the reverse complement of the input fasta.
                        Use this flag to search protein sequences.                                   

  --maxstr MAXSTR       Maximum length of the match to report in the 7th column of the output.
                        Default is to report up to 10000nt.
                        Truncated matches are reported as <ACTG...ACTG>[<maxstr>,<tot length>]

  --seqnames SEQNAMES [SEQNAMES ...], -s SEQNAMES [SEQNAMES ...]
                        List of fasta sequences in --fasta to
                        search. E.g. use --seqnames chr1 chr2 chrM to search only these crhomosomes.
                        Default is to search all the sequences in input.

  --quiet, -q           Do not print progress report (i.e. sequence names as they are scanned).                                   

  --version, -v         show program's version number and exit

Ram · Answer 2 · 2015-02-20

1

Entering edit mode

9.7 years ago

Alex Reynolds 35k

If better is faster, depending on how many string combinations you have, this could be an easily parallelized problem - you generate all combinations of target strings, then do an exact, no-mismatch BLAT search on each target, on its own node.

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Alex Reynolds 35k

Ram · Answer 3 · 2015-02-20

1

Entering edit mode

9.7 years ago

Vivek Todur ▴ 60

Hi,

There are many tools available for the achieving the purpose, You can simply use fuzznuc from emboss package, which also supports mismatches. Apart from that, MAST and FIMO are available, both belonging to the same MEME package.

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Vivek Todur ▴ 60