Software To Search Nucleotide Sequence Data By Regular Expressions
13.2 years ago
User 4391 ▴ 100

Hi,

I want to find where a short nucleotide string occurs in reference sequence data, and I want to be able to search with wildcards.

For example

Sequence data:     AGCTAGCTAGGCTAGCGGCTTTGGCGCCTAGCCAGA
Search nucleotide: TAGC
Or wildcard:       TA*C, TA#C, TANC
Result:            the positions where the search nucleotide or wildcard pattern occurs in the reference sequence.

I only know of one program, "genome traveler". Does anyone know of other software for this? If so, please let me know. Thank you so much.

13.2 years ago
Michael 55k

This is a basic regular expression search, and many programming languages implement regular expressions: Java, Perl, Python, R, awk... The regexp pattern you describe is TA.C, where "." matches any single character.

EMBOSS also has ready-to-use tools for this, which are possibly better and more reliable than a self-made solution: dreg and fuzznuc (fuzzy search). See here: http://manuals.bioinformatics.ucr.edu/home/emboss#searching
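
To illustrate the TA.C pattern, here is a minimal Python sketch using the standard re module on the sequence from the question. The variable names are my own, and wrapping the pattern in a lookahead is an assumption on my part so that overlapping hits would also be reported. Positions are 1-based.

```python
import re

# Sequence from the question; "." in the pattern matches any single base.
sequence = "AGCTAGCTAGGCTAGCGGCTTTGGCGCCTAGCCAGA"

# The lookahead (?=...) consumes no characters, so overlapping hits are
# reported too. This wrapper is an illustrative choice, not a requirement.
pattern = re.compile(r"(?=(TA.C))")

# Collect (1-based start position, matched text) for every hit.
hits = [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]

for position, text in hits:
    print(position, text)
```

For this sequence the sketch reports TAGC at positions 4, 13 and 29. It only scans the strand as written; searching the reverse complement would need an extra step.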

13.2 years ago

The EMBOSS package contains a tool named dreg:

This searches for matches of a regular expression to a nucleic acid sequence.

A regular expression is a way of specifying an ambiguous pattern to search for. Regular expressions are commonly used in some computer programming languages and may be more familiar to some users than to others.

13.2 years ago

Have a look at scan_for_matches - the best pattern scanner out there.

Do you have evidence that it is the best scanner out there, or is it just your opinion?

SFM is truly awesome IMHO. It is almost as fast as agrep and faster than nrgrep, but much more flexible. I have used it for many years.

7.3 years ago

If your data are in FASTA or FASTQ format (plain or gzipped), you can search nucleotide sequences by regular expression with seqkit.

Here is an example:

# Dummy data
cat > input.fasta <<'EOT'
>seq1
AAAAAAAAAAAAAAAA
>seq2
AAGCGAATCGTGTGTG
>seq3
AAGCGAATCGAATGTG
>seq4
AAGCGAATCCAATGTG
EOT

# Regular expression search in the sequences
seqkit grep -s -r -p "(G|C)A?T*A" input.fasta

# IUPAC degenerate nucleotide codes
seqkit grep -s -d -i -p RYSAA input.fasta

Meaning of the flags:

-s: search for the pattern in the sequences

-r: patterns are regular expressions

-d: pattern/motif contains degenerate bases

-i: ignore case

-p: search pattern (multiple values supported)
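
As a rough illustration of what matching with degenerate bases amounts to, the following Python sketch expands IUPAC codes into regular-expression character classes and applies the RYSAA motif to the dummy records above. This is not seqkit's implementation: the function name iupac_to_regex and the records dictionary are hypothetical, and the sketch scans only the strand as written.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate a degenerate motif into a regex string (hypothetical helper)."""
    return "".join(IUPAC[base] for base in motif.upper())


# Same dummy records as in input.fasta above.
records = {
    "seq1": "AAAAAAAAAAAAAAAA",
    "seq2": "AAGCGAATCGTGTGTG",
    "seq3": "AAGCGAATCGAATGTG",
    "seq4": "AAGCGAATCCAATGTG",
}

# re.IGNORECASE plays the role of ignoring case in the sequences.
regex = re.compile(iupac_to_regex("RYSAA"), re.IGNORECASE)

# Keep the names of records whose sequence contains the motif.
matching = [name for name, seq in records.items() if regex.search(seq)]
print(matching)
```

Here RYSAA expands to [AG][CT][CG]AA, which matches the GCGAA stretch in seq2, seq3 and seq4 but nothing in seq1.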
