Finding a sequence in a fasta file
5
2
Entering edit mode
10.4 years ago
mbk0asis ▴ 700

Hi, all!

Yesterday, I downloaded a human lncRNA fasta file from Genecode to find out if my presumably-novel-lncRNA-sequence exists in the database. I tried to find the sequence using simple linux commands like grep, and it seemed like it didn't exist. I don't want to jump into the conclusion, yet.... SO what I'm really looking for is any kinds of tools (e.g. scripts, UNIX commands, websites, and etc.) that can confirm my previous results.

FYI, I'm not capable to make my own scripts.

Thank you!

alignment sequence • 22k views
ADD COMMENT
3
Entering edit mode

Does the fasta file have linebreaks? If yes, that might be the reason for your grep not working. You could try

cat your.fasta | tr -d "\n" | grep "TAGACAT"

Also, don't forget to search the reverse-complement or alternatively just blast your seq against the fasta.

ADD REPLY
1
Entering edit mode

This will not report all instances of match.

This is a file called dummy.fa:

>123
agggaggttccatt
gaccttaggctaat
>345
gggagattcgtcta
tagatcagatcacg

The command

cat dummyfile.fa | tr -d "\n"

will give you this:

>123agggaggttccattgaccttaggctaat>345gggagattcgtctatagatcagatcacg

now when you grep a pattern it will report only one line, which means you will not know how many sequences really had this motif.

A small modification will fix this:

cat dummy.fa | tr -d "\n" | tr ">" "\n" | grep <regex>

but even this will not report multiple matches in the same sequence.

@OP what is that you really want to find

ADD REPLY
0
Entering edit mode

If you're using grep for sequences you should add grep -i

ADD REPLY
0
Entering edit mode

or run blast if you want to allow mismatches.

ADD REPLY
3
Entering edit mode
10.4 years ago

I wrote a python program fastaRegexFinder.py to look for regular expressions in fasta files. The output is in bed format, so easy to parse with other tools (unix commands, bedtools etc.). See if it's any useful...

However, if your pattern (sequence) is long enough you might use an aligner like BLAST as joe.cornish826 suggested.

Example of fastaRegexFinder:

echo '>mychr' > /tmp/mychr.fa
echo 'GTAAACACNNNNNGTAAACAANNNNTTGTTTAC' >> /tmp/mychr.fa

fastaRegexFinder.py -q -f /tmp/mychr.fa -r '[G|A]TAAA[C|T]AA?'

Output

mychr    0      7    mychr_0_7_for      7    +    GTAAACA
mychr    13    21    mychr_13_21_for    8    +    GTAAACAA
mychr    25    33    mychr_25_33_rev    8    -    TTGTTTAC

Link to fastaRegexFinder.py here

ADD COMMENT
0
Entering edit mode

Your script was very useful. Thank you very much! But it takes a bit of time when dealing with larger genomes.

Edit:

Checking a 90bp sequence in both forward and reverse of 700 Mb genome, took 1 min 48 sec.

Only forward strand took 15 sec.

ADD REPLY
0
Entering edit mode

Hi @dariober, is the fastaRegexFinder.py available to download for free? Thanks!

ADD REPLY
0
Entering edit mode

Hi- Yes it's free and I have updated the download link in my answer.

ADD REPLY
0
Entering edit mode

Hey dariober,

I am trying to use your script as below on a HPC using Slurm:

python fastaRegexFinder.py -f ragtag_nuclear_assemblyhypo_polishingragoo_scaffolds/ragtag.scaffolds.fasta -r 'ACGAAGGAACTACCCGTGTG'

It seems to run fine, I just do not know where the output file is saved. I do not get any information printed to screen apart from "Processing Chr1" etc. I have even tried to redirect the output with '>' but it also doesn't work.

Can you please let me know where I might find the output file? I have tried looking in current working directory, and the directory where the fasta file is located. I apologise if it is obvious.

Thank you, Aaron :)

ADD REPLY
0
Entering edit mode

Hi Aaron - when you say that redirecting to stdout via '>' doesn't work, do you mean you get an empty file? The simplest explanation is that nothing matches your regex and you get an empty file (this is by design). Could it be the case?

ADD REPLY
0
Entering edit mode

aaron.phillips : Did you check SLURM log files? Output may be getting saved there.

ADD REPLY
2
Entering edit mode
10.4 years ago
hpmcwill ★ 1.2k

I am guessing that you are trying to find a fixed sequence string in the set of sequences you have downloaded?

If that is the case you might find a couple of the pattern matching tools in EMBOSS helpful:

  • dreg: Regular expression search of nucleotide sequence(s)
  • fuzznuc: Search for patterns in nucleotide sequences

As with all the EMBOSS programs, these understand various sequence formats, including the fasta sequence format you downloaded the data in.

ADD COMMENT
0
Entering edit mode

dreg definitely took care of me. Thank you.

ADD REPLY
2
Entering edit mode
10.4 years ago
pld 5.1k

Just use BLAST or even BLAT to search your sequence against the know lncRNAs. All of the gencode data is mapped into ensembl. Ensembl has BLAST and BLAT support, so you can just go to their webpage to do this.

ADD COMMENT
0
Entering edit mode
10.4 years ago

Another possibility, if you have the software lying around...short read aligners like bwa and Bowtie are excellent at quickly finding matches to a lot of short sequences in large genomes. Though it may take a bit of work to get out the positions of every hit, if there are a lot of them.

ADD COMMENT
0
Entering edit mode
10.4 years ago
PoGibas 5.1k

mbk0asis, as joe.cornish826 said, you have to use blast. The question you're asking is "if my presumably-novel-lncRNA-sequence exists in the database". I want to emphasize that you're looking for lncRNA, not just sequence.

My suggestion would be: use blast in NONCODE, that would be a good start. Have in mind that there are much more in the lncRNome than Gencode. See my previous answer about lncRNAs in the human genome.

Also wanted to add that lncRNAs are not evolutionary conserved. Often it's hard to get signal using blast, so using grep would take you nowhere.

ADD COMMENT

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6