Yesterday, I downloaded a human lncRNA fasta file from Genecode to find out if my presumably-novel-lncRNA-sequence exists in the database. I tried to find the sequence using simple linux commands like grep, and it seemed like it didn't exist. I don't want to jump into the conclusion, yet.... SO what I'm really looking for is any kinds of tools (e.g. scripts, UNIX commands, websites, and etc.) that can confirm my previous results.
I wrote a python program fastaRegexFinder.py to look for regular expressions in fasta files. The output is in bed format, so easy to parse with other tools (unix commands, bedtools etc.). See if it's any useful...
However, if your pattern (sequence) is long enough you might use an aligner like BLAST as joe.cornish826 suggested.
It seems to run fine, I just do not know where the output file is saved. I do not get any information printed to screen apart from "Processing Chr1" etc. I have even tried to redirect the output with '>' but it also doesn't work.
Can you please let me know where I might find the output file? I have tried looking in current working directory, and the directory where the fasta file is located. I apologise if it is obvious.
Hi Aaron - when you say that redirecting to stdout via '>' doesn't work, do you mean you get an empty file? The simplest explanation is that nothing matches your regex and you get an empty file (this is by design). Could it be the case?
Just use BLAST or even BLAT to search your sequence against the know lncRNAs. All of the gencode data is mapped into ensembl. Ensembl has BLAST and BLAT support, so you can just go to their webpage to do this.
Another possibility, if you have the software lying around...short read aligners like bwa and Bowtie are excellent at quickly finding matches to a lot of short sequences in large genomes. Though it may take a bit of work to get out the positions of every hit, if there are a lot of them.
mbk0asis, as joe.cornish826 said, you have to use blast. The question you're asking is "if my presumably-novel-lncRNA-sequence exists in the database". I want to emphasize that you're looking for lncRNA, not just sequence.
My suggestion would be: use blast in NONCODE, that would be a good start. Have in mind that there are much more in the lncRNome than Gencode. See my previous answer about lncRNAs in the human genome.
Also wanted to add that lncRNAs are not evolutionary conserved. Often it's hard to get signal using blast, so using grep would take you nowhere.
Does the fasta file have linebreaks? If yes, that might be the reason for your grep not working. You could try
cat your.fasta | tr -d "\n" | grep "TAGACAT"
Also, don't forget to search the reverse-complement or alternatively just blast your seq against the fasta.
This will not report all instances of match.
This is a file called dummy.fa:
The command
will give you this:
now when you grep a pattern it will report only one line, which means you will not know how many sequences really had this motif.
A small modification will fix this:
but even this will not report multiple matches in the same sequence.
@OP what is that you really want to find
If you're using grep for sequences you should add
grep -i
or run blast if you want to allow mismatches.