Alignment Of Thousands Of Short Rnas To Local Database
4
3
Entering edit mode
13.0 years ago
Mattrition ▴ 80

I have a fasta file of about 200,000 RNA sequences and a server holding a local copy of Rfam. For each sequence I want to know the type of RNA it is most related to and ultimately retrieve statistics on the proportion of sequences that are most probably rRNAs, tRNAs, real sRNAs etc.

I'm sure this kind of thing has been done before plenty of times but I can't find much information on it myself. I was going to use BLAST for the job, retrieving the top scoring result from each output report using perl. My question is, is BLAST really capable of handling the local alignment of many very short (16 - 50nt) sequences in a reasonable amount of time?

blast local • 5.3k views
ADD COMMENT
0
Entering edit mode

I'll let others answer your direct BLAST question, but as a personal suggestion I would consider using MAFFT, which in fact is capable of doing what you describe.

ADD REPLY
0
Entering edit mode

That looks like a multiple alignment tool. The terminology is confusing here, my apologies. To clarify, I want to align each sequence in my RNA collection one after the other, find out the sequence they are most related to in the Rfam database, and extract information on that RNA's family to use in my statistics. Once I have a candidate family for each individual RNA, I can add up the number of RNAs in my collection that are part of each family and report that as percentages.

ADD REPLY
0
Entering edit mode

Soem more important questions. Is this a single organism RNA? Do you know which organism(s) are in the sample? Do you have a reference genome? If the answers are yes, then blast (and MAFFT anyway, you don't want to do MSA on 200.000 sequences!) is the wrong tool.

ADD REPLY
2
Entering edit mode
13.0 years ago
ALchEmiXt ★ 1.9k

You might want to consider BLAT (Blast Like Alignment Tool) which is very fast and produces similar really good to tweak results. We use it for many mapping questions where bowtie or bwa are not in the picture. The BLAT default output tables are easy to parse.

ADD COMMENT
1
Entering edit mode

I am confused by what you mean by "the best alignments"?! For perfect or near-perfect matches you could consider bowtie and bwa respectively. But my guess would still be that BLAT can do the job.

ADD REPLY
0
Entering edit mode

BLAT seems like the most simple solution to this. However, if its like BLAST I still worry that its not going to give me the best alignments.

ADD REPLY
0
Entering edit mode

By "best alignments" I mean exactly that BLAST possibly has very poor specificity when it comes to very small sequences. I've not heard of bwa and bowtie though so I will check those out too.

ADD REPLY
0
Entering edit mode

bowtie and bwa (as well as stampy and many others) are so-called short-read mappers and will work perfectly but considering stringinty required of the mapping it goes from stringent bowtie, to bwa (tolerates indels) till BLAT tolerates larger mods... you can fine-tune BLAT for tile size and sich. See their manual.

ADD REPLY
1
Entering edit mode
13.0 years ago
Michael 55k

Question: where is that data coming from? e.g. Meta-genomic short-reads of bacterial RNA?

There is the tool Rfam_scan to scan a local Rfam database. Could you use this software? You have a large number of sequences though, that might give you some trouble. You might then be able to parallelize by splitting up the input data. Rfam-scan uses Blast for a pre-screening. Could you at least use the same parameters?

Another suggestion: it might be good to use other tools. tRNA-scanSE is generally the best tool for detecting tRNA-genes in whole genomes. How well tRNA-scan works on short reads (especially given your read length is shorter than the tRNA!) you could have to evaluate yourself.

In any case I would run a chained work-flow where I would run blast on a specific database and probably tRNAscan to filter out easy hits. look at this answer for finding rRNA-genes, if this is RNA-seq data, then these reads might constitute the vast majority of the data, then remove the clear hits from further search to reduce the search size and run other more computational expensive tools.

ADD COMMENT
0
Entering edit mode

Its a short-read transcriptome library. Those are good suggestions, I'll take a look at those perl scripts. However, the other suggestions assume that I am particularly looking for tRNAs and rRNAs, which is not the case. i just need to quickly see what my library is composed of in terms of RNA families.

ADD REPLY
0
Entering edit mode

My suggestion to first screen for tRNA and rRNA was mainly inted to reduce the number of input sequences, especially to subtract ribosomal RNA (might be up to 90% of the sample) reads, in order to spead up the subsequent steps. The idea is mainly to run the most efficient reduction as a first step in each pipeline.

ADD REPLY
0
Entering edit mode
13.0 years ago

In my experience of trying to recreate published PCR experiments, I don't hold as reliable BLASTN results when the query is less than 16 nt against a (reference) genome database. BLASTN can do the job you propose, but may need to be run twice with different parameters optimal for the shorter and then again the longer length queries. Jorge's offer of MAFFT is good. Smith-Waterman can also do this.

ADD COMMENT
1
Entering edit mode

Blast doesn't guarantee the best alignment because it's designed for finding alignments against a large amount of sequence. It uses a heuristic where it looks for a exact match seed sequence and then extends off the seed sequence.

ADD REPLY
0
Entering edit mode

I'm a bit confused. Everybody seems to be suggesting MAFFT, but from what I've read its used primarily for multiple sequence alignment. What I want to do is a pairwise alignment with every RNA sequence to the database in question. Does MAFFT do this better than BLAST might?

ADD REPLY
0
Entering edit mode

That's very good to know. But why is MAFFT better for pairwise alignments?

ADD REPLY
0
Entering edit mode

I agree with DK about teh BLAST heuristic, but still, 200K seqs at an avg length of ~35 bp is still 6.5 - 7 MB of data. This is probably a lot larger a database than was used by BLAST for many years. It's not at all like a database of 100 seqs.

ADD REPLY
0
Entering edit mode

Smith-Waterman is the algorithm that yields optimal local alignments, and that Needleman-Wunsch algorithm would yield the optimal global alignments. Now, the problem is they are way too slow, no way you are going to succeed for 200k sequences in you lifetime (maybe exagerated). Then, recommending a MSA program is not helping I guess. The FASTA algorithm/heuristic has higher sensitivity than Blast, but is slower as a tradeof.

ADD REPLY
0
Entering edit mode

Smith-Waterman is the algorithm that yields optimal local alignments, and Needleman-Wunsch algorithm would yield the optimal global alignments. Now, the problem is they are way too slow, no way you are going to succeed for 200k sequences in your lifetime (maybe exagerated), you need a heuristic approach. Then, recommending a MSA program is not helping I guess. The FASTA algorithm/heuristic has higher sensitivity than Blast, but is slower as a tradeof.

ADD REPLY
0
Entering edit mode
13.0 years ago

I don't think blast is the best tool to use when you have short sequences. Best you can do is adjust word size to shortest possible (7 for blastn, 2 for blastx) and have a low e-value threshold (1E-100 or something).

I would use an aligner like the one Jorge suggested and do a bunch of pair wise alignments to find the optimal alignment.

ADD COMMENT
0
Entering edit mode

For BLASTN, window of 7 is right, but for this specialized database of 200K seqs, 1e-100 for e-value threshold is far too stringent, certainly for the smaller queries.

ADD REPLY
0
Entering edit mode

Yeah you are right. The search space is pretty small. Maybe look at the bitscore instead of the e-value.

ADD REPLY
0
Entering edit mode

Exactly as I would do.

ADD REPLY

Login before adding your answer.

Traffic: 1906 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6