Hi,
I have run Oases on a couple of RNA-Seq samples, using the packaged Oases_pipeline.py script.
If I look at the stats.txt file (within the oasesPipelineMerged folder), it looks like I can get a pretty good list of contigs / transcripts (in terms of sequence size and list size) if I filter for those that are at least 200 bp long and have at least 10x coverage (as defined by long_cov).
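To be concrete, the filter described above could be sketched roughly as follows. This is only a sketch under assumptions: it assumes stats.txt is tab-delimited with a header row, and that the ID and length columns are named `ID` and `lgth` (`long_cov` is the coverage column mentioned above). The function name `filter_stats` and the toy table are made up for illustration.

```python
import csv
import io

def filter_stats(handle, min_length=200, min_cov=10.0):
    """Return IDs of rows that meet both the length and long_cov thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            node_id = row["ID"]              # assumed column name
            length = int(row["lgth"])        # assumed column name
            cov = float(row["long_cov"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if length >= min_length and cov >= min_cov:
            kept.append(node_id)
    return kept

# Toy table standing in for stats.txt (values are invented).
example = io.StringIO(
    "ID\tlgth\tlong_cov\n"
    "1\t850\t12.5\n"
    "2\t150\t40.0\n"
    "3\t300\t3.2\n"
    "4\t200\t10.0\n"
)
print(filter_stats(example))
```

With a real file, the same function would be called as `filter_stats(open("stats.txt"))`, assuming the column names above are right.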
However, I'm not sure how to pick out the corresponding FASTA sequences (to then use BLAST to predict each transcript's function). For example, if I search the FASTA headers in the contigs.fa file (again, within the oasesPipelineMerged folder), I can find IDs that match. However, the sequence length doesn't match, and the coverage listed in the FASTA header doesn't match any of the coverage values in the stats.txt file. If I look up relatively large IDs from the stats.txt file, they don't appear to match the FASTA headers in the transcripts.fa file (and it seems like I should be using the sequences within transcripts.fa, which I assume is the final product). So, I am not confident I am interpreting the stats.txt file correctly.
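The header comparison described above could be sketched as follows. This is a rough sketch, assuming contigs.fa headers follow the pattern `>NODE_<id>_length_<n>_cov_<x>`; that pattern, the function name `read_contigs`, and the toy records are assumptions for illustration, and records with other header styles are simply skipped.

```python
import re

# Assumed header layout: >NODE_<id>_length_<n>_cov_<x>
HEADER_RE = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")

def read_contigs(lines):
    """Yield (node_id, header_length, header_cov, sequence) per matching record."""
    node = None
    seq_parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if node is not None:
                yield (*node, "".join(seq_parts))
            m = HEADER_RE.match(line)
            # Records whose header does not match the assumed pattern are skipped.
            node = (m.group(1), int(m.group(2)), float(m.group(3))) if m else None
            seq_parts = []
        elif node is not None:
            seq_parts.append(line)
    if node is not None:
        yield (*node, "".join(seq_parts))

# Toy records standing in for contigs.fa (sequences are invented).
fasta = [
    ">NODE_7_length_5_cov_12.500000",
    "ACGTACG",
    "TA",
    ">NODE_9_length_3_cov_2.0",
    "ACGTA",
]
for node_id, hdr_len, hdr_cov, seq in read_contigs(fasta):
    # Print the header values next to the actual sequence length for comparison.
    print(node_id, hdr_len, hdr_cov, len(seq))
```

Printing the header length next to the actual sequence length for a given ID makes it easy to line both up against the matching row in stats.txt.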
When I search for similar problems online, it sounds like the Velvet manual defines some of the column headers, and it seems like the IDs should in fact match the nodes in the contigs.fa file.
Can anybody explain the discrepancies that I am seeing (or suggest some other way to pick out the highest quality assembled transcripts)?
Right now, I like Oases because (I hope) that filtering strategy gives me ~10,000 transcripts with an average size of ~800 bp. Most other programs seem to give many more transcripts, with counts that seem too high for me to believe.
Thanks,
Charles