Question

IDs in Oases Pipeline stats.txt File

11.3 years ago

Charles Warden

Hi,

I have run Oases on a couple of RNA-Seq samples, using the packaged Oases_pipeline.py script.

If I look at the stats.txt file (within the oasesPipelineMerged folder), it looks like I can get a pretty good list of contigs / transcripts (in terms of sequence size and list size) if I keep only those that are at least 200 bp long and have at least 10x coverage (as defined by long_cov).
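The filter described above can be sketched in Python. This is a minimal sketch, assuming stats.txt is tab-separated with a header row containing the Velvet column names `ID`, `lgth`, and `long_cov`; the function name `filter_stats` and the example rows are made up for illustration. One caveat: the Velvet manual states that `lgth` is counted in k-mers, so the nucleotide length is `lgth + k - 1`. The sketch compares `lgth` directly, as in the post, which may account for part of a length mismatch against the FASTA sequences.

```python
import csv
import io

# Made-up example rows in the assumed stats.txt layout (tab-separated, header row).
EXAMPLE_STATS = (
    "ID\tlgth\tout\tin\tlong_cov\tshort1_cov\n"
    "1\t850\t1\t1\t25.5\t0.0\n"
    "2\t150\t1\t1\t40.0\t0.0\n"
    "3\t900\t1\t1\t3.2\t0.0\n"
)


def filter_stats(handle, min_lgth=200, min_long_cov=10.0):
    """Yield (ID, lgth, long_cov) for rows that pass both thresholds.

    Assumes the column names ID, lgth and long_cov; lgth is compared as-is
    (per the Velvet manual it is in k-mers, not nucleotides).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        try:
            lgth = int(row["lgth"])
            long_cov = float(row["long_cov"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows that are malformed or lack the assumed columns
        if lgth >= min_lgth and long_cov >= min_long_cov:
            yield row["ID"], lgth, long_cov


if __name__ == "__main__":
    # With a real file: open("oasesPipelineMerged/stats.txt") instead of StringIO.
    for node_id, lgth, cov in filter_stats(io.StringIO(EXAMPLE_STATS)):
        print(node_id, lgth, cov)
```

The IDs that survive are node IDs, so they would be looked up among the `NODE_` headers of contigs.fa rather than in transcripts.fa.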

However, I'm not sure how to pick the corresponding FASTA sequences (to then use BLAST to predict each transcript's function). For example, if I search the FASTA headers in the contigs.fa file (again, within the oasesPipelineMerged folder), I can find IDs that match, but the sequence length doesn't match, and the coverage listed in the FASTA header doesn't match any of the coverage values in the stats.txt file. The relatively large IDs in the stats.txt file also don't appear to match any FASTA headers in the transcripts.fa file (and it seems like I should be using the sequences within transcripts.fa, which I assume is the final product). So, I am not confident I am interpreting the stats.txt file correctly.

When I search for similar problems online, it sounds like the Velvet manual defines some of the column headers, and it seems like the IDs should in fact match the nodes in the contigs.fa file.

Can anybody explain the discrepancies that I am seeing (or suggest some other way to pick out the highest quality assembled transcripts)?

Right now, I like Oases because that filtering strategy (I hope) gives me ~10,000 transcripts with an average size of ~800 bp. Most other programs seem to give many more transcripts, with counts that seem too high for me to believe.

Thanks,

Charles

rna-seq • 3.2k views

score 1 · Answer 1 · 2013-07-22

I have resolved the problem using the following strategy:

1) Re-run Oases using cov_cutoff=10 and min_trans_lgth=200. This decreases the size of the transcripts.fa file by more than 50%.

2) Use the coverage parameter in the sequence headers of the transcripts.fa file to select the primary transcript, in order to avoid running BLAST on multiple isoforms of the same gene. More specifically, I used coverage > 0.5.

Strictly speaking, this didn't explain the discrepancy with the coverage stats, but it fixed the higher-level problem of wanting to produce a more conservative list.
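
Step 2 can be sketched in Python as follows. This is a minimal sketch, not the exact script I used. It assumes the transcripts.fa headers follow the pattern `Locus_N_Transcript_i/j_Confidence_x_Length_y`, and that the value referred to above as coverage is the `Confidence` field in that header. The helper names `read_fasta` and `best_per_locus` and the example records are made up for illustration. Keeping only the single highest-scoring transcript per locus is one possible reading of "primary transcript".

```python
import io
import re

# Assumed header layout: Locus_1_Transcript_1/3_Confidence_0.571_Length_1234
HEADER_RE = re.compile(
    r"^>?Locus_(\d+)_Transcript_(\d+)/(\d+)_Confidence_([0-9.]+)_Length_(\d+)"
)


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, seq = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def best_per_locus(records, min_conf=0.5):
    """Keep, for each locus, the transcript with the highest score above min_conf."""
    best = {}
    for header, seq in records:
        match = HEADER_RE.match(header)
        if match is None:
            continue  # header does not follow the assumed pattern
        locus = int(match.group(1))
        conf = float(match.group(4))
        if conf <= min_conf:
            continue
        if locus not in best or conf > best[locus][0]:
            best[locus] = (conf, header, seq)
    return [(best[k][1], best[k][2]) for k in sorted(best)]


if __name__ == "__main__":
    # Made-up records; with a real file use open("transcripts.fa") instead.
    example = io.StringIO(
        ">Locus_1_Transcript_1/2_Confidence_0.750_Length_8\nACGTACGT\n"
        ">Locus_1_Transcript_2/2_Confidence_0.250_Length_4\nACGT\n"
    )
    for header, seq in best_per_locus(read_fasta(example)):
        print(">" + header)
        print(seq)
```

Loci in which no transcript exceeds the threshold are dropped entirely, which makes the resulting BLAST input more conservative still.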