Question

How many of the SSR-containing transcripts have an ORF?

0

7.0 years ago

Farbod ★ 3.4k

Dear Biostars, Hi

I have searched my transcripts (the longest isoform of each gene from RNA-seq data) with MISA to report any potential SSRs. The total number of SSR-containing sequences is 93022.

Q: How can I figure out how many of these sequences/transcripts contain an ORF?

Thanks

NOTE:

I have also used Transdecoder to predict ORFs for all of my transcripts, but I cannot check all 93022 IDs against the Transdecoder results manually.

ORF MISA SSR RNA-Seq • 2.1k views

ADD COMMENT • link updated 7.0 years ago by h.mon 35k • written 7.0 years ago by Farbod ★ 3.4k

0

Can I collect the SSR-containing transcript IDs in a text file and check for their counterparts in the Trinity.fasta.transdecoder.pep file using Linux command-line tools such as grep -F -f?

ADD REPLY • link 7.0 years ago by Farbod ★ 3.4k

score 1 · Answer 1 · 2017-12-05

1

7.0 years ago

h.mon 35k

If I recall correctly (and I am mostly certain I do), ~~Trinotate~~ Transdecoder outputs a Trinity.fasta.transdecoder.bed file. You could use this BED file to get a FASTA of the ORFs and predict SSRs with MISA on that file.

ADD COMMENT • link 7.0 years ago by h.mon 35k

0

Hi @h.mon and thanks,

By Trinotate, you mean Transdecoder?

Transdecoder produces a .bed file too, as you mentioned.

Do you mean I should use that as my main transcript file in MISA?

NOTE:

The head of the BED file looks like this:

track name='Trinity.fasta.transdecoder.gff3'

TRINITY_DN10003_c0_g1_i1 0 395 ID=TRINITY_DN10003_c0_g1_i1.p1;TRINITY_DN10003_c0_g1~~TRINITY_DN10003_c0_g1_i1.p1;ORF_type:5prime_partial_len:125_(+),score=5.10 0 + 2 377 0 1 395 0

TRINITY_DN100126_c0_g1_i1 0 624 ID=TRINITY_DN100126_c0_g1_i1.p1;TRINITY_DN100126_c0_g1~~TRINITY_DN100126_c0_g1_i1.p1;ORF_type:complete_len:120_(+),score=39.02 0 +

ADD REPLY • link 7.0 years ago by Farbod ★ 3.4k

1

Transdecoder produces a .bed file too, as you mentioned. Do you mean I should use that as my main transcript file in MISA?

No, you should use something like bedtools getfasta and use the resulting fasta as input to MISA.

ADD REPLY • link 7.0 years ago by h.mon 35k

0

Thanks. It seems that bedtools getfasta has many switches and options.

Is combining the Transdecoder .bed file with the original Trinity.fasta what we intend to do?

ADD REPLY • link 7.0 years ago by Farbod ★ 3.4k

1

Untested:

bedtools getfasta -fo orfs.fas -fi Trinity.fasta -bed Trinity.fasta.transdecoder.bed

You possibly want -split.

ADD REPLY • link 7.0 years ago by h.mon 35k

0

I created orfs.fas following your guidance.

I guess I can use it as the main MISA input file now, so why would I need the -split option?

ADD REPLY • link 7.0 years ago by Farbod ★ 3.4k

0

It seems that I cannot use this approach, because I used all isoforms for the Transdecoder ORF prediction BUT only the longest isoform of each gene for SSR mining.

So my .bed file has many more entries than the FASTA file that was used for SSR mining. As a result, the number of SSR-containing transcripts that have an ORF CAN be larger than the total number of SSR-containing sequences!

Maybe I could use a Linux script to collect the 93022 SSR transcript IDs, look up their ORF results in the Transdecoder .pep file, and then count them?
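Something like the following might work as a rough sketch of that ID-lookup idea (untested on real data). It assumes a hypothetical ssr_ids.txt with one transcript ID per line, for example taken from the first column of the MISA output, and it assumes that each .pep header starts with the transcript ID plus a .p1, .p2, ... suffix, like the names in the BED file shown earlier. The function name count_ssr_with_orf is made up for illustration. Whole IDs are compared, because a plain grep -F -f matches substrings, so an ID ending in _i1 would also hit _i10.

```shell
# Count how many SSR-containing transcripts have at least one predicted ORF.
# Assumptions (not confirmed in this thread):
#   $1 = text file with one SSR transcript ID per line (hypothetical ssr_ids.txt)
#   $2 = Transdecoder .pep file whose headers look like ">TRANSCRIPT_ID.p1 ..."
count_ssr_with_orf() {
    awk '
        # First file (.pep): remember each transcript ID that has an ORF
        FILENAME == ARGV[1] {
            if (/^>/) {
                id = substr($1, 2)          # drop the leading ">"
                sub(/\.p[0-9]+$/, "", id)   # drop the ORF suffix (.p1, .p2, ...)
                orf[id] = 1
            }
            next
        }
        # Second file (ID list): count each listed ID once if it has an ORF
        ($1 in orf) && !seen[$1]++ { n++ }
        END { print n + 0 }
    ' "$2" "$1"
}

# Usage (file names are placeholders):
#   count_ssr_with_orf ssr_ids.txt Trinity.fasta.transdecoder.pep
```

Because a transcript is counted once no matter how many ORFs it has, the result cannot exceed the number of IDs in the list.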

ADD REPLY • link 7.0 years ago by Farbod ★ 3.4k

1

You can use the FASTA with the longest isoforms, provided the names of the sequences have not changed.

bedtools getfasta -fo orfs.fas -fi Trinity.longest.fasta -bed Trinity.fasta.transdecoder.bed

You will get a lot of warnings about names found in the bed file and absent in the fasta, as you removed sequences from the fasta.
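If the warnings are a nuisance, one option is to filter the BED file down to the sequences present in the longest-isoform FASTA before running bedtools getfasta. This is only a sketch under assumptions: filter_bed_by_fasta is a hypothetical helper name, Trinity.longest.bed is a made-up output name, and the FASTA ID is taken to be the first whitespace-delimited token after ">".

```shell
# Keep only BED records whose first column matches a sequence name in the FASTA.
#   $1 = FASTA file (e.g. the longest-isoform FASTA)
#   $2 = BED file (e.g. the Transdecoder BED)
filter_bed_by_fasta() {
    awk '
        # First file (FASTA): collect sequence names from the header lines
        FILENAME == ARGV[1] {
            if (/^>/) keep[substr($1, 2)] = 1
            next
        }
        # Second file (BED): pass through the track line and matching records
        /^track/ || ($1 in keep)
    ' "$1" "$2"
}

# Usage (output file name is a placeholder):
#   filter_bed_by_fasta Trinity.longest.fasta Trinity.fasta.transdecoder.bed > Trinity.longest.bed
#   bedtools getfasta -fo orfs.fas -fi Trinity.longest.fasta -bed Trinity.longest.bed
```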

ADD REPLY • link 7.0 years ago by h.mon 35k