Question

PlantTFDB transcription factor discovery.

7.8 years ago

jvire1 ▴ 10

Hi all,

I have assembled three transcriptomes of a non-model plant and have been writing up a report. Initially, I queried the unigenes (blastx) and their predicted protein sequences (blastp) against the entire collection of PlantTFDB protein sequences, using an E-value cutoff of 1e-5.

On analyzing the blastx and blastp hits, I realized that I was getting far too many members of each of the 58 transcription factor families. For example, ~3000 unigenes were annotated as bHLH in one of my assemblies, whereas according to the PlantTFDB species summary for this family (http://planttfdb.cbi.pku.edu.cn/family.php?fam=bHLH), the highest number of bHLH genes identified in any one species is 559 (Panicum virgatum).

I have since filtered the blastx and blastp results at an E-value of 1e-50 (in hindsight, 1e-5 was far too permissive) and >35% identity. This reduced the number of bHLH-annotated unigenes to ~1000, but I suspect this is still too high an estimate.

I have also generated percent hit coverage statistics for the BLAST results and was thinking that I could additionally filter the results to keep only hits above some percent hit coverage threshold.
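To make the combined filter concrete, here is a minimal Python sketch that applies the E-value and identity cutoffs mentioned above plus a hit coverage cutoff, then keeps the best-scoring hit per unigene. It assumes tabular BLAST output produced with `-outfmt "6 qseqid sseqid pident length evalue bitscore slen sstart send"`; that column order, the function name `filter_hits`, and the 70% coverage threshold are all illustrative assumptions, not recommended values. Coverage is computed per alignment (HSP), so multiple HSPs against the same subject are not merged.

```python
import csv
import io

# Assumed column order (hypothetical run), matching:
#   -outfmt "6 qseqid sseqid pident length evalue bitscore slen sstart send"
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue",
           "bitscore", "slen", "sstart", "send"]


def filter_hits(handle, max_evalue=1e-50, min_pident=35.0, min_hit_cov=70.0):
    """Return the best passing hit per query, keyed by query ID.

    min_hit_cov is an arbitrary placeholder threshold (percent of the
    subject sequence covered by the alignment), not a recommendation.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        pident = float(hit["pident"])
        bitscore = float(hit["bitscore"])
        sstart, send, slen = int(hit["sstart"]), int(hit["send"]), int(hit["slen"])
        # Percent of the subject (database protein) covered by this alignment.
        hit_cov = 100.0 * (abs(send - sstart) + 1) / slen
        if evalue > max_evalue or pident < min_pident or hit_cov < min_hit_cov:
            continue
        previous = best.get(hit["qseqid"])
        if previous is None or bitscore > previous["bitscore"]:
            best[hit["qseqid"]] = {
                "sseqid": hit["sseqid"],
                "evalue": evalue,
                "pident": pident,
                "bitscore": bitscore,
                "hit_cov": hit_cov,
            }
    return best


# Made-up example rows, for illustration only.
example = "\n".join([
    "\t".join(["uni1", "TF_A", "62.0", "180", "1e-80", "250", "200", "11", "190"]),
    "\t".join(["uni1", "TF_B", "40.0", "60", "1e-55", "120", "300", "1", "60"]),
    "\t".join(["uni2", "TF_A", "36.0", "150", "1e-20", "90", "200", "1", "150"]),
    "\t".join(["uni3", "TF_C", "30.0", "190", "1e-70", "200", "200", "1", "190"]),
])
kept = filter_hits(io.StringIO(example))
for query_id, info in kept.items():
    print(query_id, info["sseqid"], round(info["hit_cov"], 1))
```

In the made-up rows, only `uni1` against `TF_A` passes all three cutoffs; the others fail on coverage, E-value, and identity respectively. Raising or lowering `min_hit_cov` and watching how the per-family counts change is one way to see how sensitive the annotation is to the threshold.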

Any suggestions on an alternative approach or a percent hit coverage threshold to filter with would be much appreciated.

RNA-Seq • 3.2k views


score 0 · Answer 1 · 2017-10-06

7.8 years ago

jvire1 ▴ 10

Having slept on it, I realized a solution would be to take the transcripts with blastp and blastx hits and analyze them with the PlantTFDB prediction server, which uses a much more robust method to determine homology. The server has limits on the size of uploaded sequences, which is another reason to submit only the transcripts with hits.
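For anyone wanting to do the same, a rough sketch of the subsetting step in plain Python follows: it pulls the sequences whose IDs had BLAST hits out of a FASTA file so that only those are uploaded. The function name `subset_fasta` and the example IDs and sequences are made up for illustration, and the sketch assumes the sequence ID is the first whitespace-delimited word of each header. The server's actual size limit is not encoded here; check it before uploading and split the output if needed.

```python
import io


def record_id(header):
    """Return the first word of a FASTA header, or '' if the header is empty."""
    parts = header.split()
    return parts[0] if parts else ""


def subset_fasta(fasta_handle, wanted_ids):
    """Yield (header, sequence) for records whose ID is in wanted_ids."""
    header, seq_lines = None, []
    for line in fasta_handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if it was requested.
            if header is not None and record_id(header) in wanted_ids:
                yield header, "".join(seq_lines)
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line.strip())
    # Emit the final record in the file.
    if header is not None and record_id(header) in wanted_ids:
        yield header, "".join(seq_lines)


# Made-up sequences and IDs, for illustration only.
fasta = io.StringIO(">uni1 len=6\nMKT\nAAL\n>uni2\nMSS\n>uni3\nMPP\n")
hit_ids = {"uni1", "uni3"}  # e.g. query IDs that had blastx/blastp hits
selected = list(subset_fasta(fasta, hit_ids))
for header, seq in selected:
    print(f">{header}\n{seq}")
```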
