Dear all,
I'm getting confused about a very basic matter, so please help me understand what happened. I did a de novo transcriptome assembly for a non-model organism, then ran blastx. I summarized part of the blastx output using these commands:
cut -f1 blast_output.txt | sort -u | wc -l
(which shows how many of the query sequences got a hit) and
cut -f2 blast_output.txt | sort -u | wc -l
(which shows how many subjects my query sequences hit). These numbers were 36725 and 16542, respectively, for one of my assemblies, which has 57210 sequences. Is this usual? Please be patient with me and tell me how to present the results: can I say that 36725 of 57210 contigs have been annotated? Also, please explain the source of the difference between the two numbers (36725 and 16542). One reason is that more than one contig got the same hit, am I right? I'm quite concerned about this, so please post what you know, even if the issue seems simple and stupid to you.
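One way to look at the gap between the two counts, as a rough sketch only: keep a single hit per contig and then tabulate how many contigs share each subject. This assumes the blastx table is tab-separated with the query ID in column 1 and the subject ID in column 2 (as the `cut` commands above imply), and that hits for each query are listed best first. The file names and the IDs in the toy table are made up for illustration.

```shell
# Toy tab-separated BLAST table (hypothetical IDs): query in column 1, subject in column 2.
printf 'contig1\tP1\ncontig1\tP2\ncontig2\tP1\ncontig3\tP3\ncontig4\tP1\n' > toy_blast.txt

# Keep only the first hit per query (assumed to be the best one),
# so that every contig is counted exactly once.
awk -F'\t' '!seen[$1]++' toy_blast.txt > toy_best.txt

# Same counts as in the question, but on the one-hit-per-contig table.
queries=$(cut -f1 toy_best.txt | sort -u | wc -l | tr -d ' ')
subjects=$(cut -f2 toy_best.txt | sort -u | wc -l | tr -d ' ')
echo "contigs with a hit: $queries"
echo "unique best-hit subjects: $subjects"

# Number of contigs sharing each subject, most-shared subjects first.
cut -f2 toy_best.txt | sort | uniq -c | sort -k1,1nr
```

In the toy table three contigs share the subject P1, which is exactly the situation where the query count ends up larger than the subject count.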
Many thanks
Hi 5heikki, although I got your reply in my email, it did not appear here! Yes, it's a transcriptome assembly; is this usual? About the unique ORFs within contigs that you mentioned: it is assumed that every unique contig carries its own unique ORF, otherwise there are chimeric contigs. Am I right or wrong? Please share your thoughts.
So 36,725 out of 57,210 putative assembled mRNAs hit a total of 16,542 unique sequences in some database? What was the database? What were the thresholds? Since your organism is diploid, we can expect that all its expressed proteins are transcribed from at least two loci as nearly or exactly identical mRNAs, yes? How did you assemble the transcriptome? Was there any actual assembly, or did you just, e.g., merge pairs? If you cluster your transcriptome at, e.g., 99% identity, how many clusters are there?
Thanks for following the post. This assembly was done with the CLC genomics software (k=64) after read trimming, and searched with blastx against the UniProt database (Viridiplantae). We can assume transcription of at least two nearly or exactly identical mRNAs; it may also result from alternative splicing forms that produce members of one protein family, but I'm not sure about these. What do you think? In addition, I did another assembly with Trinity and merged it with the CLC assembly, then ran cd-hit-est on the result to remove redundancy (threshold 1); it generated 182968 clusters from 204397 input sequences. The blastx for this assembly was done against just the Arabidopsis proteome as the database (for fast evaluation), and although 80% of contigs got a hit, only 28% of the hits were unique. These results are driving me crazy, as I don't know whether they are usual or not. Which strategy is right? What is wrong, and how can I solve or even improve it? Please share your opinion on the issue.
Many thanks for reading and helping me out with this.
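For the clustering question raised earlier in the thread, here is a hedged sketch of re-running cd-hit-est at 99% identity instead of threshold 1 and counting the resulting clusters. The file names are placeholders, and `-n 10` is just a word size that cd-hit-est accepts at this identity level, not a tuned setting. The last lines only redo the arithmetic on the numbers reported above.

```shell
# Placeholder file names; replace with the real merged assembly FASTA.
in=merged_assembly.fasta
out=merged_c99.fasta

# Cluster at 99% identity, but only if the tool and the input are available.
if command -v cd-hit-est >/dev/null 2>&1 && [ -s "$in" ]; then
    cd-hit-est -i "$in" -o "$out" -c 0.99 -n 10
    # Each representative sequence in the output stands for one cluster.
    grep -c '^>' "$out"
fi

# Share of sequences removed at threshold 1 in the run reported above
# (182968 clusters from 204397 input sequences).
removed=$(awk 'BEGIN { printf "%.1f", 100 * (204397 - 182968) / 204397 }')
echo "removed at 100% identity: ${removed}%"
```

Threshold 1 only collapses sequences that are identical to (or fully contained in) a longer one, so a small removed share there says little about how many near-identical contigs remain.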