Dear everyone,
I have downloaded contig sequences for the same species from TSA (Transcriptome Shotgun Assembly), from two different BioProjects (same authors, but different studies). One file contains ~800,000 sequences while the other has ~400,000 sequences.
I'm interested in identifying protein-coding regions and I'm using TransDecoder for that purpose. After running TransDecoder I got ~300,000 and ~150,000 protein-coding regions, respectively. I'm aware that TransDecoder looks for possible ORFs in all 6 reading frames, so the initial number of contig sequences is probably correlated with the final number of predicted proteins.
However, I'm wondering how one can infer the "true" (i.e. closest to reality) set of protein-coding regions for a species. For example, the proteome of Xenopus tropicalis currently contains 39,662 sequences (or mRNAs, as stated here) and that of Anolis carolinensis 32,230. So why do I get so many proteins, and how can I get a more realistic number?
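To make the six-frame point concrete, here is a toy sketch of scanning one contig for complete ORFs (ATG to stop) on both strands. It is not TransDecoder's actual algorithm; the function names and the `min_codons` parameter are my own, and the default of 100 only mirrors a typical minimum protein length. It does show why every contig can contribute several candidate ORFs.

```python
# Toy six-frame ORF scan for illustration only; NOT TransDecoder's algorithm.
# All names here (six_frame_orfs, min_codons, ...) are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for complete ATG..stop ORFs in one frame of one strand."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # first ATG after the previous stop opens an ORF
        elif start is not None and codon in STOPS:
            # count the codons before the stop (ATG included)
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3)
            start = None


def six_frame_orfs(seq, min_codons=100):
    """Collect (strand, frame, start, end) over 3 forward and 3 reverse frames.

    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame, start, end))
    return found
```

With hundreds of thousands of contigs, even a modest number of frames passing the length cutoff per contig adds up quickly, so the raw ORF count says little about the true gene number.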
Thanks!
I recommend reading the manual, since it describes how to include BLASTP and Pfam searches when selecting coding regions.
Thanks for your suggestion @biofalconch, you are right. I knew about this optional step but did not use it. I agree I would get fewer sequences by including blastp or Pfam searches, but what about novel proteins that are not in the reference databases? That's why I did not use it before... :(
Unless you are working with an extreme outlier, there should be something with hints of reasonable homology in current protein databases.
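One way to keep an eye on possible novel proteins is to split the predicted ORFs into those with a homology hit and those without, instead of simply discarding the latter. A minimal sketch, assuming the blastp results are in the standard 12-column tabular format (`-outfmt 6`, query ID in column 1 and e-value in column 11); the function name and the `1e-5` cutoff are my own choices, not something from the TransDecoder manual.

```python
# Sketch: partition ORF IDs by whether they have a BLAST hit below a cutoff.
# Assumes standard 12-column BLAST tabular output (-outfmt 6).
# split_by_homology and max_evalue are hypothetical names for illustration.
def split_by_homology(orf_ids, blast_lines, max_evalue=1e-5):
    """Return (with_hit, without_hit) lists, preserving the order of orf_ids."""
    supported = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        evalue = float(fields[10])  # column 11 of outfmt 6 is the e-value
        if evalue <= max_evalue:
            supported.add(query_id)
    with_hit = [o for o in orf_ids if o in supported]
    without_hit = [o for o in orf_ids if o not in supported]
    return with_hit, without_hit
```

The `without_hit` list is then the pool where truly novel proteins would be hiding, alongside spurious ORFs, and it can be inspected separately (length, expression, Pfam domains) rather than thrown away.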
One reason could be that these sequences include multiple isoforms per gene instead of only the longest isoform. Different splice forms of the same gene can each give their own CDS.
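If the contig IDs still carry gene and isoform information, collapsing to one protein per gene is straightforward. A rough sketch, assuming Trinity-style IDs ending in `_i<number>`; that naming is purely an assumption on my part, since TSA records may have been renamed and then this grouping would not work.

```python
import re

# Hypothetical assumption: Trinity-style transcript IDs such as
# "TRINITY_DN1_c0_g1_i2", where the "_i2" suffix marks the isoform.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_of(transcript_id):
    """Strip the isoform suffix to get a gene-level ID."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def longest_per_gene(proteins):
    """Keep the longest protein per gene.

    proteins: dict mapping transcript ID -> protein sequence.
    On ties, the first one encountered is kept.
    """
    best = {}
    for tid, seq in proteins.items():
        gene = gene_of(tid)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (tid, seq)
    return {tid: seq for tid, seq in best.values()}
```

Comparing the number of proteins before and after this step gives a quick idea of how much of the excess is due to isoforms.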
Thanks @Rohit, in both datasets there are only unique IDs, so I'm assuming that the authors kept only the longest isoform per gene before publishing the contig sequences in TSA.
I can't be sure about isoforms from the unique IDs alone: what if the transcript names were changed into unique ones during pre-processing? There is no mention in TSA of keeping only the longest isoform of each transcript. If there is a reference genome, mapping the contigs onto it with a splice-aware mapper to check would definitely help. Otherwise, as @genomax suggested, there wouldn't be a huge difference.