Dear community, I'm currently working with the transcriptome of a non-model plant organism. For this study, I began by assembling a transcriptome using the genome as a guide, with both short and long reads. Afterward, I extracted all the non-mapping read pairs (from the short reads) and the non-mapping long reads (around 10% of the data) to build a new, reference-free transcriptome. I also checked the quality of the non-mapping reads. Not surprisingly, I suspect some of them come from contamination, since some of the samples showed two peaks in the GC content plot. I assembled these reads with Trinity and BLASTed 100 randomly chosen sequences against nr. I was expecting to find fungal, human, or other animal sequences, but instead I only got plant sequences in my results. Although this appears to be good news, I want to make sure I really do not have contaminant sequences. What would be the best way to confirm that?
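For reference, the spot check described above (sampling 100 assembled transcripts and looking at their GC content) can be sketched roughly as follows. This is only a minimal illustration, not the exact procedure used: the function names (`read_fasta`, `gc_fraction`, `sample_transcripts`) are hypothetical helpers, and the fixed random seed is an assumption made so the sample is reproducible.

```python
import random


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def sample_transcripts(records, n=100, seed=0):
    """Randomly pick up to n records (seed is an arbitrary assumption)."""
    records = list(records)
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))
```

Passing an open file handle for the Trinity FASTA output to `read_fasta` and then calling `gc_fraction` on each sequence gives per-transcript GC values. A histogram of those values shows whether the two GC peaks seen in the reads persist at the transcript level.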
I forgot to add this to the post, but some of the samples were grown outside without controlled conditions (this is a weird experimental setup, but it was relevant for our biological question).
I do not have a finished genome (it is not even at chromosome level). What I am trying to figure out is whether these sequences include transcripts that do not belong to my species. Below is the GC plot I mentioned in my post.
Judging this purely with informatics is going to be inconclusive. As you have already discovered, some of these are coming up as plant sequences. That, of course, does not say much, since they could come from diverse plant lineages and may still represent contaminants.
You could try assembling a separate transcriptome from the controlled samples only and see whether sequences similar to these show up there. If they don't, then this observation could be considered a +1 for these being possible contaminants.
You could also build two transcriptomes (one from the controlled samples and one from the uncontrolled samples) and then select only the transcripts that are common to both.
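One way to do that comparison, as a rough sketch rather than a prescribed method: align the transcripts from the uncontrolled samples against the controlled-sample assembly with BLAST, write tabular output (`-outfmt 6`), and split the transcript IDs by whether they have a convincing hit. The function names are hypothetical, and the identity and alignment-length cutoffs are arbitrary assumptions that would need tuning for real data.

```python
def shared_transcripts(blast_lines, min_identity=95.0, min_aln_len=200):
    """Return query IDs with at least one hit passing both cutoffs.

    Expects BLAST tabular lines (outfmt 6), whose default columns are:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. The cutoffs are illustrative assumptions only.
    """
    shared = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, pident, length = fields[0], float(fields[2]), int(fields[3])
        if pident >= min_identity and length >= min_aln_len:
            shared.add(qseqid)
    return shared


def split_by_support(all_ids, blast_lines, **cutoffs):
    """Split transcript IDs into (found in both, found in one only)."""
    all_ids = set(all_ids)
    shared = shared_transcripts(blast_lines, **cutoffs)
    return all_ids & shared, all_ids - shared
```

Transcripts in the second set are not automatically contaminants, since genes expressed only under outdoor conditions would land there too. They are simply the candidates worth a closer look.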
What is the long term goal of your experiment? To generate a transcriptome and stop there?
Currently, we have two goals: assembling a transcriptome for our species that could be used in future studies, and using that transcriptome ourselves for DGE and network analysis.