To identify long non-coding RNAs (lncRNAs), I ran three computational tools (PLEK, CPAT, and CPC2) on a FASTA file to predict lncRNA candidates. To minimize false positives, I then ran a DIAMOND BLASTx search with default parameters against the nr.fasta database. Contigs with hits were removed, and the remaining contigs were considered putative lncRNAs.
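For reference, the hit-removal step can be sketched roughly as below. This is only a minimal illustration, assuming the DIAMOND results are in the default tabular format (query ID in the first column); the function names and the toy contigs are hypothetical and not taken from my actual pipeline.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping sequence ID -> sequence."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def query_ids_with_hits(handle):
    """Collect query IDs (column 1) from tab-separated DIAMOND/BLAST output."""
    hits = set()
    for line in handle:
        if line.strip() and not line.startswith("#"):
            hits.add(line.split("\t")[0].strip())
    return hits


def remove_contigs_with_hits(records, hit_ids):
    """Keep only contigs that had no protein hit."""
    return {seq_id: seq for seq_id, seq in records.items() if seq_id not in hit_ids}


# Toy inputs (hypothetical): three candidate contigs, one of which has a hit.
candidates = io.StringIO(
    ">contig1 len=8\nACGTACGT\n>contig2\nGGGG\nCCCC\n>contig3\nTTTT\n"
)
diamond_tsv = io.StringIO(
    "contig2\tprotA\t90.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t95.0\n"
)

records = read_fasta(candidates)
hit_ids = query_ids_with_hits(diamond_tsv)
putative = remove_contigs_with_hits(records, hit_ids)

# Write the surviving contigs back out as FASTA text.
for seq_id, seq in putative.items():
    print(f">{seq_id}\n{seq}")
```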
However, when I randomly selected a few sequences from the final lncRNA.fasta file and submitted them to the NCBI BLASTx web server, some of them still showed hits. I am unsure how to proceed and would appreciate guidance on how to refine my lncRNA dataset.
Check this out: bioRxiv