Hello everyone, after running blastx with Unigenes.fa (from de novo RNA-Seq assemblies) against a reference proteome, I found that multiple contigs hit the same gene. I would like to make my data uniform by removing the shorter contigs and keeping only the longest contig to represent each gene. An example is below:
From the original data:
ENSDARP00000143232.1 GGCTCCTCTTTTTCAACTGGACATCCTTAAAACTGTATGAAAGGGGCGGAGCCTTTTGCTACTTGCATACTTAAGCTCCTTCACATTCCTCTAGCCCTTTACGAA
ENSDARP00000143232.1 GGCTCCTCTTTTTCAACTGGACATCCTTAAAACTGTATGAAAGGGGCGGAGCCTTTTGCTACTTGCATACTTAAGCTCCTTCAC
ENSDARP00000143232.1 GGCTCCTCTTTTTCAACTGGACATCCTTAAAACTGTATGAAAGGGGCGGAGCCTTTTGC
To what I want:
ENSDARP00000143232.1 GGCTCCTCTTTTTCAACTGGACATCCTTAAAACTGTATGAAAGGGGCGGAGCCTTTTGCTACTTGCATACTTAAGCTCCTTCACATTCCTCTAGCCCTTTACGAA
Could you give me some suggestions or some scripts to help me? Thanks!
Have you tried cdhit? You could also use the query length from the blast file to filter, keeping the longest query for each hit. How did you obtain the unigenes?
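If a small script is easier, the length filter could look roughly like the sketch below. It assumes the data is a whitespace-separated two-column table (protein ID, then contig sequence) with one record per line, as in the example in the question. The function name `keep_longest` and the toy records are made up for illustration, and ties are resolved by keeping the first contig seen.

```python
def keep_longest(lines):
    """Return {protein_id: longest sequence} from 'ID SEQUENCE' lines."""
    longest = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        protein_id, seq = fields[0], fields[1]
        # Strict '>' means the first contig wins when lengths are equal.
        if len(seq) > len(longest.get(protein_id, "")):
            longest[protein_id] = seq
    return longest


# Toy records (hypothetical IDs and sequences, not real data).
records = [
    "geneA ACGTACGTACGT",
    "geneA ACGTACGT",
    "geneA ACGT",
    "geneB TTGACC",
]

for protein_id, seq in keep_longest(records).items():
    print(protein_id, seq, sep="\t")
```

To run it on a real table, pass an open file handle instead of the toy list, for example `keep_longest(open("hits.tsv"))`, where `hits.tsv` is a placeholder file name.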