I have a CDS fasta file from NCBI, and there are around 11k duplicated transcripts. I would like to only get the canonical sequences (would be the longest transcript, I suppose?), so there won't be duplicated coding sequences in my downstream analyses. Can you please suggest tools I can easily use?
Thanks!
If you are working with human data then look into the MANE project: https://www.ncbi.nlm.nih.gov/refseq/MANE/#Select
The longest transcript is not necessarily the canonical one; "canonical" can instead mean the transcript encoding the most abundant protein, the one most conserved among species, etc. Also, 11k seems too small a number for human/mammalian transcripts. Anyway, Ensembl GTF files should have a canonical tag (no idea if this holds for annotations of more exotic species), so you can select the tagged transcripts and extract their sequences from the FASTA using e.g. bedtools.
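If you would rather do the tag-based selection without extra tools, here is a minimal Python sketch of the idea. It assumes the GTF marks canonical transcripts with `tag "Ensembl_canonical"` on `transcript` rows, and that each FASTA header starts with the transcript ID, optionally followed by a `.version` suffix. Both are assumptions to check against your own files, and the function names are just made up for illustration.

```python
import re


def canonical_transcript_ids(gtf_lines, tag="Ensembl_canonical"):
    """Collect transcript IDs from GTF 'transcript' rows that carry the given tag.

    The tag name is an assumption; check what your GTF actually uses.
    """
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only look at transcript-level records
        attrs = fields[8]
        if f'tag "{tag}"' not in attrs:
            continue
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        if match:
            ids.add(match.group(1))
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Yield the lines of FASTA records whose ID is in keep_ids.

    Assumes the ID is the first whitespace-separated token of the header,
    with any '.version' suffix stripped before comparison.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            record_id = tokens[0].split(".")[0] if tokens else ""
            keep = record_id in keep_ids
        if keep:
            yield line
```

Usage would look roughly like this (file names are placeholders):

```python
with open("annotation.gtf") as gtf:
    ids = canonical_transcript_ids(gtf)
with open("cds.fa") as fasta, open("cds.canonical.fa", "w") as out:
    out.writelines(filter_fasta(fasta, ids))
```

Note that this only works if the FASTA and the GTF use the same transcript identifiers, so an NCBI CDS FASTA would need to be paired with a matching annotation rather than an Ensembl one.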