Question

Can you suggest tools or a script to get canonical sequences from a CDS FASTA file?

14 months ago

Yuto • 0

I have a CDS FASTA file from NCBI that contains around 11k duplicated transcripts. I would like to keep only the canonical sequences (which would be the longest transcript, I suppose?), so that there are no duplicated coding sequences in my downstream analyses. Can you please suggest tools that are easy to use?

Thanks!

genome cds fasta • 743 views

ADD COMMENT • link updated 14 months ago by Ram 45k • written 14 months ago by Yuto • 0

If you are working with human data then look into the MANE project: https://www.ncbi.nlm.nih.gov/refseq/MANE/#Select

ADD REPLY • link 14 months ago by GenoMax 150k

The longest transcript is not necessarily the canonical one: "canonical" can also mean the transcript encoding the most abundant protein, the one most conserved among species, etc. 11k seems too small for human/mammalian transcripts. Anyway, Ensembl GTF files should have a canonical tag (no idea whether this holds for annotations of more exotic species), so you can select the tagged transcripts and extract their sequences from the FASTA using, e.g., bedtools.
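A rough sketch of the tag-based selection, in case it helps. It assumes the GTF marks canonical transcripts with the attribute `tag "Ensembl_canonical"` on `transcript` rows, which is what recent Ensembl releases use as far as I know; check your own file, since other annotations may use a different tag or none at all. The function name `canonical_transcript_ids` is made up for this example.

```python
import re

def canonical_transcript_ids(gtf_lines, tag="Ensembl_canonical"):
    """Collect transcript IDs whose 'transcript' row carries the given tag.

    Assumption: the tag appears in column 9 as: tag "Ensembl_canonical";
    """
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        # A GTF row has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = fields[8]
        if f'tag "{tag}"' not in attrs:
            continue
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        if match:
            ids.add(match.group(1))
    return ids
```

You would call it on an open file handle (any iterable of lines works) and then use the resulting ID set to filter the FASTA records or the GTF rows before extracting sequences.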

ADD REPLY • link 14 months ago by Darked89 4.7k

score 0 · Answer 1 · 2024-02-15

Which organism is it, and for what purpose do you need the data? For human and mouse you can use the CCDS database. The meaning of "canonical" may also vary by organism. In the end it may be a rather subjective choice (for human it is "what's annotated as canonical in the database"), and it depends on your purpose. I would prefer experimentally validated or manually annotated sequences over the longest sequence.
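If no curated set exists for your organism and you fall back on the longest-CDS-per-gene heuristic from the question, a minimal sketch could look like the one below. It assumes the FASTA headers carry a `[gene=...]` field, as NCBI `cds_from_genomic` files usually do; if your headers differ, adjust the regular expression. The function names are hypothetical, and ties are resolved by keeping the first record seen.

```python
import re

GENE_RE = re.compile(r"\[gene=([^\]]+)\]")  # assumed NCBI-style header field

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def longest_cds_per_gene(lines):
    """Return {gene: (header, sequence)} keeping the longest CDS per gene."""
    best = {}
    for header, seq in read_fasta(lines):
        match = GENE_RE.search(header)
        # Fall back to the record ID when no [gene=...] field is present.
        gene = match.group(1) if match else header.split()[0]
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best
```

Pass it an open file handle for your CDS FASTA and write the returned records back out. Keep in mind that this only removes redundancy; it does not tell you that the retained isoform is the biologically relevant one.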