I typically download cDNAs directly from Ensembl (e.g. with wget ftp://ftp.ensembl.org/pub/release-93/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz), build a kallisto index, and run kallisto quant to estimate isoform abundance.
However, Ensembl tends to provide very detailed transcript models. Furthermore, the cDNA files from Ensembl also contain many transcripts with non-coding biotypes, ranging from NMD (nonsense-mediated decay) to retained intron.
So I was wondering whether it would be better practice to filter the provided cDNA FASTA for just those transcripts with a CCDS ID, or to filter by biotype (such as "protein coding").
As an example, a CCDS filter would cut the number of cDNAs for https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000077782 from 41 to 9.
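To make the biotype option concrete, here is a minimal sketch of the kind of filter I have in mind. It assumes the Ensembl cDNA FASTA headers carry a "transcript_biotype:" field with values like "protein_coding" (worth checking against the actual file); the function names and the output file name are hypothetical and just for illustration.

```python
import gzip
import re

# Assumption: each Ensembl cDNA header contains "transcript_biotype:<value>".
BIOTYPE_RE = re.compile(r"\btranscript_biotype:(\S+)")


def filter_by_biotype(lines, keep=frozenset({"protein_coding"})):
    """Yield only the FASTA lines of records whose transcript biotype is in `keep`."""
    keeping = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts: decide whether to keep it based on its header.
            match = BIOTYPE_RE.search(line)
            keeping = bool(match) and match.group(1) in keep
        if keeping:
            yield line


def filter_fasta_file(in_path, out_path, keep=frozenset({"protein_coding"})):
    """Stream a gzipped FASTA through the filter and write a plain-text FASTA."""
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        dst.writelines(filter_by_biotype(src, keep))


# Example call (hypothetical output name):
# filter_fasta_file("Homo_sapiens.GRCh38.cdna.all.fa.gz",
#                   "Homo_sapiens.GRCh38.cdna.protein_coding.fa")
```

The filtered FASTA would then be passed to kallisto index in place of the full file. A CCDS filter would presumably need a separate list of transcript IDs (for example exported from BioMart), since as far as I can tell the FASTA headers do not include CCDS IDs.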
How sensitive is kallisto to overly complex or redundant gene architectures?