I have a quick question that I managed to confuse myself with. I am about to index my reference sequence for some RNA-Seq data, and I am not sure whether I should use the cDNA, CDS, or transcript FASTA file.
I have run into the same situation. I used Kallisto to map the raw data against the transcripts of the reference pathogen. Now I would like to download the transcripts of the host reference, but the drop-down list for the assembled genome only offers "CDS from genomic FASTA (.fna)", "translated CDS (.faa)", or "Protein FASTA (.faa)".
I was wondering which of these corresponds to the transcripts, or how I can generate a transcript file from the available resources.
You should not map against the genome using Salmon. You can either download a transcriptome file directly, or download a genome file plus transcript annotations and use a tool like gffread to extract the transcript sequences. You most likely want to quantify against the cDNA rather than the CDS, because the cDNA sequences include the UTRs while the CDS covers only the coding region.
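To make the gffread route concrete, here is a minimal sketch that builds the command from Python. The file names (genome.fa, annotation.gff, transcripts.fa) and the helper name gffread_command are placeholders I made up for illustration; the flags are the ones gffread uses for this purpose (-g for the genome FASTA, -w for spliced transcripts, -x for spliced CDS only). Check gffread's help output for your installed version before relying on it.

```python
import os
import shutil
import subprocess


def gffread_command(genome_fa, annotation, out_fa, cds_only=False):
    """Build a gffread command that writes spliced sequences to out_fa.

    -w writes the spliced exons of each transcript (UTRs included),
    -x writes only the spliced CDS of each transcript.
    """
    flag = "-x" if cds_only else "-w"
    return ["gffread", flag, out_fa, "-g", genome_fa, annotation]


# Placeholder file names; replace them with your own downloads.
cmd = gffread_command("genome.fa", "annotation.gff", "transcripts.fa")
print(" ".join(cmd))

# Run the command only if gffread is installed and the inputs exist.
inputs_present = os.path.exists("genome.fa") and os.path.exists("annotation.gff")
if shutil.which("gffread") and inputs_present:
    subprocess.run(cmd, check=True)
```

The resulting transcripts.fa is what you would then hand to the Salmon or Kallisto indexing step. Passing cds_only=True gives you the CDS-only equivalent, which is closer to the "CDS from genomic FASTA" download.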
It is your choice. If you use a program like Salmon, you need to map to transcripts (if they are available for your genome). If you use a standard NGS aligner, you can align to the genome and then count reads with a program like featureCounts or htseq-count.
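As a rough sketch of the two routes, the snippet below only assembles the command lines so you can compare them side by side. All file and directory names, and the two helper functions, are placeholders of my own. It assumes paired-end reads for Salmon, and for the genome route it assumes you already have a BAM file from your aligner of choice. Paired-end counting with featureCounts needs additional options that differ between versions, so they are left out here.

```python
def salmon_commands(transcripts_fa, reads_1, reads_2,
                    index_dir="salmon_index", out_dir="salmon_quant"):
    """Route 1: index a transcript FASTA, then quantify reads against it."""
    index = ["salmon", "index", "-t", transcripts_fa, "-i", index_dir]
    # "-l A" asks Salmon to infer the library type automatically.
    quant = ["salmon", "quant", "-i", index_dir, "-l", "A",
             "-1", reads_1, "-2", reads_2, "-o", out_dir]
    return [index, quant]


def featurecounts_command(annotation_gtf, bam, out_counts="counts.txt"):
    """Route 2: count reads per gene from an existing genome alignment."""
    return ["featureCounts", "-a", annotation_gtf, "-o", out_counts, bam]


# Placeholder file names for illustration only.
for cmd in salmon_commands("transcripts.fa", "reads_1.fq.gz", "reads_2.fq.gz"):
    print(" ".join(cmd))
print(" ".join(featurecounts_command("annotation.gtf", "aligned.bam")))
```

Note the difference in inputs: the first route needs a transcript FASTA, while the second needs the genome (for the alignment) plus an annotation file.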
Many thanks