Would it be possible to align a whole RNA-seq dataset against just a small, specific set of transcripts rather than a whole transcriptome? For example, just Hox genes, or just Wnt genes.
I am asking because I am working with non-model organisms, with no genome or decent transcriptome available. After several attempts at de novo transcriptome assembly, I could not get complete genes for most of the ones I am interested in, and I got far too many chimeric genes. But after manual curation and tons of PCR, I now have very reliable sequences for my set of transcripts. I would like to get expression level measures for this set of 43 genes in 8 developmental stages, and although qPCRs are possible, I would rather first try to use the RNA-seq data I have.
I thought of doing something similar to this: "Create GFF from de novo assembly to input on htseq-counts"
Align the RNA-seq datasets against the 43 genes (really low % of alignment expected), count the tags and calculate TPM myself. I just need the TPMs to then standardize (z-score) the data by gene.
Would that make sense?
Edit: edited title. Sorry if I didn't explain this well: I want to align and count.
So, I have a multi-FASTA of 43 genes whose sequences I have manually curated, and now I want some measure of their expression levels at different developmental time points. What I plan to do is align the RNA-seq data against those 43 genes, let's say using bowtie, then count the aligned reads, using for example samtools, and then calculate TPMs.
bowtie --> samtools --> TPM --> z-scores
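To make the last two steps concrete, here is a minimal sketch of the TPM and per-gene z-score calculation, assuming per-gene read counts have already been obtained from the alignments. The gene names, lengths, and counts are hypothetical, and plain transcript length is used rather than an effective length.

```python
import math

def tpm(counts, lengths):
    """TPM from read counts and transcript lengths (bp), both dicts keyed by gene."""
    # Reads per base for each gene.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Scale so the values sum to one million over the reference set.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

def zscore(values):
    """Standardize a list of values using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        return [0.0] * n
    return [(v - mean) / sd for v in values]

# Hypothetical data: two genes, three developmental stages.
lengths = {"hox1": 1200, "wnt1": 1500}
counts_by_stage = [
    {"hox1": 30, "wnt1": 10},
    {"hox1": 60, "wnt1": 10},
    {"hox1": 90, "wnt1": 10},
]
tpm_by_stage = [tpm(stage_counts, lengths) for stage_counts in counts_by_stage]
# z-score each gene across stages.
z_by_gene = {g: zscore([stage[g] for stage in tpm_by_stage]) for g in lengths}
print(z_by_gene)
```

Note that TPM always sums to one million over whatever reference is used, so with only 43 genes in the reference each value is a share of those 43 genes, not of the whole transcriptome.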
The post I cite is just similar to my question, but I don't need the GTF. I was citing it only because the answer led to something similar to my problem. But while the whole transcriptome assembly is used there, I wonder if using just a small set would be OK, since all the methods I've seen align against the whole transcriptome. As a matter of fact, I ran RSEM against just this set of 43 genes and obtained insanely high expression levels (which are not true), so I was wondering whether what I intend to do is flawed somehow.
As for your last question, you can align directly against a transcriptome.
Your proposed method will lead to incorrectly high counts: the reads come from the whole transcriptome, but bowtie only has a few genes to align them against, so it will produce more false-positive alignments. Use salmon or kallisto to get counts against the entire transcriptome and subset those to whatever you need.
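The subsetting step could look roughly like this, assuming salmon's quant.sf output, a tab-separated table with Name, Length, EffectiveLength, TPM, and NumReads columns. The function name and transcript IDs are hypothetical.

```python
import csv

def subset_quant(lines, wanted):
    """Return {transcript: (TPM, NumReads)} for the transcript IDs in `wanted`.

    `lines` is any iterable of text lines in quant.sf format,
    for example an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        row["Name"]: (float(row["TPM"]), float(row["NumReads"]))
        for row in reader
        if row["Name"] in wanted
    }

# Hypothetical usage with the IDs of the curated genes:
# with open("quant.sf") as handle:
#     my_genes = subset_quant(handle, {"hox1", "wnt1"})
```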