Question

Aligning RNA-seq data against just a small set of transcripts of interest (and counting)

0

8.5 years ago

jpascualanaya ▴ 10

Would it be possible to align a whole RNA-seq dataset against just a small, specific set of transcripts rather than a whole transcriptome? For example, just Hox genes or just Wnt genes.

I am asking because I work with non-model organisms that have no genome or decent transcriptome available. After several attempts at de novo transcriptome assembly, I could not recover complete genes for most of the ones I am interested in, and there were far too many chimeric genes. However, after manual curation and tons of PCR, I now have very reliable sequences for my set of transcripts. I would like to measure the expression levels of this set of 43 genes across 8 developmental stages, and although qPCR is possible, I would rather first try to use the RNA-seq data I already have.

I thought of doing something similar to this: Create GFF from de novo assembly to input on htseq-counts

Align the RNA-seq datasets against the 43 genes (really low % of alignment expected), count the tags and calculate TPM myself. I just need the TPMs to then standardize (z-score) the data by gene.
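The TPM and z-score steps described above can be sketched roughly as follows. This is a minimal illustration, not a validated pipeline: the function names `tpm` and `zscore_by_gene` and the gene names are hypothetical, and it assumes raw read counts and transcript lengths are already available for each stage. Note that TPM always sums to one million over whatever reference is used, so values computed over only 43 genes describe each gene's share within that set rather than within the whole transcriptome.

```python
from statistics import mean, pstdev


def tpm(counts, lengths):
    """Convert raw counts to TPM.

    counts:  {gene: number of reads assigned to the gene}
    lengths: {gene: transcript length in bases}
    """
    # Reads per base for each gene (length normalization).
    rate = {g: counts[g] / lengths[g] for g in counts}
    total = sum(rate.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    # Scale so that the values sum to one million.
    return {g: r / total * 1e6 for g, r in rate.items()}


def zscore_by_gene(tpm_by_stage):
    """Standardize each gene across stages.

    tpm_by_stage: list of {gene: TPM}, one dict per developmental stage.
    Returns {gene: [z-score per stage]}; genes with no variation get 0.0.
    """
    result = {}
    for gene in tpm_by_stage[0]:
        values = [stage[gene] for stage in tpm_by_stage]
        mu = mean(values)
        sd = pstdev(values)  # population standard deviation
        result[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return result


# Hypothetical counts for two genes in one stage.
example = tpm({"HoxA1": 10, "Wnt3": 20}, {"HoxA1": 1000, "Wnt3": 1000})
```

Because the z-score is computed per gene, it removes differences in absolute level between genes and keeps only the profile across stages.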

Would that make sense?

Edit: edited title. I want to align and count

RNA-Seq alignment transcriptomics • 2.0k views

ADD COMMENT • link 8.5 years ago by jpascualanaya ▴ 10

score 0 · Answer 1 · 2016-05-18

0

8.5 years ago

WouterDeCoster 47k

I have the impression that you are mixing up two things. Do you want to align only against a small set of transcripts (as your title says) or do you want to perform counting only for a certain set (as indicated by the custom gff)? To what will you align if you don't have a reference genome available?

ADD COMMENT • link 8.5 years ago by WouterDeCoster 47k

0

Sorry if I didn't explain it well. I want to align and count.

So, I have a multi-FASTA of 43 genes whose sequences I have manually curated, and now I want some measure of their expression levels at different developmental stages. What I plan to do is align the RNA-seq data against those 43 genes, let's say using bowtie, then count the aligned reads, for example with samtools, and then calculate TPMs.

bowtie --> samtools --> TPM --> z-scores
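For the counting step in this pipeline, one option is to read the per-reference counts from `samtools idxstats`, which prints one tab-separated line per reference sequence: name, sequence length, mapped read-segments, unmapped read-segments. The sketch below parses that text into counts and lengths. The function name `parse_idxstats` and the example gene names and numbers are hypothetical, and it assumes the output was captured from a sorted, indexed BAM.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` output into (counts, lengths) dicts.

    Each line is: reference name, sequence length,
    mapped read-segments, unmapped read-segments (tab-separated).
    """
    counts, lengths = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # The final "*" line summarizes reads with no reference.
            continue
        lengths[name] = int(length)
        counts[name] = int(mapped)
    return counts, lengths


# Hypothetical idxstats output for two curated transcripts.
example_text = "HoxA1\t1200\t350\t0\nWnt3\t2100\t90\t0\n*\t0\t0\t125000\n"
counts, lengths = parse_idxstats(example_text)
```

These counts are alignment records per reference, so secondary or multi-mapped alignments, if the aligner reports them, would be counted more than once.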

The post I cite is only similar to my question, and I don't need the GTF. I cited it because the answer led to something close to my problem, but while the whole transcriptome assembly is used there, I wonder whether using just a small set would be OK, since all the methods I've seen align against the whole transcriptome. As a matter of fact, I ran RSEM against just this set of 43 genes and obtained insanely high expression levels (which are not real), so I was wondering whether what I intend to do is flawed somehow.

As for your last question, you can align directly against a transcriptome.

ADD REPLY • link 8.5 years ago by jpascualanaya ▴ 10

2

Your proposed method will lead to incorrectly high counts: the reads come from the whole transcriptome, but bowtie only has a few genes to align them against, so it will produce more false-positive alignments. Use salmon or kallisto to get counts against the entire transcriptome and subset that to whatever you need.
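The subsetting step could look roughly like this, assuming salmon was run against a full transcriptome reference and produced its usual `quant.sf` table (tab-separated columns `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`). The function name `subset_quant` and the transcript names are hypothetical.

```python
import csv
import io


def subset_quant(quant_text, wanted):
    """Keep only the transcripts of interest from a salmon quant.sf table.

    quant_text: full contents of quant.sf as a string
    wanted:     set of transcript names to keep
    Returns {name: (TPM, NumReads)}.
    """
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    return {
        row["Name"]: (float(row["TPM"]), float(row["NumReads"]))
        for row in reader
        if row["Name"] in wanted
    }


# Hypothetical quant.sf contents with three transcripts.
example_quant = "\n".join([
    "\t".join(["Name", "Length", "EffectiveLength", "TPM", "NumReads"]),
    "\t".join(["HoxA1", "1200", "1010.5", "12.5", "340"]),
    "\t".join(["Wnt3", "2100", "1910.5", "1.7", "88"]),
    "\t".join(["other_tx", "900", "710.5", "250.0", "4800"]),
])
subset = subset_quant(example_quant, {"HoxA1", "Wnt3"})
```

The TPM values are taken as reported rather than rescaled after subsetting, so they stay relative to the whole transcriptome.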

ADD REPLY • link 8.5 years ago by Devon Ryan 104k