Hi all,
I am new to RNA-seq analysis. I am currently trying to use the salmon, tximport, edgeR pipeline to process my human RNA-seq results on Galaxy. The cDNA library for my RNA-seq was generated with polyA selection.
I am a bit confused about the normalisation steps.
For salmon, I aligned my reads to the human transcriptome and supplied the human GFF file to get the quant.genes.sf output. However, the TPM values are still labelled with ENST00000XXXXXX.X instead of ENSGXXXXXXXXXXX. Does that mean salmon failed to recognise the GFF file, and my TPM values are still per transcript rather than per gene?
If salmon failed to produce a correct quant.genes.sf file, I would like to use tximport to aggregate my transcripts to genes from my quant.sf files. But tximport offers 4 options for "Summarization using the abundance (TPM) values?": i) No, ii) scaled up to library size, iii) scaled using the avg. transcript length over samples and then the library size, iv) scaled using the median transcript length among isoforms of a gene, and then library size.
Which option should I use if I want to follow up with edgeR on Degust? Will I "over-normalise" my results if I choose the wrong option to go with edgeR?
Any help would be appreciated. Many thanks in advance!
James
If you already ran salmon at the transcript level, there is no need any more to provide it with a GFF file of the human genome annotation (it will not even work, I think).
You can safely continue to tximport, which will do the summarisation to gene level.
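To make the summarisation step concrete, here is a minimal Python sketch of what summing transcripts to genes amounts to. It is not tximport itself (tximport is an R package and does more than this); the function name `summarize_to_gene` and all IDs and values are made up for illustration.

```python
from collections import defaultdict


def summarize_to_gene(tx_values, tx2gene):
    """Sum per-transcript values (e.g. TPM or NumReads) into per-gene totals.

    tx_values: dict of transcript ID -> value for one sample
    tx2gene:   dict of transcript ID -> gene ID
    Transcripts missing from tx2gene are skipped (an assumption of this sketch).
    """
    gene_totals = defaultdict(float)
    for tx_id, value in tx_values.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is None:
            continue
        gene_totals[gene_id] += value
    return dict(gene_totals)


# Hypothetical IDs and TPM values, not real annotation
tx2gene = {"ENST_A.1": "ENSG_X", "ENST_B.2": "ENSG_X", "ENST_C.1": "ENSG_Y"}
tpm = {"ENST_A.1": 10.0, "ENST_B.2": 5.0, "ENST_C.1": 20.0}

gene_tpm = summarize_to_gene(tpm, tx2gene)
print(gene_tpm)
```

The two isoforms of the made-up gene ENSG_X are added together, so the gene-level table has one row per gene instead of one row per transcript.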
One thing you might consider is using a transcriptome version with one transcript per locus.
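On the tximport options listed in the question: as far as I understand them, "scaled up to library size" rescales the gene-level TPM so that each sample sums to its number of mapped reads, while "scaled using the avg. transcript length over samples and then the library size" first multiplies the TPM by the gene length averaged over samples and then applies the same rescaling. Below is a rough Python sketch of that arithmetic for a single sample. It assumes the gene-level TPM and the averaged lengths are already computed; the function names and all numbers are hypothetical and are not tximport's API.

```python
def scale_to_library_size(values, library_size):
    """Rescale values so that they sum to library_size."""
    total = sum(values.values())
    return {gene: v * library_size / total for gene, v in values.items()}


def counts_from_abundance(gene_tpm, gene_length, library_size, mode):
    """Derive count-like values from TPM for one sample.

    mode "scaledTPM":       TPM scaled up to the library size
    mode "lengthScaledTPM": TPM times average gene length, then scaled
                            up to the library size
    gene_length is assumed to be the per-gene length already averaged
    over samples.
    """
    if mode == "scaledTPM":
        raw = dict(gene_tpm)
    elif mode == "lengthScaledTPM":
        raw = {g: gene_tpm[g] * gene_length[g] for g in gene_tpm}
    else:
        raise ValueError("unknown mode: " + mode)
    return scale_to_library_size(raw, library_size)


# Hypothetical sample: two genes, 1000 mapped reads in total
gene_tpm = {"ENSG_X": 600000.0, "ENSG_Y": 400000.0}
gene_length = {"ENSG_X": 1000.0, "ENSG_Y": 3000.0}

scaled = counts_from_abundance(gene_tpm, gene_length, 1000.0, "scaledTPM")
length_scaled = counts_from_abundance(gene_tpm, gene_length, 1000.0, "lengthScaledTPM")
print(scaled)
print(length_scaled)
```

In both modes the values sum to the library size; the difference is that the length-scaled version gives the longer gene more of the reads, which is closer to how raw counts behave.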
lieven.sterck: which DESeq2 file do we need to give as input to tximport, and which GTF file do we need to provide?
I don't know the exact name of that file, but it is the one with the counts in it (a tabular file with a number of columns, one of which is called TPM, I think).
For the GTF, use the one that links every transcript to its locus (i.e. the one from which you can determine which isoforms come from the same gene locus).
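As an illustration of what "links every transcript to its locus" means in practice, here is a small Python sketch that pulls a transcript-to-gene table out of GTF lines. It assumes the usual GTF layout (9 tab-separated columns, with `gene_id "..."` and `transcript_id "..."` in the last one); the function name `tx2gene_from_gtf_lines` and the example records are made up. Whether the transcript IDs carry a version suffix (the `.X` part) depends on where the GTF comes from, so check that they match the IDs in quant.sf.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def tx2gene_from_gtf_lines(lines):
    """Build a dict of transcript ID -> gene ID from GTF lines.

    Only rows whose feature type (column 3) is "transcript" are used.
    """
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


# Hypothetical GTF records, not real annotation
gtf_lines = [
    "#!made-up header\n",
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_X";\n',
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG_X"; transcript_id "ENST_A.1";\n',
    'chr1\tsrc\ttranscript\t100\t700\t.\t+\t.\tgene_id "ENSG_X"; transcript_id "ENST_B.2";\n',
]

tx2gene = tx2gene_from_gtf_lines(gtf_lines)
print(tx2gene)
```

With a real file you would pass the open file handle instead of the list, e.g. `tx2gene_from_gtf_lines(open(path))`, where `path` is wherever your GTF lives.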