Question

Removing strandedness information for analysis with RSEM/STAR

0

4.6 years ago

omicsnstuff • 0

Hi all,

I've sequenced a good number of patient samples following the best protocol for assessing splicing and DGE and, as I was advised, I'm moving forward using the GTEx data as the control. I'm now noticing that gene expression is not comparable between these batches: many genes that are expressed in my internal controls and my patient samples are not expressed in GTEx.

With the exception of strandedness, the sequencing protocols are identical. My thinking is that this could be because the strand information was retained in my protocol but not in the GTEx protocol. Does this sound correct? If overlapping loci make it impossible to determine which strand a transcript originated from, will some genes not be counted?

According to this post, "TruSeq strand-specificity in rsem-calculate-expression", I can set the --forward-prob parameter to 0.5 for a non-strand-specific protocol (default: 0.5). I believe this might alleviate the problem?

With this setting, RSEM seems to be able to disregard the strand information in the data, making samples sequenced with a stranded protocol comparable to those sequenced with an unstranded one.
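
For reference, a minimal sketch of what such a call might look like, written as a small Python helper that only assembles the command line. The helper name, the file names, and the reference prefix are placeholders made up for this example. It assumes a transcriptome-aligned BAM (for instance, the one STAR writes with --quantMode TranscriptomeSAM) and a reasonably recent RSEM; older releases used --bam instead of --alignments, so check rsem-calculate-expression --help for your version.

```python
import shlex


def build_rsem_command(transcriptome_bam, reference_prefix, sample_name,
                       forward_prob=0.5, paired_end=True, threads=4):
    """Assemble (but do not run) an rsem-calculate-expression command.

    forward_prob=0.5 tells RSEM to treat the library as non-strand-specific.
    All paths and names passed in are placeholders chosen by the caller.
    """
    cmd = ["rsem-calculate-expression", "--alignments"]
    if paired_end:
        cmd.append("--paired-end")
    cmd += [
        "--forward-prob", str(forward_prob),
        "-p", str(threads),
        transcriptome_bam,   # input alignments in transcript coordinates
        reference_prefix,    # prefix given to rsem-prepare-reference
        sample_name,         # prefix for RSEM output files
    ]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_rsem_command("Aligned.toTranscriptome.out.bam", "ref/human", "patient01")
print(shlex.join(cmd))
```

To actually run it, the list could be passed to subprocess.run(cmd, check=True). Note that this only changes how RSEM models read orientation; it does not address other differences between library preparations.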

Can anyone tell me if this is correct?

Kind Regards,

RNA-Seq STAR RSEM TRANSCRIPTOMICS RNA • 1.9k views

ADD COMMENT • link 4.6 years ago by omicsnstuff • 0

2

You can always convert BAM files back to FASTQ and remap with any tool or option you want. I hope you are not putting your data and the GTEx data into the same statistical analysis for DGE. That is not meaningful, as batch effects will mask all biological differences, and the output will be full of false results. DGE analysis is only possible if all samples were processed in the same way: same lab, same kits, same sequencing regime.

ADD REPLY • link 4.6 years ago by ATpoint 85k

1

You would never have to remap with STAR. You can't tell STAR what the strandedness is even if you wanted to. It doesn't care. RSEM cares.

ADD REPLY • link 4.6 years ago by swbarnes2 14k

0

I was informed that normalising using TPMs allows for this type of inter-batch comparison. Is that incorrect?

ADD REPLY • link 4.6 years ago by omicsnstuff • 0

1

There is no way to correct for this batch effect. Whoever told you this is wrong. In fact, TPM is not even an appropriate normalization method for DEG analysis. Please use the search function to find one of the many threads that explain why. I fear your entire analysis plan is based on flawed information. Sorry about that, but I strongly recommend not doing as you were advised. I fear the results will be nonsense.
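
One reason TPM values are hard to compare across samples is that they are proportions within each sample: they always sum to one million, so a change in one highly expressed gene shifts the TPM of every other gene. A minimal sketch with made-up counts and gene lengths (all numbers are hypothetical and chosen only to illustrate the arithmetic):

```python
def tpm(counts, lengths_kb):
    """Compute TPM: length-normalise counts, then scale so they sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Three hypothetical genes, each 1 kb long.
lengths_kb = [1.0, 1.0, 1.0]

# Genes 1 and 2 have identical read counts in both samples;
# only gene 3 differs (8x more reads in sample B).
sample_a = [100, 100, 100]
sample_b = [100, 100, 800]

tpm_a = tpm(sample_a, lengths_kb)
tpm_b = tpm(sample_b, lengths_kb)

for gene, (a, b) in enumerate(zip(tpm_a, tpm_b), start=1):
    print(f"gene{gene}: TPM in A = {a:.0f}, TPM in B = {b:.0f}")
```

Genes 1 and 2 end up with much lower TPM in sample B even though their counts did not change, purely because gene 3 takes up a larger share of that library. This is a toy illustration of the composition issue, not a replacement for the normalization used by dedicated DEG tools.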

The choice of the kit and the RNA extraction method are the main sources of bias. You can easily see this by downloading a couple of RNA-seq datasets from the same cell types. If you process them 100% identically in silico and then perform PCA, they will cluster by study and never by cell type. Different library prep methods favour different kinds of genes, whether intended or not. To the best of my knowledge, there is no way to include unrelated studies in the same DEG analysis. You should discuss this issue with your PI and come up with a different analysis strategy, or create normals yourself.

ADD REPLY • link 4.6 years ago by ATpoint 85k

1

Something is not quite right with your problem statement. The strandedness of a protocol has nothing to do with forward or reverse transcripts. Stranded RNA-Seq protocols are about the sense and antisense orientation of the reads themselves.

In stranded protocols, the first read in a pair always has a fixed orientation relative to the transcript; which one depends on the protocol (for example, it may match the orientation of the transcript). The transcript itself could come from either the forward or the reverse strand.
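
As a rough illustration of this point, here is a minimal Python sketch that infers the transcript strand from how a read aligned. The function name and the "forward"/"reverse" protocol labels are made up for this example: "forward" assumes read 1 has the same orientation as the transcript, and "reverse" assumes read 1 is its reverse complement (as in dUTP-based kits).

```python
def transcript_strand(read_is_reverse, is_read1, protocol):
    """Infer the genomic strand of the originating transcript.

    read_is_reverse: True if the read aligned to the reverse genomic strand.
    is_read1:        True for the first read in a pair, False for the second.
    protocol:        "forward" or "reverse" (hypothetical labels, see above).
    """
    if protocol not in ("forward", "reverse"):
        raise ValueError("protocol must be 'forward' or 'reverse'")

    read_on_plus = not read_is_reverse
    # In a forward-stranded library read 1 shares the transcript's orientation
    # and read 2 is opposite; in a reverse-stranded library it is the other way.
    same_as_transcript = is_read1 if protocol == "forward" else not is_read1
    transcript_on_plus = read_on_plus if same_as_transcript else not read_on_plus
    return "+" if transcript_on_plus else "-"


# The same alignment implies opposite transcript strands under the two protocols.
print(transcript_strand(read_is_reverse=False, is_read1=True, protocol="forward"))
print(transcript_strand(read_is_reverse=False, is_read1=True, protocol="reverse"))
```

With an unstranded library no such rule exists, so the read's alignment strand says nothing about which strand the transcript came from.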

Stranded protocols can be used to detect sense/antisense transcription and other information with respect to the directionality of the transcript.

Finally, genes in general don't overlap in their coding regions, so I don't understand how you could have a systematic problem where reads matching genes on the reverse strand would be assigned to genes on the forward strand. This indicates another problem somewhere in the downstream analysis.

ADD REPLY • link 4.6 years ago by Istvan Albert 102k

0

I've edited the post; hopefully my problem is clearer now?

ADD REPLY • link 4.6 years ago by omicsnstuff • 0

0

genes on the reverse strand aren't picked up in the analysis

I don't understand this. You think that half your genes just aren't in your library because they are on the wrong strand? Or that half your genes can't be counted because they are on the wrong strand?