Question

Analyzing RNA-seq data in Cytoscape

9.5 years ago

sgblackpearl ▴ 10

If you want to do a correlation analysis on RNA-seq data, how do you compare gene A and gene B when gene A has 5 transcripts and gene B has, say, 6 transcripts? Should I take the mean expression across the transcripts of gene A and of gene B, respectively?

next-gen RNA-Seq • 3.8k views

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 9.5 years ago by sgblackpearl ▴ 10

Instead of averaging (which might lump together isoforms with widely different expression patterns), use the APPRIS database. Assuming your data is from one of the well-studied model organisms, you can go here to find the principal isoform for any given gene, for example the page for TP53. If a gene has more than one principal isoform, I guess you could choose either.

The APPRIS annotations are also available as part of Ensembl BioMart.

Then take only that isoform and do comparisons.

ADD REPLY • link updated 3.2 years ago by Ram 45k • written 9.5 years ago by Amitm ★ 2.3k

Thanks, Amit, for your suggestion. Unfortunately my data is not from a model organism; it is from an unsequenced genome. My DGE list contains around 5k transcripts belonging to around 1000 genes. What should I do? Should I go for the highest-scoring transcript?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.5 years ago by sgblackpearl ▴ 10

There is no single "accepted" way to do this. Depending on your use case, though, choosing the highest-scoring transcript seems reasonable. If this is RNA-seq, why not just summarize to the gene level to begin with?
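As a rough illustration of summarizing to the gene level, here is a minimal Python sketch. It assumes the per-transcript values are additive quantities such as read counts or TPM, and that a transcript-to-gene mapping is available. The names `summarize_to_gene`, `transcript_expr` and `tx2gene`, and all the numbers, are hypothetical and not taken from the poster's data.

```python
def summarize_to_gene(transcript_expr, tx2gene):
    """Sum per-sample transcript values into one vector per gene.

    transcript_expr: dict of transcript ID -> list of values, one per sample
    tx2gene:         dict of transcript ID -> gene ID (assumed to be known)
    """
    gene_expr = {}
    for tx, values in transcript_expr.items():
        gene = tx2gene[tx]
        if gene not in gene_expr:
            # First transcript seen for this gene: start from a copy.
            gene_expr[gene] = list(values)
        else:
            # Add this transcript's values sample by sample.
            gene_expr[gene] = [a + b for a, b in zip(gene_expr[gene], values)]
    return gene_expr


# Hypothetical toy data: three samples, two genes.
transcript_expr = {
    "geneA.t1": [10, 20, 30],
    "geneA.t2": [1, 2, 3],
    "geneB.t1": [5, 5, 5],
}
tx2gene = {"geneA.t1": "geneA", "geneA.t2": "geneA", "geneB.t1": "geneB"}

gene_expr = summarize_to_gene(transcript_expr, tx2gene)
print(gene_expr)
```

Summing keeps the gene-level value proportional to total output from the locus, whereas averaging would make it depend on how many isoforms happen to be annotated.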

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.5 years ago by Sean Davis 27k

hi,

You say it's an unsequenced genome, and yet you have isoform information available. I'm not sure whether you know which of the isoforms are protein-coding. If that information is available, you could select based on it.

In my experience with RNA-seq data (human only, though), I have often seen a transcript isoform with a very high expression level whose biotype turns out to be a nonsense-mediated decay (NMD) candidate, a 'processed transcript', or a similar ncRNA variant.

Hence I am not comfortable with going for the highest-expressed isoform. But again, I am not sure how comprehensive the gene annotation is for your organism. If it mostly covers protein-coding variants and not many ncRNAs, then you could summarize the isoforms for each gene, as Sean already suggested.
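To make the concern concrete, here is a small Python sketch of picking the highest-expressed isoform per gene only among transcripts with an allowed biotype. It assumes a biotype label is available for each transcript, which may not hold for a poorly annotated organism. The function `pick_representative`, the transcript IDs and the values are all hypothetical.

```python
def pick_representative(transcript_expr, tx2gene, biotype, keep=("protein_coding",)):
    """Return gene ID -> ID of its highest-expressed transcript with an allowed biotype.

    Genes with no transcript of an allowed biotype are left out of the result.
    """
    best = {}  # gene ID -> (transcript ID, mean expression)
    for tx, values in transcript_expr.items():
        if biotype.get(tx) not in keep:
            continue  # skip NMD candidates, processed transcripts, unlabeled, etc.
        gene = tx2gene[tx]
        mean_expr = sum(values) / len(values)
        if gene not in best or mean_expr > best[gene][1]:
            best[gene] = (tx, mean_expr)
    return {gene: tx for gene, (tx, _) in best.items()}


# Hypothetical toy data: the most abundant isoform is an NMD candidate.
transcript_expr = {
    "geneA.t1": [100, 120, 90],
    "geneA.t2": [10, 12, 9],
    "geneA.t3": [30, 25, 35],
}
tx2gene = {tx: "geneA" for tx in transcript_expr}
biotype = {
    "geneA.t1": "nonsense_mediated_decay",
    "geneA.t2": "protein_coding",
    "geneA.t3": "protein_coding",
}

chosen = pick_representative(transcript_expr, tx2gene, biotype)
print(chosen)
```

Without the biotype filter, `geneA.t1` would be selected simply because it is the most abundant.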

ADD REPLY • link 9.5 years ago by Amitm ★ 2.3k

Exactly, Amit. The annotation is not comprehensive for this organism. What exactly do you mean by summarizing the isoforms for each gene? Taking the average?

ADD REPLY • link 9.5 years ago by sgblackpearl ▴ 10

hi,

What I meant was that you could average over the isoforms for each gene. This should be OK if most isoforms are protein-coding variants and not many ncRNA isoforms are present.

ADD REPLY • link 9.5 years ago by Amitm ★ 2.3k

Thanks Amit

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.5 years ago by sgblackpearl ▴ 10

Amit, do you have a script for calculating the Pearson correlation between gene A and gene B, each with expression data from multiple time points?

ADD REPLY • link 9.5 years ago by sgblackpearl ▴ 10

I have used R for calculating correlation coefficients, using the cor.test command. I must say I haven't done much specifically with gene expression. I tried MINE on a large set of Illumina BeadArrays and also tried this R package, which also employs non-parametric distance measures. This was exploratory work, and I left it halfway after running into memory problems (>100 arrays, >40k genes). I can't help you much beyond that.
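For the Pearson coefficient itself, a minimal sketch in plain Python might look like the following. It assumes each gene has already been reduced to one expression value per time point and that both genes were measured at the same time points. The function name `pearson` and the example vectors are made up for illustration. Unlike cor.test in R, this returns only the coefficient, not a p-value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical gene-level expression at five time points.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0]

r = pearson(gene_a, gene_b)
print(r)
```

If SciPy is available, `scipy.stats.pearsonr(gene_a, gene_b)` also reports a p-value alongside the coefficient.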

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.5 years ago by Amitm ★ 2.3k