Question

Bioinformatical sleuthing: log2 fold-change vs log2 fold-change scatterplot has a "cross" shape, what's wrong here?

2

Entering edit mode

7.5 years ago

moldach ▴ 130

Hello,

I'm trying some forensic bioinformatics (a le Baggerly and Coombs) to reproduce a scatter plot figure of log2 fold-change vs log2 fold-change. These are two shRNA knockout treatments against a shRNA control (shPLKO) and the log2 fold-change values are derived from cuffdiff output.

Here is the figure produced by the author http://i63.tinypic.com/25aqb1j.png

At first I plotted the gene_exp.diff file's log2 fold-change: http://i64.tinypic.com/90tx5t.png First thing to note: it doesn't look like the above picture - the author probably plotted isoform_exp.diff output from cuffdiff. Second thing is that there are some weird artifacts (e.g. the lines along the far left and the top as well as across zero for K562_shBRD4).

I got this via linux commands:

awk '$5 ~K562_shBRD4/ && $6 ~/K562_shPLKO/ {print $0}' gene_exp.diff > K562_shBRD4-shPLKO.diff # Grab every pariwise comparison from column 5 (sample 1) and column 6 (sample 2)
awk '$5 ~/K562_shMTHFD1/ && $6 ~/K562_shPLKO/ {print $0}' gene_exp.diff > K562_shMTHFD1-shPLKO.diff
cut -f 3,10 K562_shBRD4-shPLKO.diff > plko-brd4.diff # Cut out the "gene" column and "log2 fold-change" columns
cut -f 3,10 K562_shMTHFD1-shPLKO.diff > plko-mthfd1.diff
awk 'NR==FNR {h[$1] = $2; next} {print $1,$2,h[$1]}' plko-brd4.diff plko-mthfd1.diff > brd4-mthfd1.csv # Grab the second column of plko-mthfd1.diff (based on similar "gene" name in column 1 of plko-brd4.diff) and add it to a third column of a new file

Then plotted in R:

dt1 <- read.table("HAP1-brd4-mthfd1.csv", sep=" ")  # the /t separation of the .diff files was coerced into <space> separated via the "NR==FNR.." command
df1 <- as.data.frame(dt1)
colnames(df1) <- c("Gene", "log2fold_HAP1_shMTHFD1", "log2fold_HAP1_shBRD4")
df1 <- na.omit(df1)
gg_df1 <- df1 %>% ggplot(aes(x=log2fold_HAP1_shMTHFD1,y=log2fold_HAP1_shBRD4))
gg_df1 + geom_point(data=df1, alpha = 0.2, color="blue", na.rm = TRUE

The lines along the far left and top were caused by +/- inf values that cuffdiff produces. I removed them via:

awk '$2 !="inf" && $3 !="inf"' brd4-mthfd1.diff > 1brd4-mthfd1.diff # Remove anything with + inf values for either shBRD4 or  shMTHFD1
awk '$2 !="-inf" && $3 !="-inf"' 1brd4-mthfd1.diff > 2brd4-mthfd1.diff

I also removed the line in the center, to produce a good looking scatterplot (http://i65.tinypic.com/aep5w4.png), via:

awk '$2 !="0"' 2brd4-mthfd1.diff > 3brd4-mthfd1.diff

Okay so the above was to demonstrate the work I've done and that I've confirmed that there are also some "unwanted" information (artifacts) when I plot all (gene_exp.diff) log2 fold-change of FPKM values of shRNA treatments compared to the control.

When I plotted the isoform_exp.diff log2 fold-change there is an odd cross shape in the figure: http://i65.tinypic.com/2vjtoua.png

So obviously there was some sort of filtering that the author performed to remove values from the left and bottom parts of the cross. I think that these could likely be removed by specifying a range to remove. One way to possibly remove these artifacts is via some pseudo-code I thought of. Just a heads-up that I'm not even sure this awk command works.

awk '$2 !="1", $2 !="-1"' | awk '$3 !="1", $3 !="-1"' 3brd4-mthfd1.diff > 4brd4-mthfd1.diff # Remove a range of data in the "left" and "bottom" part of the cross, but not the data further down?

OK so even if the above worked (which it probably doesn't) it will not solve my problem of understanding what the author filtered the raw data on. They obviously filtered something out. But HOW? And more importantly, WHY?

When I tried filtering the data first by a log2 fold-change > 2 and FDR p-value (q-value) <0.05 via:

awk '$10>2.0 || $10(2.0* -1)' isoform_exp.diff | awk '$13<0.05' fc-q.diff
awk '$5 ~K562_shBRD4/ && $6 ~/K562_shPLKO/ {print $0}' gene_exp.diff > K562_shBRD4-shPLKO.diff 
# etc. as was done at the very top

I lose most of the data: http://i64.tinypic.com/5dqj4h.png

My next plan of action I guess is to use a less stringent fold change (and no significant values). Maybe the author was filtering (somehow) by genes which don't change in one condition but a lot in the other condition?

rna-seq R visualization scatterplot • 4.6k views

ADD COMMENT • link 7.5 years ago by moldach ▴ 130

0

Entering edit mode

I have no clue why you can not reproduce their results... However I don't think that you can call "artefact" a gene whith log2FC = {-inf, 0, inf}. Sometimes genes are not expressed in one specific condition, meaning that their log2FC goes to the infinite. This is a biological fact, not an artefact.

ADD REPLY • link 7.5 years ago by Carlo Yague 9.0k

0

Entering edit mode

This is true I should probably should have said unwanted then am I right as I still wouldn't want to include these in the figure. I would want to filter out those zero count features because I cannot base a conclusion or an interpretation on the absence of data.

Okay so I understand that one gets -inf ([log2(0)=-inf) from zero count but how does one get a +inf value?

ADD REPLY • link 7.5 years ago by moldach ▴ 130

0

Entering edit mode

The fold change is the ratio between the expression of a gene in condition A vs B. So the log2FC = log2(A/B);

if A=0, then log2FC = -inf.
if B=0, then log2FC = +inf
if A=B, then log2FC = 0

More verbosely, if a gene is not expressed in condition A, the log2FC is -inf. But it doesn't mean that this gene is not expressed in condition B and is not interesting.

ADD REPLY • link 7.5 years ago by Carlo Yague 9.0k

0

Entering edit mode

By the way, have you tried to filter the genes based on mean expression (FPKM or whatever) accross conditions ? Log2FC is "unstable" for genes poorly expressed. And its always better to filter with a different metric than the one you want to explore (log2FC in your case).

ADD REPLY • link 7.5 years ago by Carlo Yague 9.0k

0

Entering edit mode

When I plotted a heatmap of FPKM values (from cufflinks) I filtered this based on FDR p-value from cuffdiff. I also tried to filter log2 fold-change values based on FDR p-value but unsuccessfully. How would one filter log2FC via FPKM? My understanding that FPKM wasn't optimal to use because it's not comparable between samples.

ADD REPLY • link 7.5 years ago by moldach ▴ 130

0

Entering edit mode

When I plotted a heatmap of FPKM values (from cufflinks) I filtered this based on FDR p-value from cuffdiff.

It doesn't make sense to do that (unless you want to see the expression of the differentially expressed genes only). But generally you will want to take all genes (or a class of genes) into account when plotting the expression in A vs the expression in B.

FPKM wasn't optimal to use because it's not comparable between samples.

In this case, it doesn't really matter because you don't use the FPKM to compare your samples between them. But you can also use raw counts if you like.

ADD REPLY • link 7.5 years ago by Carlo Yague 9.0k

0

Entering edit mode

What I would normally do when comparing the results of two differential expression analyses is to filter the genes for expression level. This is especially critical for the isoform level, which can be very noisy depending on the annotation used. That's something that should be specified in a paper however. The author's figure is still problematic anyway, why would there be different clusters of genes / isoforms correlating with a different average slope? Did you actually use their cuffdiff results or did you have to run the program on the raw data? I think an important point would be whether you want to replicate their analysis or perform a differential expression analysis according to the best practices. Cufflinks has long been set aside in that regard. You can try featureCounts, DESeq2 for gene level or RSEM, salmon for isoform level.

ADD REPLY • link 7.5 years ago by Martombo ★ 3.2k

0

Entering edit mode

I actually have access to their cuffdiff results. It's true this should have been specified more clearly to me what the author actually did.

By problematic (different clusters correlating with different average slope) are you referring to the purple lines I've superimposed? http://i63.tinypic.com/2iglv2a.png

What about those encircled in red - which are the up and right parts of my cross shape- I'm not sure what this says about the data?

Can you please suggest how one would filter based on expression level. Thank you

ADD REPLY • link 7.5 years ago by moldach ▴ 130