5.9 years ago
zizigolu
I am combining the raw read counts of two datasets by averaging them. The gene CHGA is differentially expressed in each dataset separately, but not when I merge the datasets by averaging the raw counts.
My question is: should I average the raw read counts, or average normalized counts (for example CPM) and then convert back to raw counts before differential expression analysis?
> merged[rownames(merged)=="CHGA",]
A2 A3 A4 A6 A7 A8 A9 A10 A11 A12 B4 B5 B6 B7 B8 B9 B10 B11 B12 C1 C2 C3 C4 C5 C6 G12 D1 D2 D3 D4
CHGA 20 3309 223 297 92 2072 147 2042 59 5356 92 899 180 16 22 67 212 80 72 270 36 198 110 170 202 52 53 32 1630 784
D5 D6 D7 D8 D9 D10 D11 D12 E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 F1 F2 F5 F6 F7 G8
CHGA 555 434 292 163 1144 2090 73 1300 277 270 89 1037 5888 8484 70 152 32942 20 19 206 978 76 4080 1318 202 70
> biomarker[rownames(biomarker)=="CHGA",]
A2 A3 A4 A6 A7 A8 A9 A10 A11 A12 B4 B5 B6 B7 B8 B9 B10 B11 B12 C1 C2 C3 C4 C5 C6 G12 D1 D2 D3 D4 D5
CHGA 17 3366 70 530 30 1833 57 1431 62 32 146 144 320 16 33 109 340 111 116 516 53 202 4 51 397 79 65 30 681 780 981
D6 D7 D8 D9 D10 D11 D12 E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 F1 F2 F5 F6 F7 G8
CHGA 816 529 120 1560 3167 37 1327 131 152 52 47 3080 16924 82 133 42006 10 8 258 811 22 147 2551 160 32
> immune[rownames(immune)=="CHGA",]
A2 A3 A4 A6 A7 A8 A9 A10 A11 A12 B4 B5 B6 B7 B8 B9 B10 B11 B12 C1 C2 C3 C4 C5 C6 G12 D1 D2 D3 D4
CHGA 23 3252 376 64 154 2311 237 2653 56 10681 37 1654 40 16 10 25 83 48 27 25 20 193 215 290 7 26 41 35 2579 787
D5 D6 D7 D8 D9 D10 D11 D12 E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 F1 F2 F5 F6 F7 G8
CHGA 129 52 54 206 729 1012 109 1272 423 387 126 2027 8696 43 58 172 23878 31 30 155 1144 130 8014 85 244 107
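To make the raw-versus-CPM question concrete, here is a minimal base R sketch. The CHGA counts for A2, A3 and A4 are taken from the output above; the library sizes are hypothetical, since the thread does not report them.

```r
# CHGA counts for three samples, copied from the output above.
biomarker_chga <- c(A2 = 17, A3 = 3366, A4 = 70)
immune_chga    <- c(A2 = 23, A3 = 3252, A4 = 376)

# Plain average of raw counts; this reproduces the 'merged' row (20, 3309, 223).
raw_mean <- (biomarker_chga + immune_chga) / 2

# HYPOTHETICAL library sizes (total reads per sample), not given in the thread.
lib_biomarker <- c(A2 = 2e6, A3 = 2e6, A4 = 2e6)
lib_immune    <- c(A2 = 4e6, A3 = 4e6, A4 = 4e6)

# Counts per million: scale each count by its own library size.
cpm <- function(counts, lib_size) counts / lib_size * 1e6

cpm_biomarker <- cpm(biomarker_chga, lib_biomarker)
cpm_immune    <- cpm(immune_chga, lib_immune)
cpm_mean      <- (cpm_biomarker + cpm_immune) / 2

print(rbind(raw_mean, cpm_biomarker, cpm_immune, cpm_mean))
```

If the two datasets differ in sequencing depth, the raw average is weighted toward the deeper one, while the CPM average is not. Neither result is a true count, though, and DESeq2 and edgeR are designed for unnormalized counts, so neither average is a natural input for them.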
No, this does not seem like a correct approach. If possible, combine both datasets in a single design that includes a batch factor. The limma and edgeR manuals explain how to set up a design for this.
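A design with a batch factor might look roughly like the base R sketch below. The sample IDs come from the thread, but the `condition` labels are invented for illustration, because the thread never states the groups being compared.

```r
# Hypothetical sample sheet: four samples, each present in both datasets.
# The 'condition' labels are placeholders, not taken from the thread.
samples <- data.frame(
  sample    = rep(c("A2", "A3", "A4", "A6"), times = 2),
  batch     = factor(rep(c("biomarker", "immune"), each = 4)),
  condition = factor(rep(c("control", "treated", "control", "treated"), times = 2))
)

# Additive design: the batch column absorbs dataset-level shifts,
# the condition column carries the effect of interest.
design <- model.matrix(~ batch + condition, data = samples)
print(design)
```

A matrix like this is what limma's `lmFit` or edgeR's `glmQLFit` take as their design; in DESeq2 the same idea is written as `design = ~ batch + condition`. Because the same sample IDs appear in both datasets here, the two columns per sample are repeated measurements rather than independent replicates, which may need separate handling (for example blocking on sample with limma's `duplicateCorrelation`).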
Thank you, I learned how to do batch correction, but the problem is that only 700 genes are common to both datasets. I am confused: should I do batch correction for the 700 common genes only, or for the whole of both datasets?
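A batch term can only be estimated for genes measured in both datasets, so one reading of the suggestion is to restrict the combined matrix to the shared genes and keep the columns of each dataset separate instead of averaging them. A minimal sketch with toy matrices (gene names other than CHGA are placeholders):

```r
# Toy count matrices standing in for the two datasets.
biomarker <- matrix(c(17, 5, 9, 3366, 8, 2), nrow = 3,
                    dimnames = list(c("CHGA", "GENE_X", "GENE_Y"), c("A2", "A3")))
immune <- matrix(c(23, 4, 7, 3252, 1, 6), nrow = 3,
                 dimnames = list(c("CHGA", "GENE_Y", "GENE_Z"), c("A2", "A3")))

# Genes present in both datasets.
common <- intersect(rownames(biomarker), rownames(immune))

# Tag columns by dataset so the same sample ID stays distinguishable.
colnames(biomarker) <- paste0(colnames(biomarker), "_biomarker")
colnames(immune)    <- paste0(colnames(immune), "_immune")

# Stack columns side by side for the shared genes only; no averaging.
combined <- cbind(biomarker[common, , drop = FALSE],
                  immune[common, , drop = FALSE])
print(combined)
```

Genes found in only one dataset carry no information about the batch effect and would have to be analyzed within their own dataset.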
Try to solve the problem at the source: find the raw data, such as FASTQ or BAM files, and then generate the raw read counts for the same set of genes.
This is an HTG EdgeSeq assay; I was given Excel files of raw read counts for both datasets :(
I ran a t-test on the 700 common genes between the two datasets and removed the genes that were inconsistent between them (p-value < 0.05). Of the 700 genes, 400 showed consistent expression. For these I took the average of the raw read counts from both datasets, added them to the genes that are not shared, and built a single matrix of raw counts. However, the differentially expressed genes (DEGs) reported by DESeq2 changed a lot compared with those from each dataset analyzed individually.
Sorry to say, but this approach is certainly not the correct way to analyze RNA-seq data. My advice, as I said before, is to start over with the raw data: ask for the raw data instead of Excel files.
@b.nota: This is not standard RNAseq data.
@F: What did HTG support say about downstream analysis of the data?
Thank you,
This is exactly the company's reply to my email
I have installed the HTG EdgeSeq parser software on my computer, and I have FASTQ files for each sample, but I don't know why there are 4 FASTQ files per sample. The technician says that importing the FASTQ files into the software will return an Excel file of raw read counts, but I am not sure how to handle the FASTQ files to combine reads from the common genes.
Thank you, I will go through your advice, and I know I will need to create some posts on Biostars about that :(