Hello!
I am trying to speed up the processing of some RNA-seq files by splitting the input reads into several files. I am comparing the results obtained with:
- the reads input as one file
- the input split into several files processed in parallel; I merge the SAM files after alignment.
I align with STAR then I assemble the transcriptome with Cufflinks.
On one sample (paired-end, around 2Gb per file), I get these FPKM differences for one gene (left value is the FPKM from the whole file, right value is from the split files):
Inpp4a|XM_006496019.3: 11.08, 9.37
Inpp4a|NM_030266.4: 5.11, 3.67
Inpp4a|NM_001374630.1: 1.06, 4.18
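As a quick sanity check on these numbers, here is a minimal Python sketch that computes the per-isoform differences and the summed FPKM per run. The values are copied from the list above; the dictionary layout and the `summarize` helper are my own naming for illustration, not Cufflinks output. The summed values come out almost equal (17.25 vs 17.22), so the discrepancy seems to lie in how the signal is distributed among the isoforms rather than in the gene-level total.

```python
# FPKM values copied from the list above: (whole-file run, split-file run).
# The dict layout and helper name are illustrative, not a Cufflinks format.
fpkm = {
    "XM_006496019.3": (11.08, 9.37),
    "NM_030266.4": (5.11, 3.67),
    "NM_001374630.1": (1.06, 4.18),
}


def summarize(values):
    """Return per-isoform differences (split - whole) and summed FPKM per run."""
    diffs = {
        isoform: round(split - whole, 2)
        for isoform, (whole, split) in values.items()
    }
    total_whole = round(sum(whole for whole, _ in values.values()), 2)
    total_split = round(sum(split for _, split in values.values()), 2)
    return diffs, total_whole, total_split


diffs, total_whole, total_split = summarize(fpkm)
print(diffs)                      # per-isoform change between the two runs
print(total_whole, total_split)   # summed FPKM over the three isoforms
```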
I used bamCompare from deepTools to compare the two samples over this gene (NC_000067.7, 37338000->37450000), and the difference (--operation subtract) is less than 0.05 across this region.
In your experience, would you consider these FPKM values different? I do, because Cufflinks provides an FPKM confidence interval and the second value falls outside it.
I would like help understanding which factors could cause this difference and what could be done to fix it.
Any leads or references are highly appreciated!
Thank you very much!