Hi all,
I am familiar with the fact that paired-end reads are used to generate FPKM, and single-end reads are used to make RPKM. I recently received a sample report from a third-party RNA-seq service which provided me with FPKM-normalized read counts for each transcript and sample in a spreadsheet.
That was fine until I examined the .fastQ files they gave me and found the following format for each read:
@XX100011323L1C001R014_28
GAAAAACTCAAATCGCCTCTAAGAAAAGACGAAGTCGAAGAAAGAGACAA
+
eeeeeeeeeeeeeeeeee\eeeeeeeeedeeeeeefeeZeeeeeeeefZc
Given that there is no /1 or /2 at the end of the @ identifier, and further that all those @IDs in the .fastQ file contain an "R" (suggesting reverse?), I am wondering how in the hell they generated FPKM-- and most importantly whether or not these people have just given me the runaround. Is this an interleaved .fastQ?
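One way to check the pairing question directly is to scan the read IDs for the usual mate markers and to test whether consecutive records share a name. Below is a minimal Python sketch, assuming an uncompressed four-line-per-record FASTQ. The function names are made up for illustration, and the only conventions checked are a trailing /1 or /2 and the Casava 1.8+ style comment ("1:N:0:..."). Other naming schemes would need their own rule.

```python
import re
from itertools import islice


def read_headers(path, limit=10000):
    """Yield up to `limit` read IDs (without '@') from an uncompressed FASTQ."""
    with open(path) as handle:
        # In a four-line-per-record FASTQ, every 4th line is a header.
        for line in islice(handle, 0, limit * 4, 4):
            yield line.rstrip("\n")[1:]


def classify_header(header):
    """Return '1', '2', or 'none' depending on which mate marker is present."""
    name, _, comment = header.partition(" ")
    if name.endswith("/1"):
        return "1"
    if name.endswith("/2"):
        return "2"
    # Casava 1.8+ style comment, e.g. "1:N:0:ACGT"
    match = re.match(r"([12]):[YN]:", comment)
    return match.group(1) if match else "none"


def base_name(header):
    """Strip the comment and any trailing /1 or /2 from a read ID."""
    name = header.split(" ", 1)[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def looks_interleaved(headers):
    """True if records come in consecutive pairs that share a base name."""
    headers = list(headers)
    if len(headers) < 2 or len(headers) % 2:
        return False
    return all(
        base_name(first) == base_name(second)
        for first, second in zip(headers[0::2], headers[1::2])
    )


# The read ID from the post carries no mate marker.
print(classify_header("XX100011323L1C001R014_28"))
```

Running `looks_interleaved(read_headers("sample.fq"))` on a real file (the file name here is a placeholder) would return False whenever neighbouring records have different names, which is what a plain single-end file looks like.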
In their report they provided the parameters they would use for mapping both PE and SE reads in Bowtie2 and HISAT, which seems odd, since they only seem to provide FPKM to everyone. Here are the arguments:
Bowtie2 parameters for PE reads: -q --phred64 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -I 1 -X 1000 --no-mixed --no-discordant -p 16 -k 200
Bowtie2 parameters for SE reads: -q --phred64 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -p 16 -k 200
HISAT parameters for PE reads: -p 8 --phred64 --sensitive --no-discordant --no-mixed -I 1 -X 1000
HISAT parameters for SE reads: -p 8 --phred64 --sensitive -I 1 -X 1000
After that, they said they used RSEM to calculate FPKM.
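The --phred64 flag in those parameters can be sanity-checked against the quality line of the example read. The sketch below is a rough heuristic, not a definitive detector. It assumes Phred+33 data contains at least some characters below ';' (ASCII 59) and that characters above 'J' (ASCII 74) only appear in Phred+64 data. The function name is hypothetical.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from raw quality strings.

    Returns None when the observed range fits both encodings.
    """
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters this low cannot occur in a +64 encoding.
        return 33
    if highest > 74:
        # Characters this high are not expected in a +33 encoding.
        return 64
    return None


# Quality line copied from the read shown in the post (raw string keeps the backslash).
example_quality = r"eeeeeeeeeeeeeeeeee\eeeeeeeeedeeeeeefeeZeeeeeeeefZc"
print(guess_phred_offset([example_quality]))  # prints 64
```

The lowest character in that line is 'Z' (ASCII 90), so on this evidence the reads do look Phred+64 encoded. A single read is a thin basis though, and it would be worth feeding in a few thousand quality lines.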
Then, laughably, they said,
The FPKM method is able to eliminate the influence of different gene length and sequencing discrepancy on the calculation of gene expression. Therefore, the calculated gene expression can be directly used for comparing the difference of gene expression among samples.
Which I think we all know isn't quite true unless you also apply something like a trimmed mean of M-values (TMM) adjustment, because the total FPKM per sample is always a little different.
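The point about per-sample FPKM totals can be shown with a toy calculation. The numbers below are invented for illustration (two genes, two samples with the same sequencing depth). It uses the usual definition, FPKM = fragments * 1e9 / (gene length in bp * total mapped fragments).

```python
def fpkm(count, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return count * 1e9 / (length_bp * total_mapped)


def fpkm_table(counts, lengths):
    """Compute FPKM for every gene in one sample."""
    total = sum(counts.values())
    return {gene: fpkm(counts[gene], lengths[gene], total) for gene in counts}


# Hypothetical gene lengths and fragment counts; both samples have 200 fragments.
lengths = {"gene1": 1000, "gene2": 2000}
sample_a = fpkm_table({"gene1": 100, "gene2": 100}, lengths)
sample_b = fpkm_table({"gene1": 150, "gene2": 50}, lengths)

total_a = sum(sample_a.values())
total_b = sum(sample_b.values())
print(total_a, total_b)
```

Even with identical depth, the two totals differ because the sum depends on how fragments are distributed across genes of different lengths. An FPKM value therefore represents a different share of the total in each sample.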
If anyone can give me some insight on this one, I'd be much obliged.
For single-end reads, RPKM==FPKM...
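To make that concrete, here is a small sketch with made-up alignment records of the form (fragment ID, mate number, gene). Counting reads and counting fragments give the same numbers when every fragment contributes exactly one read. They diverge only when some fragments contribute two mapped mates and others one. The function names and data are illustrative only.

```python
from collections import defaultdict


def count_reads(alignments):
    """Count every aligned read, so mates of one pair are counted separately."""
    counts = defaultdict(int)
    for _fragment_id, _mate, gene in alignments:
        counts[gene] += 1
    return dict(counts)


def count_fragments(alignments):
    """Count each fragment once per gene, however many of its mates aligned."""
    seen = defaultdict(set)
    for fragment_id, _mate, gene in alignments:
        seen[gene].add(fragment_id)
    return {gene: len(ids) for gene, ids in seen.items()}


def per_kb_per_million(count, length_bp, total_mapped):
    """Shared formula behind RPKM (read counts) and FPKM (fragment counts)."""
    return count * 1e9 / (length_bp * total_mapped)


# Single-end: one read per fragment.
single_end = [("f1", 1, "geneA"), ("f2", 1, "geneA"), ("f3", 1, "geneB")]
# Paired-end: fragment f2 has only one mapped mate.
paired_end = [
    ("f1", 1, "geneA"), ("f1", 2, "geneA"),
    ("f2", 1, "geneA"),
    ("f3", 1, "geneB"), ("f3", 2, "geneB"),
]

print(count_reads(single_end) == count_fragments(single_end))
print(count_reads(paired_end), count_fragments(paired_end))
```

Since the formula is the same and only the counting unit changes, identical counts mean identical RPKM and FPKM values for single-end data.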
Did you contract both sequencing and bioinformatics? Have you been given the raw sequencing data? I am guessing they used some software to process the raw reads (adapter and quality trimming, etc), which resulted in PE and SE files. Did you check all fastq files, at least 5 lines from each? The file names should follow some sort of sane naming convention.
My PIs contracted both sequencing and bioinformatics. Yes, according to the documentation they removed adapters and quality trimmed (not sure what software was used).
How would this create SE and PE files? I also checked 10,000 reads in the fastQ and found them all named with the same convention. No F or RF or FR or /1 or /2 or anything that could feasibly be interpreted as paired end.
Based on everyone's answers here, I'm guessing they took SE aligned reads and did "FPKM" calculations using RSEM, which I guess are mathematically indistinguishable from RPKM in this case?
If one of the reads in a pair gets removed during trimming that would leave a single-end read in the other file. For the sake of sanity, I generally prefer to discard both reads when that happens. It keeps the PE reads in proper order in R1/R2 files.
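A minimal sketch of that "discard both" policy, assuming the trimmed reads are already loaded as (name, sequence, quality) tuples and that mates share a name apart from an optional /1 or /2 suffix. The function names are hypothetical, and real trimmers usually offer an equivalent option of their own.

```python
def base_name(name):
    """Strip a trailing /1 or /2 so both mates map to the same key."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def resync_pairs(r1_records, r2_records):
    """Keep only reads whose mate survived trimming; return orphans separately.

    Input order is preserved, so the two kept lists stay in matching order
    as long as the original R1/R2 files were in matching order.
    """
    r1_names = {base_name(name) for name, _seq, _qual in r1_records}
    r2_names = {base_name(name) for name, _seq, _qual in r2_records}
    shared = r1_names & r2_names

    kept_r1 = [rec for rec in r1_records if base_name(rec[0]) in shared]
    kept_r2 = [rec for rec in r2_records if base_name(rec[0]) in shared]
    orphans = [
        rec for rec in r1_records + r2_records if base_name(rec[0]) not in shared
    ]
    return kept_r1, kept_r2, orphans


# Toy data: the mate of "frag2" was removed from R2 during trimming.
r1 = [("frag1/1", "ACGT", "IIII"), ("frag2/1", "GGCC", "IIII"), ("frag3/1", "TTAA", "IIII")]
r2 = [("frag1/2", "TGCA", "IIII"), ("frag3/2", "AATT", "IIII")]

kept_r1, kept_r2, orphans = resync_pairs(r1, r2)
print([rec[0] for rec in kept_r1], [rec[0] for rec in kept_r2], [rec[0] for rec in orphans])
```

Writing the orphans to a separate file instead of dropping them is how a trimming step can end up producing both PE and SE files.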
Someone is not providing you all the information you need to do a proper job, so you should:
1) get the raw sequencing data
2) get a complete description of the analyses performed - software and commands used, if possible
Either your PI or the center should have those.
Well, I hope they do a decent job sequencing, because their bioinformatics service looks doubtful. It seems you are better off getting access to the rawest data you can find and doing things yourself.
For paired-end reads, FPKM is more appropriate than RPKM because the two mates of a pair come from one fragment and are counted once. For single-end reads, however, FPKM and RPKM give the same values.