Question

Bowtie2/HISAT using single-end to generate FPKM???

2

7.6 years ago

jamieson.pierce ▴ 30

Hi all,

I am familiar with the fact that paired-end reads are used to generate FPKM, and single-end reads are used to make RPKM. I recently received a sample report from a third-party RNA-seq service which provided me with FPKM-normalized read counts for each transcript and sample in a spreadsheet.

That was fine until I examined the .fastQ files they gave me and found the following format for each read:

@XX100011323L1C001R014_28
GAAAAACTCAAATCGCCTCTAAGAAAAGACGAAGTCGAAGAAAGAGACAA
+
eeeeeeeeeeeeeeeeee\eeeeeeeeedeeeeeefeeZeeeeeeeefZc

Given that there is no /1 or /2 at the end of the @ identifier, and further that all the @IDs in the .fastQ file contain an "R" (suggesting reverse?), I am wondering how in the hell they generated FPKM, and most importantly whether or not these people have just given me the runaround. Is this an interleaved .fastQ?

In their report they provided the parameters they would use for mapping both PE and SE reads in Bowtie2 and HISAT, which seems odd, since they only seem to provide FPKM to everyone. Here are the arguments:

Bowtie2 parameters for PE reads: -q --phred64 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -I 1 -X 1000 --no-mixed --no-discordant -p 16 -k 200
Bowtie2 parameters for SE reads: -q --phred64 --sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1 -p 16 -k 200
HISAT parameters for PE reads: -p 8 --phred64 --sensitive --no-discordant --no-mixed -I 1 -X 1000
HISAT parameters for SE reads: -p 8 --phred64 --sensitive -I 1 -X 1000

After that, they said they used RSEM to calculate FPKM.

Then, laughably, they said,

The FPKM method is able to eliminate the influence of different gene length and sequencing discrepancy on the calculation of gene expression. Therefore, the calculated gene expression can be directly used for comparing the difference of gene expression among samples.

Which I think we all know isn't quite true unless you're using a trimmed mean of M-values (TMM) adjustment anyway, because total FPKM per sample is always a little different.
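To make the point about per-sample totals concrete, here is a minimal sketch using the textbook FPKM definition on made-up numbers. The gene lengths and counts are hypothetical, and this is not how RSEM arrives at its estimates; it only shows that two samples with identical library sizes can end up with different FPKM totals, whereas TPM is rescaled to a fixed total by construction.

```python
def fpkm(counts, lengths_bp):
    """FPKM = fragments * 1e9 / (total mapped fragments * gene length in bp)."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM: length-normalised rates rescaled so each sample sums to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


# Hypothetical toy data: three genes, two samples, same library size (1000).
lengths = [1000, 2000, 4000]
sample_a = [100, 200, 700]
sample_b = [700, 200, 100]

fpkm_a = fpkm(sample_a, lengths)
fpkm_b = fpkm(sample_b, lengths)

# The FPKM totals differ between the samples; the TPM totals do not.
print("FPKM totals:", sum(fpkm_a), sum(fpkm_b))
print("TPM totals: ", sum(tpm(sample_a, lengths)), sum(tpm(sample_b, lengths)))
```

A fixed total does not by itself correct for composition differences between samples, which is what a TMM-style adjustment is meant to address.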

If anyone can give me some insight on this one, I'd be much obliged.

RNA-Seq Bowtie2 HISAT FPKM RPKM • 3.5k views

ADD COMMENT • link updated 7.6 years ago by Brian Bushnell 20k • written 7.6 years ago by jamieson.pierce ▴ 30

2

For single-end reads, RPKM==FPKM...

ADD REPLY • link 7.6 years ago by Devon Ryan 105k

0

Did you contract both sequencing and bioinformatics? Have you been given the raw sequencing data? I am guessing they used some software to process the raw reads (adapter and quality trimming, etc), which resulted in PE and SE files. Did you check all fastq files, at least 5 lines from each? The file names should follow some sort of sane naming convention.

ADD REPLY • link 7.6 years ago by h.mon 35k

0

My PIs contracted both seq and bioinf. Yes, according to the documentation they removed adapters and quality trimmed (not sure what software was used).

How would this create SE and PE files? I also checked 10,000 reads in the fastQ and found them all named with the same convention. No F or RF or FR or /1 or /2 or anything that could feasibly be interpreted as paired end.

Based on everyone's answers here, I'm guessing they took SE aligned reads and did "FPKM" calculations using RSEM, which I guess are mathematically indistinguishable from RPKM in this case?

ADD REPLY • link 7.6 years ago by jamieson.pierce ▴ 30

0

How would this create SE and PE files?

If one of the reads in a pair gets removed during trimming that would leave a single-end read in the other file. For the sake of sanity, I generally prefer to discard both reads when that happens. It keeps the PE reads in proper order in R1/R2 files.
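As a rough illustration of the "discard both reads" approach, here is a minimal Python sketch. It assumes two already-trimmed FASTQ files with records in the same order, and it uses a hypothetical minimum-length cutoff (`min_len`) as the filter; real trimmers have their own options for this, so treat it as a sketch of the idea rather than a replacement for them.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (name, seq, plus, qual) tuples from a 4-line-per-record FASTQ.

    `handle` must be an iterator over lines, e.g. an open file object.
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def filter_pairs(r1_records, r2_records, min_len=20):
    """Keep a pair only if BOTH mates pass the filter; otherwise drop both.

    min_len is a hypothetical threshold chosen for illustration.
    """
    for rec1, rec2 in zip(r1_records, r2_records):
        if len(rec1[1]) >= min_len and len(rec2[1]) >= min_len:
            yield rec1, rec2


# Toy in-memory example: the second pair loses one mate to the length filter,
# so both of its reads are dropped and R1/R2 stay in sync.
r1 = [("@a/1", "A" * 30, "+", "I" * 30), ("@b/1", "A" * 10, "+", "I" * 10)]
r2 = [("@a/2", "C" * 30, "+", "I" * 30), ("@b/2", "C" * 30, "+", "I" * 30)]
kept = list(filter_pairs(r1, r2))
print([pair[0][0] for pair in kept])
```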

ADD REPLY • link 7.6 years ago by GenoMax 148k

0

Someone is not providing you all the information you need to do a proper job, so you should:

1) get the raw sequencing data
2) get a complete description of the analyses performed - software and commands used, if possible

Either your PI or the center should have those.

ADD REPLY • link 7.6 years ago by h.mon 35k

0

Well, I hope they do a decent job sequencing because their bioinformatics service looks doubtful. It seems you are better off getting access to the rawest data you can find and doing things yourself.

ADD REPLY • link 7.6 years ago by WouterDeCoster 47k

0

For paired-end reads, FPKM is preferable to RPKM because the two mates of a pair are counted once, as a single fragment. For single-end reads, however, FPKM gives the same result as RPKM.

ADD REPLY • link 7.6 years ago by Ben ▴ 60

score 5 · Answer 1 · 2017-05-26

FPKM does not require paired-end reads. For single-ended reads, it makes the calculation much simpler since you never have the problem of the two ends mapping to different genes :)
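A small sketch of why that is, using the textbook definition rather than RSEM's model-based estimation. The `(template_name, mate)` tuples below are a hypothetical stand-in for parsed alignments: a fragment is counted once per template name, so for single-end data every read is its own fragment and the fragment count equals the read count.

```python
def count_fragments(alignments):
    """Count distinct fragments: mates sharing a template name count once."""
    return len({name for name, _mate in alignments})


def fpkm(fragments, gene_length_bp, total_fragments):
    """Textbook FPKM: fragments per kilobase of gene per million fragments."""
    return fragments * 1e9 / (gene_length_bp * total_fragments)


# Hypothetical alignments to one gene, as (template_name, mate) tuples.
se = [("r1", 0), ("r2", 0), ("r3", 0)]   # single-end: 3 reads
pe = [("p1", 1), ("p1", 2), ("p2", 1)]   # paired-end: 3 reads, p2 lost a mate

print(count_fragments(se), len(se))  # fragments == reads for single-end
print(count_fragments(pe), len(pe))  # fewer fragments than reads for paired-end

# With made-up totals: 500 fragments on a 2 kb gene, 10 million mapped.
print(fpkm(500, 2000, 10_000_000))
```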

It would be helpful if you could post the first... 16 lines of the file rather than the first 4 lines, to see if maybe the names are the same, or something, which might indicate the reads are interleaved. But if I were you, I would demand the raw data from the 3rd-party source rather than the weird renamed stuff they gave you.
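If you want to check this yourself, something along these lines would do it. This is a minimal sketch that assumes a plain 4-line-per-record FASTQ already read into a list of lines; the function names are made up for illustration, and matching names are only a hint of interleaving, not proof.

```python
def mate_stripped(name):
    """Drop anything after the first whitespace and a trailing /1 or /2."""
    parts = name.split()
    base = parts[0] if parts else ""
    if base.endswith(("/1", "/2")):
        base = base[:-2]
    return base


def looks_interleaved(fastq_lines, n_records=4):
    """Return True if consecutive records pair up by name (1&2, 3&4, ...).

    Only the first n_records records (4 records = 16 lines) are inspected.
    """
    # Header lines sit at positions 0, 4, 8, ... of a 4-line-per-record file.
    names = [line.strip() for line in fastq_lines[: 4 * n_records : 4]]
    if len(names) < 2 or len(names) % 2:
        return False
    return all(
        mate_stripped(a) == mate_stripped(b)
        for a, b in zip(names[0::2], names[1::2])
    )


# Toy example with hypothetical read names.
interleaved = []
for name in ["@r1/1", "@r1/2", "@r2/1", "@r2/2"]:
    interleaved += [name, "ACGT", "+", "IIII"]

single_end = []
for name in ["@r1", "@r2", "@r3", "@r4"]:
    single_end += [name, "ACGT", "+", "IIII"]

print(looks_interleaved(interleaved), looks_interleaved(single_end))
```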

And yes, their comment is completely wrong, which makes me wonder how competent they are. There is interplay between library insert size and gene length that affects relative gene coverage and is not modeled by FPKM.