Question

hisat2 parameters for paired end and single end reads from same samples

7.8 years ago

bioinfo17 ▴ 30

Hi,

I have paired end R1.fastq, R2.fastq and singletons.fastq files for the same samples. What parameters should I use for aligning reads against a genome of interest?

1) hisat2 -1 R1.fastq -2 R2.fastq -U singletons.fastq

or just treat all the files as unpaired and use

2) hisat2 -U R1.fastq,R2.fastq,singletons.fastq

TopHat had options for using paired-end and single reads together, but I am unsure how hisat2 handles both paired-end and single reads from the same samples. I will be using the BAM files generated by hisat2 for downstream differential expression analysis with StringTie.

Should one ignore the singleton reads and use only the paired-end reads (only -1 and -2, no -U), or treat all reads as unpaired and use -U R1.fastq,R2.fastq,singletons.fastq?

How would stringtie calculate the counts of transcripts generated from hisat2 using different parameters as mentioned above?

(I looked at the count tables generated by 1 and 2 and got completely different results. For example, with method 1 I get high counts for certain transcripts in sampleX, while with method 2 I get no counts for the same transcripts in the same sample; in other samples the pattern is sometimes reversed.)

I found a similar post here but couldn't find a definitive conclusion: https://github.com/feltus/OSG-GEM/issues/10

Any advice will be appreciated, thanks.

hisat2 alignment RNA-Seq stringtie hisat • 9.0k views

Updated 7.8 years ago by Matteo Schiavinato ★ 3.7k • written 7.8 years ago by bioinfo17 ▴ 30

score 3 · Accepted Answer · 2017-09-14

What parameters should I use

Parameters are a very delicate matter. Rather than asking other people for "parameters", you should state what you want to achieve. If the reads come from the same organism as the genome you're mapping them against (I assume this is your case), then the default parameters are close to the best ones.

just treat all the files as unpaired

No, never miss the chance to map reads as paired if they are. The insert size plays a role here. You can estimate it by mapping a subset of reads and extracting the positive values of the TLEN field (column 9) from the output SAM file. Plot those values and see where the peak is. Then you can set -I and -X (the minimum and maximum accepted insert size) to define a range of, say, 200 bp around that peak. The program will try to map pairs at that distance first, even if they contain mismatches, which makes sense because that is their actual distance. If you treat them as unpaired, each read is mapped separately and you will get many misalignments (hence your weird expression results).
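The TLEN estimate described above could be done along these lines. This is only a rough sketch: the function names, the 50 bp bin width, and the optional `max_tlen` cutoff are my own choices for illustration, not anything hisat2 provides. The cutoff is there because, for spliced RNA-Seq alignments, TLEN spans introns and can be much larger than the real fragment.

```python
from collections import Counter
from statistics import median


def positive_tlens(sam_lines, max_tlen=None):
    """Yield positive TLEN values (SAM column 9) from an iterable of SAM lines.

    max_tlen is an optional, hypothetical cutoff for dropping very large
    values, e.g. pairs whose alignment spans a long intron.
    """
    for line in sam_lines:
        if line.startswith("@"):  # header line, no alignment fields
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        tlen = int(fields[8])
        if tlen <= 0:  # keep one value per pair; skip unpaired/unmapped (0)
            continue
        if max_tlen is not None and tlen > max_tlen:
            continue
        yield tlen


def summarize(tlens, bin_width=50):
    """Return count, median and a coarse histogram of the given TLEN values."""
    values = sorted(tlens)
    if not values:
        raise ValueError("no positive TLEN values found")
    bins = Counter((v // bin_width) * bin_width for v in values)
    peak_bin = max(bins, key=bins.get)  # start of the most populated bin
    return {
        "n": len(values),
        "median": median(values),
        "peak_bin_start": peak_bin,
        "histogram": dict(sorted(bins.items())),
    }


# Tiny made-up SAM excerpt, only to show the expected input shape.
EXAMPLE_SAM = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\tACGT\tIIII",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\tACGT\tIIII",
    "r2\t99\tchr1\t500\t60\t50M\t=\t710\t260\tACGT\tIIII",
    "r2\t147\tchr1\t710\t60\t50M\t=\t500\t-260\tACGT\tIIII",
    "r3\t0\tchr1\t900\t60\t50M\t*\t0\t0\tACGT\tIIII",
]

if __name__ == "__main__":
    print(summarize(positive_tlens(EXAMPLE_SAM)))
```

On a real run you would pass an open SAM file (e.g. `with open("subset.sam") as fh: summarize(positive_tlens(fh))`, where `subset.sam` is a placeholder name) and pick -I and -X around the peak bin.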

Should one ignore the singleton reads

Nope, that's biological information. It makes your analysis harder, but a BAM file containing both singletons and paired-end reads is more biologically informative than one containing only the pairs (remember that singletons arise from quality trimming, when one mate is discarded, not because their mate failed to map).
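A possible way to keep both kinds of reads in one run is sketched below, assuming your hisat2 version accepts -1/-2 together with -U in the same command (check `hisat2 --help` before relying on it). The helper `build_hisat2_command` is hypothetical, `genome_index` and `sampleX.sam` are placeholder names, and the 150/350 values are illustrative only; use whatever your own insert size estimate suggests. The flags themselves (-x, -1, -2, -U, -I, -X, -S) are standard hisat2 options.

```python
import shlex


def build_hisat2_command(index, r1, r2, singletons=None,
                         min_insert=None, max_insert=None,
                         out_sam="aligned.sam"):
    """Build (but do not run) a hisat2 argument list for mixed input.

    Hypothetical helper: it only assembles the command so it can be
    inspected or passed to subprocess.run().
    """
    cmd = ["hisat2", "-x", index, "-1", r1, "-2", r2]
    if singletons:
        cmd += ["-U", singletons]       # unpaired reads, mapped alongside the pairs
    if min_insert is not None:
        cmd += ["-I", str(min_insert)]  # minimum fragment length for valid pairs
    if max_insert is not None:
        cmd += ["-X", str(max_insert)]  # maximum fragment length for valid pairs
    cmd += ["-S", out_sam]              # SAM output file
    return cmd


if __name__ == "__main__":
    cmd = build_hisat2_command(
        "genome_index", "R1.fastq", "R2.fastq",
        singletons="singletons.fastq",
        min_insert=150, max_insert=350,  # illustrative values only
        out_sam="sampleX.sam",
    )
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```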

How would stringtie calculate the counts of transcripts

StringTie counts the number of reads covering each position and then (if I'm not wrong) normalizes this number according to the number of isoforms the gene has, redistributing counts among the isoforms depending on the evidence for each. For the details, go through their manual, where it is clearly stated.