I ran HISAT2 on my reads after adapter removal and got a very high rate of discordant alignments (about 78% of pairs) with very few concordant alignments.
81987410 reads; of these:
  81987410 (100.00%) were paired; of these:
    81584433 (99.51%) aligned concordantly 0 times
    92014 (0.11%) aligned concordantly exactly 1 time
    310963 (0.38%) aligned concordantly >1 times
    ----
    81584433 pairs aligned concordantly 0 times; of these:
      64183926 (78.67%) aligned discordantly 1 time
    ----
    17400507 pairs aligned 0 times concordantly or discordantly; of these:
      34801014 mates make up the pairs; of these:
        3133661 (9.00%) aligned 0 times
        9518238 (27.35%) aligned exactly 1 time
        22149115 (63.65%) aligned >1 times
98.09% overall alignment rate
My reads are very short (about 50 bp). The data came from 2 lanes, so I concatenated the per-lane files. After concatenation, I used how_are_we_stranded_here, which inferred RF strandedness. I then ran HISAT2 with the following command, using an index I generated from the mouse genome (GRCm38):
# Sample names are taken from files named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
for sample in $(ls adaptor_trimmed/*.fastq.gz | rev | cut -d "/" -f 1 | cut -d "_" -f 2 | rev | sort | uniq)
do
    hisat2 -p 16 --rg-id="${sample}" --rg PL:ILLUMINA \
        -x /RNA_references/Mus_musculus.GRCm38.dna.primary_assembly_index --dta \
        --rna-strandness RF \
        -1 /adaptor_trimmed/"${sample}"_R1.fastq.gz \
        -2 /adaptor_trimmed/"${sample}"_R2.fastq.gz \
        -S ./"${sample}".sam
done
I have no idea why I am getting so many discordant alignments. Could it be that my reads are not RF? Should I try swapping R1 and R2? I would appreciate any insight or troubleshooting advice.
Thank you!
The simplest thing is to try FR on a small subset of reads (say, the first 800,000 lines of each FASTQ, i.e. 200,000 read pairs) to test whether this solves it. The high overall alignment rate suggests the data themselves are not problematic; more likely some option is skewing what counts as concordant given the strandedness expectation. Alternatively, run a tool to detect strandedness. If you search for "hisat2 stranded" you will find many suggestions in older threads.
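A minimal sketch of that subset test, assuming gzipped FASTQ files named as in the question. The helper names `subset_fastq` and `test_strandness`, the sample name `sample1`, and the output file names are made up for illustration; the index path is copied from the original command.

```shell
# Write the first N lines of a gzipped FASTQ to a new gzipped file.
# FASTQ uses 4 lines per read, so N must be a multiple of 4
# (default: 800000 lines = 200000 reads).
subset_fastq() {
    local in_fq=$1 out_fq=$2 n_lines=${3:-800000}
    if [ $((n_lines % 4)) -ne 0 ]; then
        echo "n_lines must be a multiple of 4" >&2
        return 1
    fi
    gzip -dc "$in_fq" | head -n "$n_lines" | gzip > "$out_fq"
}

# Align the subset with a given --rna-strandness value (FR or RF).
# Alignments are discarded; only the summary, which HISAT2 prints
# to stderr, is kept in summary_<strand>.txt.
test_strandness() {
    local index=$1 r1=$2 r2=$3 strand=$4
    hisat2 -p 4 -x "$index" --rna-strandness "$strand" \
        -1 "$r1" -2 "$r2" -S /dev/null 2> "summary_${strand}.txt"
}

# Example usage (sample1 is a hypothetical sample name):
# subset_fastq adaptor_trimmed/sample1_R1.fastq.gz sub_R1.fastq.gz
# subset_fastq adaptor_trimmed/sample1_R2.fastq.gz sub_R2.fastq.gz
# for s in FR RF; do
#     test_strandness /RNA_references/Mus_musculus.GRCm38.dna.primary_assembly_index \
#         sub_R1.fastq.gz sub_R2.fastq.gz "$s"
# done
```

Both mate files must be cut at the same line count so the pairs stay matched. Comparing summary_FR.txt with summary_RF.txt then shows whether the strandedness setting changes the concordant fraction at all.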
The folder name suggests the data have been trimmed. If R1 and R2 were trimmed independently, the two files can fall out of sync, which would explain the discordant alignments.
Please scan/trim paired-end data together. That should fix this problem.
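One quick way to check whether the two mate files are still in sync is to compare read names record by record. This is only a sketch: it assumes gzipped FASTQ files whose mate names are identical up to the first space or a trailing /1 or /2, and `read_names` and `check_pairs` are hypothetical helper names, not part of any tool.

```shell
# Print the read name (first line of every 4-line FASTQ record) of a
# gzipped FASTQ, dropping anything after the first whitespace and a
# trailing /1 or /2.
read_names() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { sub(/[ \t].*$/, ""); sub(/\/[12]$/, ""); print }'
}

# Compare read names of R1 and R2 record by record.
# Prints "pairs=<n> mismatched=<m>" and returns non-zero if m > 0.
check_pairs() {
    local names1 names2 rc=0
    names1=$(mktemp)
    names2=$(mktemp)
    read_names "$1" > "$names1"
    read_names "$2" > "$names2"
    # paste pads the shorter file with empty fields, so a differing
    # read count also shows up as mismatches.
    paste "$names1" "$names2" | awk -F '\t' '
        $1 != $2 { bad++ }
        END { printf "pairs=%d mismatched=%d\n", NR, bad; exit (bad > 0) }' || rc=$?
    rm -f "$names1" "$names2"
    return "$rc"
}

# Example usage (sample1 is a hypothetical sample name):
# check_pairs adaptor_trimmed/sample1_R1.fastq.gz adaptor_trimmed/sample1_R2.fastq.gz
```

A non-zero mismatch count would suggest that the mates are no longer in the same order in both files, for example after independent trimming or if the lanes were concatenated in a different order for R1 and R2.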
Hi ATpoint and GenoMax, thank you for your replies! I trimmed the paired-end files together; I have attached my code:
I also tried FR as you suggested, but this gave the same percentage of discordant alignments. Could this have to do with the way I wrote the loop? I'm kind of at a loss, so I would appreciate any guidance on troubleshooting.
Thank you