Hi. I'm trying to map paired-end RNA-seq reads to GRCm38 (mm10) using HISAT2 and TopHat2, but the mapping rate is only about 0-5%.
(HiSeq 2500; the sequenced fragment size is 300 bp)
1.fastqc
1) fastqc summary
PASS Basic Statistics
PASS Per base sequence quality
PASS Per tile sequence quality
PASS Per sequence quality scores
FAIL Per base sequence content (looks like this image: https://rtsf.natsci.msu.edu/_rtsf/assets/Image/fastqc_images/TruSeqRNAPerBaseSeqContent.png)
PASS Per sequence GC content
PASS Per base N content
PASS Sequence Length Distribution
FAIL Sequence Duplication Levels
PASS Overrepresented sequences
PASS Adapter Content
2) read information
Measure Value
Filename sample_1.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 44728504
Sequences flagged as poor quality 0
Sequence length 101
%GC 50
2.Hisat
1) command
$AnacondaBin/hisat2 \
-p 8 \
--rg-id=sample \
--rg SM:sample --rg LB:LB --rg PL:Illumina --rg PU:sample \
-x $Reference_dir/Mus_musculus/NCBI/hisatIndex/GRCm38 \
--dta \
--rna-strandness FR \
-1 $Fastq_dir/sample_1.fastq.gz \
-2 $Fastq_dir/sample_2.fastq.gz \
-S $Working_dir/Analysis/$Analysis_dir/NCBI/Pre_Tophat/sample_pe.sam 2
2) Result
44728504 reads; of these:
44728504 (100.00%) were paired; of these:
44358669 (99.17%) aligned concordantly 0 times
331704 (0.74%) aligned concordantly exactly 1 time
38131 (0.09%) aligned concordantly >1 times
----
44358669 pairs aligned concordantly 0 times; of these:
11328 (0.03%) aligned discordantly 1 time
----
44347341 pairs aligned 0 times concordantly or discordantly; of these:
88694682 mates make up the pairs; of these:
87830960 (99.03%) aligned 0 times
735195 (0.83%) aligned exactly 1 time
128527 (0.14%) aligned >1 times
1.82% overall alignment rate
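For reference, the 1.82% figure can be reproduced from the counts in the summary: every pair that aligned concordantly or discordantly contributes two mates, and the mates that aligned on their own are added on top. A minimal sketch of that arithmetic (the variable names are only illustrative; the counts are copied from the output above):

```shell
# Counts copied from the HISAT2 summary above.
total_pairs=44728504
conc_once=331704      # pairs aligned concordantly exactly 1 time
conc_multi=38131      # pairs aligned concordantly >1 times
disc_once=11328       # pairs aligned discordantly 1 time
mate_once=735195      # single mates aligned exactly 1 time
mate_multi=128527     # single mates aligned >1 times

# Aligned mates = 2 * aligned pairs + individually aligned mates,
# divided by the total number of mates (2 * total pairs).
overall_rate=$(awk -v t="$total_pairs" -v c1="$conc_once" -v cm="$conc_multi" \
    -v d="$disc_once" -v m1="$mate_once" -v mm="$mate_multi" \
    'BEGIN { printf "%.2f", 100 * (2 * (c1 + cm + d) + m1 + mm) / (2 * t) }')

echo "overall alignment rate: ${overall_rate}%"
```

So the number is internally consistent, and nearly every mate (99.03%) really failed to align anywhere on the reference.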
3.Tophat
1) command
# The GTF annotation and the Bowtie2 index are both from https://ccb.jhu.edu/software/tophat/igenomes.shtml
$AnacondaBin/tophat2 \
--GTF $Reference_dir/Mus_musculus/UCSC/mm10/Annotation/Archives/archive-2015-07-17-14-33-26/Genes/genes.gtf \
--output-dir $Working_dir/Analysis/$Analysis_dir/Tophat \
--num-threads 1 \
$Reference_dir/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome \
$Fastq_dir/sample_1.fastq.gz \
$Fastq_dir/sample_2.fastq.gz
2) result
Left reads:
Input : 44728504
Mapped : 355987 ( 0.8% of input)
of these: 7756 ( 2.2%) have multiple alignments (0 have >20)
Right reads:
Input : 44728504
Mapped : 347193 ( 0.8% of input)
of these: 7342 ( 2.1%) have multiple alignments (0 have >20)
0.8% overall read mapping rate.
Aligned pairs: 159136
of these: 1209 ( 0.8%) have multiple alignments
218 ( 0.1%) are discordant alignments
0.4% concordant pair alignment rate.
- Other attempts
1) Trimmed the first 10 bp from the read 1 and read 2 FASTQ files.
--> But the alignment rate was still extremely low.
2) I have seen this comment:
Reference species is different
I have the same problem! Have you downloaded the index from HISAT2? I did, and even with mm9 I get the same alignment rate. I am using public NGS data :( which is supposed to be mouse!...
Did you check your data source? I checked my data and found that it wasn't mouse sequence (following Carlo Yague's comment).
After I mapped my data to the human reference, I got a 95% mapping rate. I built the HISAT2 index with the following pipeline.
Download Reference genome
https://ccb.jhu.edu/software/tophat/igenomes.shtml
Build hisat2 index
echo "2-1. Build Hisat2 index (Default Options)"
$AnacondaBin/hisat2-build \
$Reference_dir/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/hisat2_index/mm10_genome.fa \
mm10_genome
Yes, of course. I have mapped the data to some related genomes, including human. In the end I will write to the corresponding author :).
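One quick way to check the data source, as suggested above, is to pull out a handful of reads and BLAST them (for example on the NCBI BLAST web page) to see which organism they hit. A minimal sketch, assuming gzipped FASTQ files with standard four-line records; the function name `fastq_head_to_fasta` and the output file name are made up for illustration:

```shell
# fastq_head_to_fasta FILE N
# Print the first N reads of a gzipped FASTQ file in FASTA format.
fastq_head_to_fasta() {
    gzip -cd "$1" | awk -v n="$2" '
        # Line 1 of each 4-line record is the header: stop after N reads,
        # otherwise swap the leading "@" for ">".
        NR % 4 == 1 { if (++count > n) exit; print ">" substr($0, 2) }
        # Line 2 of each record is the sequence.
        NR % 4 == 2 { print }'
}

# Example usage (path taken from the commands above; the output name is hypothetical):
# fastq_head_to_fasta "$Fastq_dir/sample_1.fastq.gz" 100 > sample_1.first100.fa
```

If most of those reads hit another organism, or rRNA or some contaminant, that would explain a mapping rate this low regardless of the aligner settings.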