Hello!
I'm using the HISAT2 tool to map a paired-end (PE) RNA-Seq dataset. The reference genome I want to align against is GRCh37. I downloaded the pre-built genome_snp_tran index and ran hisat2 with it. On the other hand, I also tried to create my own index using the following command:
hisat2-build -p 6 GRCh37-75/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa HISAT_index/
I have run StringTie on both BAM files and the results are quite different. Can anyone help me understand why? Is it because I did not use SNP and transcript information when generating my own index? The pre-built one is described as: genome_snp_tran: HGFM index for reference plus SNPs and transcripts
Is it better to use the pre-built indexes rather than creating new ones?
EDIT
Here are some of the differences seen in the StringTie output for the two BAM files:
- Number of reference transcripts (ENST....) reported in the *assembled_transcripts.gtf file: 64500 (pre-built) / 44116 (own)
- 1442 transcripts missing in pre-built but present in own.
- 21826 transcripts missing in own but present in pre-built.
- 42673 transcripts in common. Concordance of their FPKM values (pre-built / own):
avg = 6.819 / 7.421 std = 82.219 / 90.447 max = 8040.339 / 9355.599
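A comparison of this kind can be reproduced roughly as follows. This is a minimal sketch, not the exact procedure behind the numbers above: the function names are made up for illustration, and it assumes every transcript line in the StringTie GTF carries transcript_id "..." and FPKM "..." attributes.

```shell
TAB=$(printf '\t')

# Print "transcript_id<TAB>FPKM" for each transcript record of a StringTie GTF,
# sorted by ID so that join(1) can be used on the result.
extract_fpkm() {
    awk -F'\t' '$3 == "transcript" {
        id = ""; fpkm = ""
        if (match($9, /transcript_id "[^"]+"/)) id = substr($9, RSTART + 15, RLENGTH - 16)
        if (match($9, /FPKM "[^"]+"/)) fpkm = substr($9, RSTART + 6, RLENGTH - 7)
        if (id != "" && fpkm != "") print id "\t" fpkm
    }' "$1" | LC_ALL=C sort -t "$TAB" -k1,1
}

# Number of transcript IDs present in both tables.
count_shared() { LC_ALL=C join -t "$TAB" "$1" "$2" | wc -l | tr -d ' '; }

# Number of transcript IDs present in the first table only.
count_only_in_first() { LC_ALL=C join -t "$TAB" -v 1 "$1" "$2" | wc -l | tr -d ' '; }

# Side-by-side values for shared transcripts: ID, FPKM (first), FPKM (second).
paired_fpkm() { LC_ALL=C join -t "$TAB" "$1" "$2"; }

# Hypothetical usage (file names are placeholders):
#   extract_fpkm prebuilt/assembled_transcripts.gtf > prebuilt.tsv
#   extract_fpkm own/assembled_transcripts.gtf > own.tsv
#   count_shared prebuilt.tsv own.tsv
#   count_only_in_first own.tsv prebuilt.tsv
```

The tables are sorted in the C locale because join expects both inputs in the same byte order; mixing locales between sort and join can silently drop matches.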
EDIT 2:
Let's take a look at a particular transcript that gives completely different results in StringTie depending on the hisat2 index used: ENST00000331789
Results for pre-built index:
Format: Chr start end transcript_id gene_id FPKM
7 5566787 5570232 ENST00000425660 ENSG00000075624 646.288086
7 5566782 5570340 ENST00000331789 ENSG00000075624 20.998787
7 5566787 5570232 ENST00000462494 ENSG00000075624 1949.242188
7 5567742 5570233 ENST00000484841 ENSG00000075624 179.878967
7 5567372 5569294 ENST00000493945 ENSG00000075624 13.227101
7 5568223 5603415 ENST00000432588 ENSG00000075624 15.464212
7 5568866 5569613 ENST00000417101 ENSG00000075624 0.410293
7 5568101 5570221 ENST00000477812 ENSG00000075624 0.111127
7 5566782 5567729 ENST00000464611 ENSG00000075624 0.061247
7 5567781 5570235 ENST00000473257 ENSG00000075624 0.002322
7 5568698 5570214 ENST00000480301 ENSG00000075624 0.568141
Results for own index:
7 5566782 5570340 ENST00000331789 ENSG00000075624 4963.816895
7 5566787 5570232 ENST00000462494 ENSG00000075624 105.154701
7 5567372 5569294 ENST00000493945 ENSG00000075624 18.443560
7 5568101 5570221 ENST00000477812 ENSG00000075624 1.338749
7 5568698 5570214 ENST00000480301 ENSG00000075624 0.182248
In what ways are they "quite different"?
Sorry @Devon Ryan, my question wasn't clear. See my edit. I'm thinking that maybe the difference is not that big...
I believe the index you built would be equivalent to the genome index, not to the genome_snp_tran index.
Yes. So the SNP and transcript information used while building the index is "very" important? I mean, can we assume that the analysis performed using the genome_snp_tran index is better than the other?
Yes, though I would really encourage you not to use hisat2 if you care about finding splice sites unless you like tweaking settings.
Mmmm... Can you take a look at my second edit? Why, when using the pre-built index, is all the weight (expression) given to ENST00000462494, while with my own index it is given to ENST00000331789?
Have a look at the BAM files, I suspect that'll be rather more telling.
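One way to start looking at the BAM files is to count how many alignments, and how many spliced alignments, each one has over the ENSG00000075624 locus from EDIT 2. This is a rough sketch that assumes samtools is installed and both BAM files are coordinate-sorted and indexed; the function names and BAM file names are placeholders.

```shell
# Region spanned by ENST00000331789 in the tables above (chromosome named "7").
REGION="7:5566782-5570340"

# Total number of alignments overlapping the region.
count_region_reads() { samtools view -c "$1" "$REGION"; }

# Number of alignments whose CIGAR string (SAM column 6) contains N,
# i.e. reads aligned across a splice junction.
count_spliced_reads() {
    samtools view "$1" "$REGION" | awk -F'\t' '$6 ~ /N/' | wc -l | tr -d ' '
}

# Hypothetical usage (BAM names are placeholders):
#   count_region_reads prebuilt.bam;  count_spliced_reads prebuilt.bam
#   count_region_reads own.bam;       count_spliced_reads own.bam
```

If the two indexes give very different spliced-read counts here, that would point at the alignment step rather than at StringTie.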
What is the difference between genome, genome_tran and genome_snp_tran?
genome is the basic index of the genome. genome_tran additionally includes annotated splicing boundaries. genome_snp_tran additionally includes a number of SNPs, so you can (theoretically) get better alignment around them.
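For completeness, an index closer to genome_tran can in principle be built locally by passing splice sites and exons to hisat2-build. This is only a sketch under assumptions: the GTF path and output basename are placeholders, the helper scripts hisat2_extract_splice_sites.py and hisat2_extract_exons.py that ship with HISAT2 are assumed to be on the PATH, and the function name is made up. Expect such a build to need considerably more memory than a plain genome index.

```shell
# Build an index with annotated splice sites and exons (comparable to genome_tran).
#   $1 = genome FASTA
#   $2 = annotation GTF matching that genome (placeholder path)
#   $3 = output index basename, e.g. HISAT_index/genome_tran (placeholder)
build_tran_index() {
    fasta=$1
    gtf=$2
    base=$3

    # Extract splice sites and exons from the annotation, then feed both to hisat2-build.
    hisat2_extract_splice_sites.py "$gtf" > "$base.ss" &&
    hisat2_extract_exons.py "$gtf" > "$base.exon" &&
    hisat2-build -p 6 --ss "$base.ss" --exon "$base.exon" "$fasta" "$base"
}

# For a genome_snp_tran-like index, hisat2-build additionally accepts
# --snp and --haplotype files (not shown here).

# Hypothetical usage:
#   build_tran_index GRCh37-75/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
#       annotation.gtf HISAT_index/genome_tran
```

Note that the last argument to hisat2-build is a basename for the index files, not just a directory.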