Question

How to make chromosome names consistent across genome reference, annotation and index files?

0

6.2 years ago

Vasu ▴ 790

Hi,

I have a genome reference and annotation file from GENCODE, and a hisat2 index from the HISAT2 website:

Genome reference: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/GRCh38.p10.genome.fa.gz

GENCODE gene annotation: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz

hisat2 index: genome_snp_tran (GRCh38) from https://ccb.jhu.edu/software/hisat2/index.shtml

I see that the chromosome names in the genome FASTA and the GENCODE GTF file start with chr, but the names in the hisat2 index do not.

Should I remove chr from the FASTA and GTF files, or should I build my own hisat2 index?

RNA-Seq hisat2 genome gencode annotation • 2.9k views

ADD COMMENT • link 6.2 years ago by Vasu ▴ 790

3

As ATpoint mentioned, both are indeed technically OK.

I would additionally suggest evaluating how many dependencies each of the file types (FASTA etc.) has. If you also have a genome browser, BLAST databases and so on tied to them, and those are plentiful, it may be more feasible to rebuild the hisat2 index than to start removing chr from a whole list of files and related resources.

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

0

If I build the hisat2 index on my own, will that index have chr in it?

Do you think this is the right way to build the hisat2 index? I got this from the HISAT2/StringTie paper.

First, extract splice sites and exons from the annotation:

extract_splice_sites.py data/gencode.v27.annotation.gtf > gencode.v27.annotation.ss
extract_exons.py data/gencode.v27.annotation.gtf > gencode.v27.annotation.exon

Second, build a HISAT2 index:

hisat2-build --ss gencode.v27.annotation.ss --exon gencode.v27.annotation.exon data/genome/GRCh38.p10.genome.fa grch38_tran

ADD REPLY • link 6.2 years ago by Vasu ▴ 790

1

If I build the hisat2 index on my own, will that index have chr in it?

If your fasta has "chr" then your index will have "chr".

"chr" is just a part of the name of chromosomes. If your fasta file looks like

>myfabulouschromosome1
ACGT

then that is perfectly valid. Just make sure your annotation and fasta file use the same myfabulouschromosome1 notation.
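One way to check this before building anything is to compare the sequence names in the two files. The sketch below is only an illustration: the function names `fasta_names`, `gtf_names` and `names_missing_from_fasta` are made up here, and the tiny demo files stand in for the real FASTA and GTF.

```shell
#!/bin/sh
# Sketch: list sequence names used in a GTF that are absent from a FASTA.
# Function names and demo files are illustrative, not part of any tool.
set -eu

# Sequence names from FASTA headers: text after '>' up to the first whitespace.
fasta_names() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Sequence names from column 1 of a GTF, skipping '#' comments and
# non-feature lines such as UCSC "track" headers (fewer than 9 tab fields).
gtf_names() {
  awk -F'\t' '!/^#/ && NF >= 9 { print $1 }' "$1" | sort -u
}

# Names present in the GTF but not in the FASTA (empty output = consistent).
names_missing_from_fasta() {
  fasta_list=$(mktemp)
  fasta_names "$1" > "$fasta_list"
  # grep exits 1 when nothing is printed, which is the "all good" case here.
  gtf_names "$2" | grep -vxF -f "$fasta_list" || true
  rm -f "$fasta_list"
}

# Demo on two tiny stand-in files: the GTF uses both "chr1" and "1".
demo_dir=$(mktemp -d)
printf '>chr1 1\nACGT\n>chr2 2\nACGT\n' > "$demo_dir/genome.fa"
printf '##format: gtf\nchr1\tHAVANA\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n1\tHAVANA\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$demo_dir/annotation.gtf"
names_missing_from_fasta "$demo_dir/genome.fa" "$demo_dir/annotation.gtf"   # prints: 1
```

The same comparison should work against an existing index if you first dump its sequence names to a file; as far as I know `hisat2-inspect -n <index_basename>` prints them, but check the manual for your version.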

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Yes, this is how my GTF and FASTA files look.

Gencode GTF file:

##description: evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90)
##provider: GENCODE
##contact: gencode-help@sanger.ac.uk
##format: gtf
##date: 2017-08-01
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

Fasta file:

>chr1 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I also have another GTF, from LNCipedia:

##gtf
track name='LNCipedia' description='LNCipedia 5.0 - www.lncipedia.org' color=73,157,74 db=hg38 url='http://lncipedia.org/db/transcript/$$'
chr1    lncipedia.org   exon    83801516        83803251        .       -       .       gene_id LINC01725 ; transcript_id LINC01725:19 ; gene_alias_1 ENSG00000233008 ; gene_alias_2 RP11-475O6.1 ; gene_alias_3 ENSG00000233008.1 ; gene_alias_4 OTTHUMG00000009930.1 ; gene_alias_5 ENSG00000233008.5 ; gene_alias_6 LINC01725 ; gene_alias_7 LOC101927560 ; transcript_alias_1 ENST00000457273 ; transcript_alias_2 ENST00000457273.1 ; transcript_alias_3 RP11-475O6.1-005 ; transcript_alias_4 OTTHUMT00000027496.1 ; transcript_alias_5 NONHSAT004171 ; transcript_alias_6 NR_119374 ; transcript_alias_7 ENST00000457273.5 ; transcript_alias_8 NR_119374.1 ;

ADD REPLY • link 6.2 years ago by Vasu ▴ 790

0

Hi Wouter,

I have an error while building the index.

hisat2-build -ss gencode.v27.annotation.ss --exon gencode.v27.annotation.exon GRCh38.p10.genome.fa genome_tran
## Thu Sep 27 17:21:53 CEST 2018
Settings:
  Output files: "GRCh38.p10.genome.fa.*.ht2"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Local sequence length: 57344
  Local sequence overlap between two consecutive indexes: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: enabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  gencode.v27.annotation.ss
Reading reference sizes
  Time reading reference sizes: 00:00:01
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
Reference file does not seem to be a FASTA file
  Time to join reference sequences: 00:00:00
Total time for call to driver() for forward index: 00:00:01
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 -ss --exon gencode.v27.annotation.exon gencode.v27.annotation.ss GRCh38.p10.genome.fa genome_tran 
Deleting "GRCh38.p10.genome.fa.1.ht2" file written during aborted indexing attempt.
Deleting "GRCh38.p10.genome.fa.2.ht2" file written during aborted indexing attempt.
Deleting "GRCh38.p10.genome.fa.3.ht2" file written during aborted indexing attempt.
Deleting "GRCh38.p10.genome.fa.4.ht2" file written during aborted indexing attempt.
Deleting "GRCh38.p10.genome.fa.5.ht2" file written during aborted indexing attempt.
Deleting "GRCh38.p10.genome.fa.6.ht2" file written during aborted indexing attempt.
Deleting "GRCh38.p10.genome.fa.7.ht2" file written during aborted indexing attempt.
Deleting "GRCh38.p10.genome.fa.8.ht2" file written during aborted indexing attempt.
## Thu Sep 27 17:21:54 CEST 2018

What could be the problem here?

ADD REPLY • link 6.2 years ago by Vasu ▴ 790

1

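It should be --ss and not -ss. With -ss, hisat2-build took gencode.v27.annotation.ss as the reference FASTA (see "Input files DNA, FASTA" in your log), hence the "Reference file does not seem to be a FASTA file" error.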

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Oh yes, so sorry. Thank you!

ADD REPLY • link 6.2 years ago by Vasu ▴ 790

1

Both are OK, but removing chr is probably much faster than rebuilding the index.
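If you go the renaming route, a minimal sketch could look like the following. It assumes the prebuilt index uses Ensembl-style names (1, 2, ..., X, Y, MT), which is worth confirming against the index itself; the function names and demo files are invented for illustration. GENCODE calls the mitochondrial genome chrM, so under that assumption it is mapped to MT rather than just stripped.

```shell
#!/bin/sh
# Sketch: rewrite GENCODE-style names (chr1, chrM) to Ensembl-style (1, MT).
# ASSUMPTION: the target index uses Ensembl-style names; verify before use.
set -eu

# FASTA: rename only the first word of header lines, leave sequence as-is.
strip_chr_fasta() {
  awk '/^>/ {
         name = substr($1, 2)
         if (name == "chrM") name = "MT"; else sub(/^chr/, "", name)
         $1 = ">" name
       }
       { print }' "$1"
}

# GTF: rename column 1 of feature lines; pass comments and track lines through.
strip_chr_gtf() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ || NF < 9 { print; next }
       { if ($1 == "chrM") $1 = "MT"; else sub(/^chr/, "", $1); print }' "$1"
}

# Demo on tiny stand-in files.
demo_dir=$(mktemp -d)
printf '>chr1 1\nACGT\n>chrM MT\nACGT\n' > "$demo_dir/genome.fa"
printf '##format: gtf\nchr1\tHAVANA\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\nchrM\tENSEMBL\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$demo_dir/annotation.gtf"
strip_chr_fasta "$demo_dir/genome.fa" | grep '^>'                    # prints: >1 1 and >MT MT
strip_chr_gtf "$demo_dir/annotation.gtf" | grep -v '^#' | cut -f1    # prints: 1 and MT
```

On the real files you would redirect each function's output to a new file and keep the originals. Any names that still do not appear in the index after renaming (scaffolds or patches, for example) would need separate handling.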

ADD REPLY • link 6.2 years ago by ATpoint 85k