HISAT2 - soft-masked genome from UCSC
7.4 years ago
lshepard ▴ 480

Hi,

I recently started using genomes from UCSC, but they seem to offer only soft-masked and hard-masked versions. I do not want masking to affect my RNA-seq alignments, so I wanted to check whether HISAT2 treats lower-case sequence the same as upper-case sequence, allowing reads to map to the entire genome, including repetitive regions.

I could not find this information in the documentation, sorry if I missed it!

Thank you in advance.

RNA-Seq • 3.9k views
7.4 years ago

If you take a look at the HISAT2 source code (which nobody should expect you to do), it appears that all FASTA characters are converted to their uppercase representation when the reference file is read, so soft-masked (lower-case) bases should be treated the same as upper-case ones:

# ref_read.cpp
while(c != -1 && c != '>') {
    if(rparms.nsToAs && asc2dnacat[c] >= 2) c = 'A';
    uint8_t cat = asc2dnacat[c];
    int cc = toupper(c);  // <-- lower-case bases are uppercased here
    ...
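
To make the effect of that `toupper` call concrete, here is a minimal standalone sketch. It is not HISAT2 code: the helper name `uppercase_bases` is hypothetical, and it only illustrates what uppercasing does to a soft-masked sequence, assuming the reference bases are plain ASCII.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of HISAT2): uppercase every base so that
// soft-masked (lower-case) repeats look identical to unmasked sequence.
std::string uppercase_bases(const std::string& seq) {
    std::string out;
    out.reserve(seq.size());
    for (unsigned char c : seq) {
        // Iterating as unsigned char keeps std::toupper well-defined
        // for any byte value.
        out.push_back(static_cast<char>(std::toupper(c)));
    }
    return out;
}
```

With this sketch, a soft-masked stretch such as `ACGTacgt` would come out as `ACGTACGT`. Hard-masked bases are a different matter: an `N` stays an `N` after uppercasing, so the original sequence in a hard-masked genome cannot be recovered this way.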

Thanks Matt! It would have been nice to have this made clear in the manual, but I have just noticed that some of their pre-built indexes are from UCSC, so that is reassuring.
