Hello everyone,
I have a couple of questions about which genome source I should use to build an index for aligning reads with HISAT2.
The first is whether I should build the index from a "toplevel" file from Ensembl Plants (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-55/plants/fasta/zea_mays/dna/) or from the NCBI assembly (https://www.ncbi.nlm.nih.gov/genome/?term=zea+mays).
If Ensembl Plants is the right choice, which of these files would be best?
Zea_mays.Zm-B73-REFERENCE-NAM-5.0.dna.toplevel.fa.gz 615M
Zea_mays.Zm-B73-REFERENCE-NAM-5.0.dna_rm.toplevel.fa.gz 123M
Zea_mays.Zm-B73-REFERENCE-NAM-5.0.dna_sm.toplevel.fa.gz 641M
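Description:
'dna' - unmasked genomic DNA.
'dna_rm' - hard-masked genomic DNA (repeats and low-complexity regions replaced with 'N').
'dna_sm' - soft-masked genomic DNA (repeats and low-complexity regions converted to lowercase).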
Could you please help me clarify this?
I have read that in masked genomes, low-complexity and repetitive regions of DNA are detected and replaced with 'N'. Would you suggest using the unmasked genome instead?
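In case it helps to see the difference between the three files, here is a minimal Python sketch for checking how a FASTA file is masked by counting 'N' bases (hard-masked) and lowercase a/c/g/t bases (soft-masked). The function name masking_summary is my own hypothetical helper, not part of HISAT2 or Ensembl, and it assumes a plain or gzip-compressed FASTA file.

```python
import gzip


def masking_summary(path):
    """Count total, hard-masked (N/n) and soft-masked (lowercase acgt) bases.

    Hypothetical helper for illustration; assumes a FASTA file that is
    either plain text or gzip-compressed (name ending in .gz).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = {"total": 0, "hard_masked": 0, "soft_masked": 0}
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # Header line: sequence name, not sequence data.
                continue
            seq = line.strip()
            counts["total"] += len(seq)
            counts["hard_masked"] += seq.count("N") + seq.count("n")
            counts["soft_masked"] += sum(1 for base in seq if base in "acgt")
    return counts
```

Running it on each of the three downloads (for example masking_summary("Zea_mays.Zm-B73-REFERENCE-NAM-5.0.dna_rm.toplevel.fa.gz")) should show whether the masked positions appear as 'N' or as lowercase letters. It reads the file line by line, so memory use stays small, but a full genome will take a while to scan.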
Thank you for your comments.