Script from @biocyberman is great. For those who are still looking for this, here are steps:
1) Get the reference hg38 genome, in uncompressed fasta format (obviously)
2) Get the 'gemtools':
a) SourceForge --> Version 2, pre release
or
b) GitLab --> Version 1.7
3) Create the index of the genome by using following command:
gemtools index -i <referenceGenome.fasta:PATH> -t <NumberOfThreadsToUse:int>
e.g.:
gemtools index -i GRCh38.p12.fasta -t 40
4) Once, the index is made, generate the mappability data like:
gem-mappability -I <referenceGenome_index.gem:PATH> -l <K-mer_length:int> -o <output_name.gem:PATH>
e.g.: For 100 mer, the command will be:
gem-mappability -I GRCh38.p12.gem -l 100 -o GRCh38_mappability_100mer.gem
The complete guide is available at this wiki
What do you mean GEM doesn't have a library for this? That's what
gem-mappability
is for. I just downloaded the binary and it's there.My mistake !! I looked it in binary-pre-release-2, which does not have the binary for mappability but release 3 has it. Thanks !!
As of September 2017 the gem-mappability binary is missing from the GEM build. The note on the GEM wiki dated 2012 states "mappability tools are still in the process of being migrated to the new mapping engine. The mappability will come back online shortly." (wiki for the GEM library). There is also no ENCODE version of the mappability track for hg38. Is my best bet really doing a liftOver on wgEncodeCrgMapabilityAlign100mer.bigWig from hg19 to hg38?
"rn6.softmask.all.fa" is the reference genome file with repetitive regions soft masked.
many thanks, I understand what the rn6.softmask.all.fa file is. I dont undestand what rn6.softmask.all file is (the line above) Are you saying they are the same thing?
Well, that line is just the declaration of variable named "pref" which is used in code below wherever the file name prefix is required. Simply put, its used to create file name prefix.
For example, below line will assign variable "idxpref" with value "rn6.softmask.all_index"
idxpref="${pref}_index"
oh! thanks so much! I assumed i was missing an index file of some sort!!
thanks again..
If you are working with UCSC hg19 there are pre-built mappability files from ENCODE available here.
I am running the following code:
gem-indexer -i Homo_sapiens-GRCh38.p13_contigs.fa -o Homo_sapiens-GRCh38.p13_contigs -threads 8
And I am getting this error:
GEM::Error (archive_builder_text.c:180,archive_builder_text_process_character) MultiFASTA parsing (genome/Homo_sapiens-GRCh38.p13/Homo_sapiens-GRCh38.p13_contigs.fa:46604068). Invalid character found ('>')
Any idea how to solve it?
May be Homo_sapiens-GRCh38.p13_contigs.fa file has the ">" as a part of sequence instead of newline. Check the file once.