Hi all:
I am trying to run an analysis using GRCh37.fa as the reference genome. After running the command
pureclip -i aligned.f.duplRm.pooled.R2.bam -bai aligned.f.duplRm.pooled.R2.bam.bai -g GRCh37.fa -iv "1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;" -nt 10 -o PureCLIP.crosslink_sites.bed
I received an error:
ERROR: Can't load reference sequence from file 'GRCh37.fa': Unexpected character 'M' found.
The developer gave me the following advice:
The problem is coming from an external library which is used and which expects the reference sequence to contain only the letters 'A', 'C', 'G', 'T' or 'N'. I know it is not ideal, but if you convert all non-ACGTs to Ns, the problem should be solved.
Can anyone show me how to convert all non-ACGT characters to Ns so that I can give it a try?
Thanks,
It is indeed strange that your reference contains the letter M. As a first step, I would double-check that the reference contains nucleotides and not amino acids. Once you are sure that this is the case, you can use pyfasta or some other tool, depending on which programming language you are proficient in.
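If Python is an option, here is a minimal sketch of both steps using only the standard library. It is not taken from PureCLIP or pyfasta; the function names (count_symbols, mask_line, mask_fasta) and the output file name in the usage note are made up for illustration. For context, M is also a valid IUPAC nucleotide ambiguity code (A or C), so counting the symbols first should show whether the file contains only nucleotide codes. The sketch assumes lowercase (soft-masked) a, c, g, t and n are acceptable and leaves them unchanged; I have not checked whether PureCLIP accepts lowercase bases.

```python
import re
from collections import Counter

# Any character that is not A, C, G, T or N (upper or lower case).
NON_ACGTN = re.compile(r"[^ACGTNacgtn]")


def count_symbols(lines):
    """Count sequence characters (upper-cased), skipping FASTA header lines."""
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            counts.update(line.strip().upper())
    return counts


def mask_line(line):
    """Replace non-ACGTN characters with N; header lines pass through unchanged."""
    if line.startswith(">"):
        return line
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # keep the original line ending
    return NON_ACGTN.sub("N", body) + ending


def mask_fasta(in_path, out_path):
    """Stream a FASTA file line by line so the whole genome is never held in memory."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(mask_line(line))
```

To try it, something like mask_fasta("GRCh37.fa", "GRCh37.masked.fa") should write a masked copy (the output name is just a placeholder), which you would then pass to -g. Calling count_symbols(open("GRCh37.fa")) beforehand shows which unexpected letters occur and how often.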
It is a reference genome. Thank you for your comment. I used a different reference genome and it generated a BED file. However, I could not see any peaks when I loaded the BED file into IGV. I guess I need to ask around.
Thanks,