Question

Can't load reference sequence from file 'GRCh37.fa': Unexpected character 'M' found.

1

Entering edit mode

4.0 years ago

Kai_Qi ▴ 130

Hi all:

I am trying to do an analysis using GRCh37.fa as reference genome. After running command

pureclip -i aligned.f.duplRm.pooled.R2.bam -bai aligned.f.duplRm.pooled.R2.bam.bai -g GRCh37.fa -iv "1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;" -nt 10 -o PureCLIP.crosslink_sites.bed

I received an error:

ERROR: Can't load reference sequence from file 'GRCh37.fa': Unexpected character 'M' found.

I got advice from the developer as:

The problem is coming from an external library which is used and which expects the reference sequence to contain only the letters 'A', 'C', 'G', 'T' or 'N'. I know it is not ideal, but if you convert all non-ACGTs to Ns, the problem should be solved

Does anyone can teach me how to convert all non-ACGT to Ns so that I will be able to give it a try?

Thanks,

next-gen-sequencing RNA-Seq • 1.8k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 4.0 years ago by Kai_Qi ▴ 130

0

Entering edit mode

It is indeed strange that your reference contains the letter M. As a first step I would double check that the reference contains nucleotides and not amino acids. Once you are sure that this is the case you can use pyfasta of some other tool depending in which programming language you are proficient.

ADD REPLY • link 4.0 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

It is a reference genome. Thank you for your comment. I used a differential reference genome and it generated bed file. Nevertheless, I could not see peaks when load the bed to IGV. I guess I need to ask around.

Thanks,

ADD REPLY • link 4.0 years ago by Kai_Qi ▴ 130

score 0 · Answer 1 · 2022-02-02

0

Entering edit mode

2.8 years ago

swathishekaran93 • 0

Hi Kai, Got the same error in Pureclip using the reference genome fasta file downloaded from UCSC. To get the clean reference genome, you can try this:

cat hg38as.fa | sed '/^\>/! s/[RYKMSWBVHD]/N/g' > hg38as_clean.fa

It basically replaces any letters in the character set [RYKMSWBVHD] with an N, only in the lines that are not the header aka do not start with >. This fixed the Pureclip error for me.