I've been facing a problem with wildcards (IUPAC ambiguity codes) when using GEM mapper to map RNA-seq short reads against a custom reference sequence (index).
As far as I understood by reading the documentation of gem-indexer and gem-mapper, it would be OK if I had N's either in my index or in my reads. However, I made a small example of an index and a paired-end read with N's, and the mapping isn't working as I expected. GEM returns mismatches at the positions in which there is an N, instead of considerig N-[AGCT] as a perfect match.
So, using the example indices and reads below, I get mismatches when trying to map a paired-end read without N's against an index with N's, and vice-versa.
Is there any argument I should be using in order to consider N's as wildcards and return N-[AGCT] as a perfect match?
INDEX with N:
ATGGCCGTCATGGCGCCCCGAACCCTCGTCCTGCTACTCTCGGGGGCTCTGGCCCTGACCCA
GACCTGGGCGGGCTCTCACTCCATGAGGTATTTCTACACCTCCGTGTCCCGGCCCGGCCGCG
GGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGC
GACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGTCCGGAGTA
TTGGGACGGGGAGACACGGAAAGTGAAGGCCCACTCACAGACTCACCGAGTGGACCTGGGGA
CCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTTCTCACACCGTCCAGAGGATGTATGGC
TGCGACGTGGGGTCGGACTGGCGCTTCCTCCGCGGGTACCACCAGTACGCCTACGACGGCAA
GGATTACATCGCCCTGAAAGAGGACCTGCGCTCTTGGACCGCGGCGGACATGGCAGCTCAGA
CCACCAAGCACAAGTGGGAGGCGGCCCATGTGGCGGAGCAGTTGAGAGCCTACCTGGAGGGC
ACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCACGGA
NGCCCCCAAAACGCATATGACTCACCANGCTGTCTCNGACCATGAAGCCACCCTGAGGTGCT
GGGCCCTGAGCT
INDEX without Ns:
ATGGCCGTCATGGCGCCCCGAACCCTCGTCCTGCTACTCTCGGGGGCTCTGGCCCTGACCCA
GACCTGGGCGGGCTCTCACTCCATGAGGTATTTCTACACCTCCGTGTCCCGGCCCGGCCGCG
GGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGC
GACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGTCCGGAGTA
TTGGGACGGGGAGACACGGAAAGTGAAGGCCCACTCACAGACTCACCGAGTGGACCTGGGGA
CCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTTCTCACACCGTCCAGAGGATGTATGGC
TGCGACGTGGGGTCGGACTGGCGCTTCCTCCGCGGGTACCACCAGTACGCCTACGACGGCAA
GGATTACATCGCCCTGAAAGAGGACCTGCGCTCTTGGACCGCGGCGGACATGGCAGCTCAGA
CCACCAAGCACAAGTGGGAGGCGGCCCATGTGGCGGAGCAGTTGAGAGCCTACCTGGAGGGC
ACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCACGGA
AGCCCCCAAAACGCATATGACTCACCAAGCTGTCTCAGACCATGAAGCCACCCTGAGGTGCT
GGGCCCTGAGCT
READ 1
@read.1
GAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCG
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
READ 2 with N
@read.2
AGCTCAGGGCCCAGCACCTCAGGGTGGCTTCATGGTCNGAGACAGCNTGGTGAGTCATATGCGTTTTGGGGGCNT
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
READ 2 without N
@read.2
AGCTCAGGGCCCAGCACCTCAGGGTGGCTTCATGGTCTGAGACAGCTTGGTGAGTCATATGCGTTTTGGGGGCTT
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I'm running gem with the following commands:
GEM Indexer:
gem-indexer -i index_withN.fasta -o index_withN
GEM Mapper
# the paired-end read will not map with this command:
gem-mapper -I index_withN.gem -p -1 read1.fastq -2 read2_withoutN.fastq -q ignore -m 0 -o test
# the paired-end read will map with mismatches at N's with this command:
gem-mapper -I index_withN.gem -p -1 read1.fastq -2 read2_withoutN.fastq -q ignore -m 0 -o test --mismatch-alphabet ACGNT
it is probably best if you asked the author, it seems like a bug IMO, let us know how it turns out
I already did. I sent an email to GEM and gemTools authors weeks ago, but unfortunately they didn't reply.
Did you ever got this problem solved? Just that I am also looking for an RNA-seq read mapper that would not consider mapped positions to an N reference as a mismatch? Are you aware of any other mappers that would do it correctly perhaps?