Question

how to handle reads that mapped to multiple genes

1

Entering edit mode

6.8 years ago

Ana ▴ 200

Hello every one, this is a question related to my previous question that I posted. I have aligned a bunch of whole genome shot reads against Pangenome proteins by using DIAMOND protein aligner. After filtering for e-value, identity percentage and bit score, I still get some reads which perfectly mapped to multiple genes, below is one the examples:

HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha12_00033467   100.0   33  0   0   100 2   91  123 7.8e-12 65.9
HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha10_00000535   100.0   33  0   0   100 2   116 148 7.8e-12 65.9
HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha7_00045828    100.0   33  0   0   100 2   557 589 7.8e-12 65.9
HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha17_00008591   100.0   33  0   0   100 2   118 150 7.8e-12 65.9
HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha15_00038715   100.0   33  0   0   100 2   173 205 7.8e-12 65.9

The purpose of my study is to find which genes from the Pangenome data is present in my samples and which gene are absent. I just want to know is there any way that I could identify a read originated from which of these genes? Thanks so much for any suggestion and help

alignment whole genome sequencing • 1.3k views

ADD COMMENT • link updated 6.8 years ago by h.mon 35k • written 6.8 years ago by Ana ▴ 200

score 1 · Answer 1 · 2018-01-25

Are these orthologous genes (genes in different species that have evolved from a common ancestral gene via speciation)? While the read you have may map perfectly to a part, the Ha* may have differences when looked in entirety.

Instead of looking at individual reads perhaps you should try to assemble them and then use the assembled contigs to see what they are most similar to or what their distribution looks like.

Where did the pan-genome come from?

score 0 · Answer 2 · 2018-01-25

I don't know if it would work, but you may try to output the DIAMOND alignments to SAM output, then use Salmon in alignment-based mode to quantify the reads. Salmon will use a expectation-maximization algorithm to optimally allocate multi-mappers counts to genes.

Another approach would be you (or a collaborator) develop an expectation-maximization algorithm for your particular use case.