Hello everyone, I have aligned a bunch of whole genome shotgun sequencing data against Pangenome protein sequences by using DIAMOND protein aligner to find out which genes are present and which are absent in my DNA sequence data . I am sorry if my question is dumb, but I am new in alignment and learning. I have some genes which are 100% matched with mutiple DNA sequences, please see the example below! (this the output of DIAMOND which is Blast tabular file format)
HWI-ST913:300:C5W5DACXX:7:1207:13092:33669 Ha15_00039042 100.0 24 0 0 3 74 183 206 1.5e-07 51.6
HWI-ST913:300:C5W5DACXX:7:1208:1383:39276 Ha15_00039042 100.0 26 0 0 22 99 175 200 1.3e-09 58.5
HWI-ST913:300:C5W5DACXX:7:1208:3015:81325 Ha15_00039042 100.0 29 0 0 2 88 178 206 1.1e-10 62.0
I want to know what does this mean? Is this something that I should expect or something is going wrong? Thanks for any help or providing me with some references to learn about this!
It's three reads that were mapped to the same gene, to different positions along the gene (183-206, 175-200 and 178-206). Looks like your alignment is working as it should.
Thanks a lot Asaf, got it now. Another little question, how about if the same read matches to multiple genes? like this: could it be possible?
Reads can certainly multi-map. Think about multi-copy genes (e.g. rDNA repeat) that may be present in multiple locations on the genome.
Thanks @genomax, very helpful. Then my last question is that if the same read matches to multiple genes then can I say all these genes present in my DNA sequence data-set? Because the final aim of doing this alignment is to understand which gene is present and which gene is absent in my DNA sequence data-set? Thanks a lot.
No.
With the information you have (i.e. sequence of the read) you can't say for sure which copy of the gene (multiple) it originally came from. It could be from any one of them. There is also a possibility that the read could be aligning there by chance (due to sequence similarity of that piece) and may not have originated from the gene it is aligned to.
Note: While mapping you have multiple choices on how to handle reads the multi-map. Following is an example of options from BBMap suite (all aligners may not have same options).
In RNAseq , multi-mapping reads are generally not counted by default.