I want to map a small number of reads against a comprehensive database (I used nt) to understand to which organism they do belong (as a confirmation/refinement after using metagenomics software). I can do that in blast, but I though that magic-blast could be better. The results I got are reasonable as far as the reads provenance is concerned, but I have some issue in understanding fully the SAM output format.
Assuming nt is the name of the local nt database I issued this command.
magicblast -query read_1.fastq -query_mate read_2.fastq -infmt fastq -db nt -no_unaligned -num_threads 8 -outfmt sam -out myout.sam
The sam file for a selected read pair looks like this
D00352:469:CD2JDANXX:6:2211:12452:28036 73 gi|400173048 1322861 255 86M39S * 0 0 TTATTCGGCGTGGTCGCACCGCTGGCGATCACCGGAGCATCAATCGCCAGCAGTGAGCCGGTCAGGGTGACGCTGGTGGAAACAATACGCTGCGGCTGGCTTTCCAGTGTATGTGTGCCACGGCT * NH:i:4 AS:i:56 NM:i:6 MD:Z:29A5C8T8G14C11G5
D00352:469:CD2JDANXX:6:2211:12452:28036 345 gi|723220061 3480222 255 39S86M * 0 0 AGCCGTGGCACACATACACTGGAAAGCCAGCCGCAGCGTATTGTTTCCACCAGCGTCACCCTGACCGGCTCACTGCTGGCGATTGATGCTCCGGTGATCGCCAGCGGTGCGACCACGCCGAATAA * NH:i:4 AS:i:56 NM:i:6 MD:Z:5C11G14C8A8G5T29
D00352:469:CD2JDANXX:6:2211:12452:28036 329 gi|1020388573 2063846 255 86M39S * 0 0 TTATTCGGCGTGGTCGCACCGCTGGCGATCACCGGAGCATCAATCGCCAGCAGTGAGCCGGTCAGGGTGACGCTGGTGGAAACAATACGCTGCGGCTGGCTTTCCAGTGTATGTGTGCCACGGCT * NH:i:4 AS:i:56 NM:i:6 MD:Z:29A5C8T8G14C11G5
D00352:469:CD2JDANXX:6:2211:12452:28036 329 gi|1067038464 1215244 255 86M39S * 0 0 TTATTCGGCGTGGTCGCACCGCTGGCGATCACCGGAGCATCAATCGCCAGCAGTGAGCCGGTCAGGGTGACGCTGGTGGAAACAATACGCTGCGGCTGGCTTTCCAGTGTATGTGTGCCACGGCT * NH:i:4 AS:i:56 NM:i:6 MD:Z:29A5C8T8G14C11G5
D00352:469:CD2JDANXX:6:2211:12452:28036 153 gi|159032194 189508 255 5S20M100S * 0 0 AAGCGCACAATCCCTCCCCTCGCCCCTTTGGGGAGAGGGTTAGGGTGAGGGGAACAGCCAGCACTGGTGCGAACATTAACCCTCACCCCAGCCCTCACCCTGGAAGGGAGAGGGGGCAGAACGGC * NH:i:2 AS:i:20 NM:i:0 MD:Z:20
D00352:469:CD2JDANXX:6:2211:12452:28036 393 gi|850489648 73106717 255 100S20M5S * 0 0 GCCGTTCTGCCCCCTCTCCCTTCCAGGGTGAGGGCTGGGGTGAGGGTTAATGTTCGCACCAGTGCTGGCTGTTCCCCTCACCCTAACCCTCTCCCCAAAGGGGCGAGGGGAGGGATTGTGCGCTT * NH:i:2 AS:i:20 NM:i:0 MD:Z:20
The first four reads are read one (where the 2nd line is the reverse complement), and the last two refer to read 2. All of them - to some extent - are mapped. However, all the relevant fields of the SAM file suggest that no read has a mate mapping.
For example in column 7, we see an "*", and not the name of the contig on which the mate is mapping. Also, all the SAM flags in column 2 include the "mate not mapped" statement.
What I noticed is:
1) Reads that map on different contigs are returned as paired only if they are both unique, e.g. as the one posted below
D00352:469:CD2JDANXX:6:1109:5843:60327 81 gi|1216459868 318360 255 87S21M17S gi|1273279048 3829668 0 GGCGGTGGCAACGCCGGAAGGCTTAACAACGGTAGCAAGGTAATAAACGTGCCCGCCGCCGCCAGCCCGTAGTTC
D00352:469:CD2JDANXX:6:1109:5843:60327 161 gi|1273279048 3829668 255 125M gi|1216459868 318360 0 GTGGAGAGCAGCATCAATAGCCCTGGTCGCGCACTATGTGCCAGCTTCCCGCTGGTTAACGCACCAATAGCCGCGCCGAGCGG
2) Read mapping on the same contig (in this case it was because the reads were almost fully overlapped) are returned as mapping in pair also when they have multiple hits. See below.
D00352:469:CD2JDANXX:6:2215:4955:58969 99 gi|1250053806 2713762 255 100M25S = 4339545 1625908 TTATTGCCCGGGGATGGGGAGTAAGCTTAGTCCGCCGTTGTCTCCGTGTGTCGCGTGCGCGGTTGCACGTCAGTCTCAGACGAACCGATGA
D00352:469:CD2JDANXX:6:2215:4955:58969 147 gi|1250053806 4339545 255 125M = 2713762 -1625908 GAAAGCAATCAGCGATGGTGCTCTGACGGGTTCGAGTTCTGCTGTGATAACGGAGAGAGACTGCGTGTCACGTTCGCGCTGGA
D00352:469:CD2JDANXX:6:2215:4955:58969 353 gi|1347826457 3025751 255 100M25S = 76965 -2948661 TTATTGCCCGGGGATGGGGAGTAAGCTTAGTCCGCCGTTGTCTCCGTGTGTCGCGTGCGCGGTTGCACGTCAGTCTCAGACGA
D00352:469:CD2JDANXX:6:2215:4955:58969 401 gi|1347826457 76965 255 125M = 3025751 2948661 GAAAGCAATCAGCGATGGTGCTCTGACGGGTTCGAGTTCTGCTGTGATAACGGAGAGAGACTGCGTGTCACGTTCGCGCTGGACTGCTGTG
I already read the post by Pierre (http://plindenbaum.blogspot.com/2016/09/playing-with-magicblast-ncbi-short-read.html), and he had no problems with the number of paired reads (more than 90% were paired in his case). However, the reference that I used is very different, with a lot of small contigs (average length 3800bp) sometimes containing highly similar sequence, and the probability that the two reads of a pair map on the same contig is usually small.
Now, my questions:
1) Is this magic-blast behavior expected?
2) Am I doing something wrong?
3) Would you trust this results or go back to the good old blast? I am pretty confident that results are reliable, but maybe I am biased!