Hello all,
I am currently trying to determine the abundance of phages in metagenomic data. I used the viral sequence identification tools VirSorter2 and metaviralSPAdes on the metagenomic data, ran CheckV on my putative viral contigs, and selected the highest-quality sequences. Following this, I clustered my sequences into vOTUs using the rapid genome clustering based on pairwise ANI described on the CheckV Bitbucket page, at 95% ANI. For the abundance estimation, I ran BBMap with the following options:
bbmap.sh ref=/storage/home/vOTUs.fasta \
in=/storage/home/fastq/${SAMPLE_ID}_1.fastq.gz \
in2=/storage/home/fastq/${SAMPLE_ID}_2.fastq.gz \
rpkm=/storage/home/bbmap/${SAMPLE_ID}_rpkm.tsv \
threads=${CPUS} \
ambiguous=random \
-Xmx16g \
overwrite=t
Following this I was met with bizarre results from the mapping. For reference, the fastq files I used contained ~60,000,000 - 120,000,000 reads. Some of my viral sequences (~100,000 - 290,000 bp in length) were mapped by tens of millions of reads (as high as 33,780,491 for one sample that had only 49,049,623 mapped reads in total), with fold coverage values in the tens of thousands. My questions are: should I raise BBMap's minimum identity from the default 76% to 95% to deal with this excess read mapping? Run it in semiperfect mode? Is it possible that CheckV failed to remove microbial contamination from the viral sequences, and reads are thus being mapped to microbial genes within them?
Also, each tool was processed independently of the other, and the mapping results were more or less the same.
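For reference, the rapid ANI-based clustering step mentioned above follows the CheckV supporting scripts, roughly like this (a sketch from memory of the CheckV repository's workflow; script names, flags, and thresholds should be verified against its current documentation):

```shell
# All-vs-all BLAST of the putative viral contigs, then pairwise ANI
# calculation and greedy centroid clustering into vOTUs at 95% ANI.
makeblastdb -in viral_contigs.fasta -dbtype nucl -out viral_db
blastn -query viral_contigs.fasta -db viral_db \
    -outfmt '6 std qlen slen' -max_target_seqs 10000 \
    -out blast.tsv -num_threads 16
anicalc.py -i blast.tsv -o ani.tsv
aniclust.py --fna viral_contigs.fasta --ani ani.tsv \
    --out vOTU_clusters.tsv --min_ani 95 --min_tcov 85 --min_qcov 0
```

The cluster representatives from vOTU_clusters.tsv would then form the vOTUs.fasta reference used in the BBMap command above.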
Side note: the parameter "ambiguous" is misspelled, and the BBMap reference says the parameter name is "ambig" anyway. How is this not causing a usage error?

The actual line of code is: ...but "ambigous" definitely won't work.

Sorry, this code was copied from the notebook I am keeping and not from the shell script I actually ran.
Normally for quantification I prefer to use Seal (seal.sh) as it is faster in this kind of situation (reads multi-aligning to extremely large numbers of different ref sequences) and has more precise control over ambiguity. However, BBMap should still be producing correct output, and not multi-mapping reads with ambig=random. If you output "covstats", at least, the sum of the average fold coverage times the respective length of the contigs should equal the total number of aligned bp of reads; as long as that is true, the results should be accurate even if they seem odd. Have you tried visualizing any of this in IGV or a similar viewer to see what's going on?
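That sanity check can be scripted directly. A minimal sketch, assuming a tab-separated covstats file whose second and third columns are Avg_fold and Length (column order can vary between BBMap versions, so confirm against your file's header line first):

```shell
# Sum Avg_fold * Length across all contigs in a BBMap covstats file,
# skipping the header line. If nothing is being double-counted, the
# result should match the total number of aligned read bases.
covsum() {
  awk -F'\t' 'NR > 1 { sum += $2 * $3 } END { printf "%.0f\n", sum }' "$1"
}
```

For example, `covsum ${SAMPLE_ID}_covstats.tsv` compared against the mapped-bases figure BBMap reports on stderr.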
The minimum aligned identity shouldn't matter much here; a read that aligns at 95% identity to one location will not be considered ambiguous, nor be assigned to an alternate location where it aligns at only 93% identity.
Oh, there is one thing you may want to consider: "delcov". This is true by default, meaning that if a 150bp read aligns with a 10kb deletion, it is considered to cover 10,150 bp (the entire region it spans), which can greatly increase the apparent coverage. You can set "delcov=f" so that bases are counted as covered only when a query base lies over them, rather than a deletion event. This mainly affects RNA-seq data (due to introns), but I can see why viral data might be affected too. So the test of summing coverage times sequence length can give very different results depending on that flag. You can also set maxindel=3 or similar to prevent very long deletions from being included in alignments.
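As a sketch, the earlier command with those suggestions applied might look like the following ("delcov", "maxindel", and "covstats" are documented bbmap.sh options, but exact behavior is worth checking in your version's help text):

```shell
# Same mapping as before, but: do not count deleted reference bases as
# covered (delcov=f), cap indel length so reads cannot span huge
# deletions (maxindel=3), and emit per-contig coverage for sanity checks.
bbmap.sh ref=/storage/home/vOTUs.fasta \
    in=/storage/home/fastq/${SAMPLE_ID}_1.fastq.gz \
    in2=/storage/home/fastq/${SAMPLE_ID}_2.fastq.gz \
    rpkm=/storage/home/bbmap/${SAMPLE_ID}_rpkm.tsv \
    covstats=/storage/home/bbmap/${SAMPLE_ID}_covstats.tsv \
    ambiguous=random delcov=f maxindel=3 \
    threads=${CPUS} -Xmx16g overwrite=t
```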
Hello,
When you say the sum of the average fold coverage, do you mean the sum of the average fold across all sequences in my reference file, or the fold coverage specific to that entry? When I perform the calculation using the sum of the average fold across all sequences in the reference, I get a ~2.0x10^9 bp difference between the two totals. But when I use the average coverage value for the specific sequence, they are the same.
Additionally, I visualized the mapping in IGV for one of the vOTUs with the greatest coverage and was met with some surprising results: insert sizes on the order of 10^4 for many of the reads. When examining individual reads I noticed that the string "XT" was included in the BAM description of the alignment. I am not entirely sure what this means, but I think there are still multi-mapped reads in the alignment. I also noticed that many of the reads contained indels/mismatches at the same positions. Could this be a sign of erroneous mappings?
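One way to investigate that tag is to tally its values across the whole alignment. X* tags are reserved for local use in the SAM specification, so the meaning of XT is defined by the aligner, not the spec. A small sketch, assuming samtools is available to stream the BAM as SAM text:

```shell
# Count how often each value of the aligner-specific XT tag
# (e.g. XT:A:R) occurs in SAM-formatted lines read from stdin.
xt_tally() {
  grep -o 'XT:A:.' | sort | uniq -c | sort -rn
}
```

Usage would be something like `samtools view mapped.bam | xt_tally`.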
I also ran BBMap again with stricter parameters, which included the additional options:
The result was a lower number of mapped reads and roughly a halving of the fold coverage, but read counts were still quite high, in the tens of millions for a single sequence.
I have yet to run seal.sh but I am planning to.

Is the exact SAM tag XT:A:M? @Brian will need to chime in on the rest.

Actually the tag is XT = R, so I am not too sure what that means. I can't find any description of it in the SAM/BAM documentation. Here is the description taken from one read with that XT tag:

I've tried going through the SAM/BAM documentation, but there doesn't seem to be a reference to the XT tag, other than this forum post. If I had to guess, it is indicating a repeat alignment, but with "ambig=random" this shouldn't have been possible?

It has been some time since the last exchange here, but @Brian (author of BBMap) provided the following explanation for the "XT" tag. For further information:
How many sequences are in your reference? Since you are using "ambig=random", each read pair should map to one place per reference. My guess is you have millions of "references", so you get those high hit numbers. If you want to see more stringent alignments (perfect matches), then you may actually want to use perfect mode.

The metaviral reference should be ~900 sequences and the VirSorter2 reference ~1,300 sequences. Additionally, no special library preparation of the samples was performed, so the viral sequences identified from the MAGs would certainly be outnumbered by the microbial sequences present in the sample; thus there shouldn't be millions of reads mapping to them.
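For completeness, the perfect-mode suggestion corresponds to a flag along these lines ("perfectmode" and "semiperfectmode" appear in bbmap.sh's help text, but verify against your installed version):

```shell
# Only accept perfect alignments (use semiperfectmode=t to also allow
# reads hanging off contig ends), sidestepping identity thresholds.
bbmap.sh ref=/storage/home/vOTUs.fasta \
    in=/storage/home/fastq/${SAMPLE_ID}_1.fastq.gz \
    in2=/storage/home/fastq/${SAMPLE_ID}_2.fastq.gz \
    rpkm=/storage/home/bbmap/${SAMPLE_ID}_rpkm.tsv \
    perfectmode=t ambiguous=random threads=${CPUS} -Xmx16g
```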
Just to be certain: where did the stats that you posted above come from? Are those RPKM values or mapped-read counts?
Those are the mapped reads for the metaviralSPAdes viral sequences, obtained by importing ${SAMPLE_ID}_rpkm.tsv into R, merging the tables, and running base R's summary(). Here are the RPKM/FPKM values as well.

Attached is a basic boxplot of the number of reads mapped for each sample processed. The X-axis is just a sample ID, withheld due to identifiable info. There should be ~900 entries per sample.
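As a cross-check on the rpkm.tsv numbers, the conventional RPKM formula is reads x 10^9 / (contig length x total mapped reads). A sketch for spot-checking individual entries (this reimplements the textbook formula, which may differ in detail from how BBMap computes its column):

```shell
# Conventional RPKM: (mapped_reads * 1e9) / (contig_length * total_mapped_reads).
rpkm() {
  awk -v r="$1" -v len="$2" -v tot="$3" \
      'BEGIN { printf "%.2f\n", r * 1e9 / (len * tot) }'
}
```

For example, `rpkm 33780491 290000 49049623` gives the expected value for the extreme contig mentioned earlier, which can be compared against the corresponding row of the rpkm.tsv file.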