Hello all,
I am currently trying to determine the abundance of phages in metagenomic data. I used the viral sequence identification tools VirSorter2 and metaviralSPAdes on the metagenomic data, ran CheckV on my putative viral contigs, and selected the highest-quality sequences. Following this, I clustered my sequences into vOTUs using the rapid genome clustering based on pairwise ANI described on the CheckV Bitbucket page, at 95% ANI. For the abundance estimation, I ran BBMap with the following options:
bbmap.sh ref=/storage/home/vOTUs.fasta \
in=/storage/home/fastq/${SAMPLE_ID}_1.fastq.gz \
in2=/storage/home/fastq/${SAMPLE_ID}_2.fastq.gz \
rpkm=/storage/home/bbmap/${SAMPLE_ID}_rpkm.tsv \
threads=${CPUS} \
ambiguous=random \
-Xmx16g \
overwrite=t
Following this I was met with bizarre results from the mapping. For reference, the fastq files I used contained ~60,000,000 - 120,000,000 reads. Some of my viral sequences (~100,000 - 290,000 bp in length) were mapped by tens of millions of reads (as high as 33,780,491 for one sample that had only 49,049,623 mapped reads in total), with fold coverage values in the tens of thousands. My questions are: should I raise BBMap's minimum identity from the default 76% to 95% to deal with this excess read mapping? Run it in semiperfect mode? Is it possible that CheckV failed to remove microbial contamination from the viral sequences, and reads are thus being mapped to microbial genes within them?
Also, each tool was processed independently of the other, and the mapping results were more or less the same.
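For reference, the rapid ANI-based clustering step mentioned above follows the CheckV supporting scripts, roughly like this (a sketch from memory of the CheckV repository's workflow; script names, flags, and thresholds should be verified against its current documentation):

```shell
# All-vs-all BLAST of the putative viral contigs, then pairwise ANI
# calculation and greedy centroid clustering into vOTUs at 95% ANI.
makeblastdb -in viral_contigs.fasta -dbtype nucl -out viral_db
blastn -query viral_contigs.fasta -db viral_db \
    -outfmt '6 std qlen slen' -max_target_seqs 10000 \
    -out blast.tsv -num_threads 16
anicalc.py -i blast.tsv -o ani.tsv
aniclust.py --fna viral_contigs.fasta --ani ani.tsv \
    --out vOTU_clusters.tsv --min_ani 95 --min_tcov 85 --min_qcov 0
```

The cluster representatives from vOTU_clusters.tsv would then form the vOTUs.fasta reference used in the BBMap command above.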
Side note: the parameter "ambiguous" is misspelled, and the BBMap reference says the parameter name is "ambig" anyway. How is this not causing a usage error?

The actual line of code is: ...but "ambigous" definitely won't work.

Sorry, this code was copied from the notebook I am keeping and not from the shell script I actually ran.
Normally for quantification I prefer to use Seal (seal.sh) as it is faster in this kind of situation (reads multi-aligning to extremely large numbers of different ref sequences) and has more precise control over ambiguity. However, BBMap should still be producing correct output, and not multi-mapping reads with ambig=random. If you output "covstats", at least, the sum of the average fold coverage times the respective length of the contigs should equal the total number of aligned bp of reads; as long as that is true, the results should be accurate even if they seem odd. Have you tried visualizing any of this in IGV or a similar viewer to see what's going on?
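That sanity check can be scripted directly. A minimal sketch, assuming a tab-separated covstats file whose second and third columns are Avg_fold and Length (column order can vary between BBMap versions, so confirm against your file's header line first):

```shell
# Sum Avg_fold * Length across all contigs in a BBMap covstats file,
# skipping the header line. If nothing is being double-counted, the
# result should match the total number of aligned read bases.
covsum() {
  awk -F'\t' 'NR > 1 { sum += $2 * $3 } END { printf "%.0f\n", sum }' "$1"
}
```

For example, `covsum ${SAMPLE_ID}_covstats.tsv` compared against the mapped-bases figure BBMap reports on stderr.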
The minimum aligned identity shouldn't matter much here; a read that aligns at 95% identity to one location will not be considered ambiguous, nor be assigned to an alternate location where it aligns at only 93% identity.
Oh, there is one thing you may want to consider: "delcov". This is true by default, meaning that if a 150bp read aligns with a 10kb deletion, it is considered to cover 10,150 bp (the entire region it spans), which can greatly increase the apparent coverage. You can set "delcov=f" so that bases are counted as covered only when a query base lies over them, rather than a deletion event. This mainly affects RNA-seq data (due to introns), but I can see why viral data might be affected too. So the test of summing coverage times sequence length can give very different results depending on that flag. You can also set maxindel=3 or similar to prevent very long deletions from being included in alignments.
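As a sketch, the earlier command with those suggestions applied might look like the following ("delcov", "maxindel", and "covstats" are documented bbmap.sh options, but exact behavior is worth checking in your version's help text):

```shell
# Same mapping as before, but: do not count deleted reference bases as
# covered (delcov=f), cap indel length so reads cannot span huge
# deletions (maxindel=3), and emit per-contig coverage for sanity checks.
bbmap.sh ref=/storage/home/vOTUs.fasta \
    in=/storage/home/fastq/${SAMPLE_ID}_1.fastq.gz \
    in2=/storage/home/fastq/${SAMPLE_ID}_2.fastq.gz \
    rpkm=/storage/home/bbmap/${SAMPLE_ID}_rpkm.tsv \
    covstats=/storage/home/bbmap/${SAMPLE_ID}_covstats.tsv \
    ambiguous=random delcov=f maxindel=3 \
    threads=${CPUS} -Xmx16g overwrite=t
```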
Hello,
When you say the sum of the average fold coverage, do you mean the sum of the average fold across all sequences in my reference file, or the fold coverage specific to that entry? When I perform the calculation using the sum of the average fold across all sequences in the reference, I get a ~2.0x10^9 bp difference between the two totals. But when I use the average coverage value for the specific sequence, they are the same.
Additionally, I visualized the mapping in IGV for one of the vOTUs with the greatest coverage and was met with some surprising results: insert sizes on the order of 10^4 for many of the reads. When examining individual reads I noticed that the string "XT" was included in the BAM description of the alignment. I am not entirely sure what this means, but I think there are still multi-mapped reads in the alignment. I also noticed that many of the reads contained indels/mismatches at the same positions. Could this be a sign of erroneous mappings?
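One way to investigate that tag is to tally its values across the whole alignment. X* tags are reserved for local use in the SAM specification, so the meaning of XT is defined by the aligner, not the spec. A small sketch, assuming samtools is available to stream the BAM as SAM text:

```shell
# Count how often each value of the aligner-specific XT tag
# (e.g. XT:A:R) occurs in SAM-formatted lines read from stdin.
xt_tally() {
  grep -o 'XT:A:.' | sort | uniq -c | sort -rn
}
```

Usage would be something like `samtools view mapped.bam | xt_tally`.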
I also ran BBMap again with stricter parameters, which included the additional options:
The result was a lower number of mapped reads and roughly a halving of the fold coverage, but read counts were still quite high, in the tens of millions for a single sequence.
I have yet to run seal.sh but I am planning to.

Is the exact SAM tag XT:A:M? @Brian will need to chime in on the rest.

Actually the tag is XT = R, so I am not too sure what that means. I can't find any description of it in the SAM/BAM documentation. Here is the description taken from one read with that XT tag:

I've tried going through the SAM/BAM documentation, but there doesn't seem to be a reference to the XT tag, other than this forum post. If I had to guess, it is indicating a repeat alignment, but with "ambig=random" this shouldn't have been possible?

It has been some time since the last exchange here, but @Brian (author of BBMap) provided the following explanation for the "XT" tag. For further information:
How many sequences are in your reference? Since you are using "ambig=random", each read pair should map to one place per reference. My guess is you have millions of "references", so you get those high hit numbers. If you want to see more stringent alignments (perfect matches), then you may actually want to use perfect mode.

The metaviral reference should be ~900 sequences and the VirSorter2 reference ~1,300 sequences. Additionally, no special library preparation of the samples was performed, so the viral sequences identified from the MAGs would certainly be outnumbered by the microbial sequences present in the sample; thus there shouldn't be millions of reads mapping to them.
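For completeness, the perfect-mode suggestion corresponds to a flag along these lines ("perfectmode" and "semiperfectmode" appear in bbmap.sh's help text, but verify against your installed version):

```shell
# Only accept perfect alignments (use semiperfectmode=t to also allow
# reads hanging off contig ends), sidestepping identity thresholds.
bbmap.sh ref=/storage/home/vOTUs.fasta \
    in=/storage/home/fastq/${SAMPLE_ID}_1.fastq.gz \
    in2=/storage/home/fastq/${SAMPLE_ID}_2.fastq.gz \
    rpkm=/storage/home/bbmap/${SAMPLE_ID}_rpkm.tsv \
    perfectmode=t ambiguous=random threads=${CPUS} -Xmx16g
```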
Just to be certain: where did the stats that you posted above come from? Are those RPKM values or mapped-read counts?
Those are the mapped reads for the metaviralSPAdes viral sequences, obtained by importing ${SAMPLE_ID}_rpkm.tsv into R, merging the tables, and running base R's summary(). Here are the RPKM/FPKM values as well.

Attached is a basic boxplot of the number of reads mapped for each sample processed. The X-axis is just a sample ID, withheld due to identifiable info. There should be ~900 entries per sample.
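As a cross-check on the rpkm.tsv numbers, the conventional RPKM formula is reads x 10^9 / (contig length x total mapped reads). A sketch for spot-checking individual entries (this reimplements the textbook formula, which may differ in detail from how BBMap computes its column):

```shell
# Conventional RPKM: (mapped_reads * 1e9) / (contig_length * total_mapped_reads).
rpkm() {
  awk -v r="$1" -v len="$2" -v tot="$3" \
      'BEGIN { printf "%.2f\n", r * 1e9 / (len * tot) }'
}
```

For example, `rpkm 33780491 290000 49049623` gives the expected value for the extreme contig mentioned earlier, which can be compared against the corresponding row of the rpkm.tsv file.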