I am attempting to analyze a stool sample of unknown composition using Kraken to determine its contents. The data are paired-end Illumina reads. My analysis was as follows:
Using BBTools, I merged and deduplicated the reads:
bbmerge-auto.sh in1=R1.fastq.gz in2=R2.fastq.gz out=merged.fastq.gz outa=adapters.fa
dedupe.sh in=merged.fastq.gz out=deduped.fastq.gz
Following this I ran the merged, deduplicated file through Kraken:
kraken --preload deduped.fastq.gz > merged.kraken
kraken-translate --mpa-format merged.kraken > merged.labels
The above Kraken commands are just an example; I also ran Kraken a second time without merging the reads:
kraken --preload R1.fastq.gz R2.fastq.gz > unmerged.kraken
kraken-translate --mpa-format unmerged.kraken > unmerged.labels
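(For completeness: the commands above rely on the database being picked up from Kraken's KRAKEN_DEFAULT_DB environment variable. Spelled out explicitly, the second run would look roughly like the sketch below, with /path/to/kraken_db as a placeholder; note that --paired classifies each pair jointly, so its counts are per pair rather than per read.)

# Database path is a placeholder; --fastq-input/--gzip-compressed declare the input format
kraken --db /path/to/kraken_db --preload --fastq-input --gzip-compressed --paired R1.fastq.gz R2.fastq.gz > unmerged.kraken
kraken-translate --db /path/to/kraken_db --mpa-format unmerged.kraken > unmerged.labels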
The merged.fastq.gz file is 375,700 lines long, i.e. 93,925 reads, of which Kraken classified 3,930 (~4.2%).
Each of the separate paired-end files is 582,004 lines, i.e. 145,501 reads per file or 291,002 reads total, of which Kraken classified 150,567 (~51.7%).
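(The read counts here are just the FASTQ line counts divided by four; e.g., a quick way to count reads in a gzipped FASTQ:)

# 375700 lines / 4 lines per record = 93925 reads
echo $(( $(zcat merged.fastq.gz | wc -l) / 4 ))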
What could explain why the merged reads map so poorly while the un-merged paired-end reads map reasonably well?
You should take a few of the reads and see what is happening with the merging process. Were these reads designed to merge? Have they been pre-scanned/trimmed before merging? Have you tried using plain bbmerge.sh rather than the auto version?

From what I read, the auto version differs only in how it handles memory. These are paired-end reads, 150 bp each, with a 250 bp insert, which means they should be overlapping.
It appears that, contrary to your expectation, the insert sizes may not be what you think they are. I suggest you first try estimating them using one of the two methods from BBMap (sketched below): C: Target fragment size versus final insert size
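For example, a minimal sketch of both methods, using the file names from above (reads=1m just caps the sample size; reference.fa is a placeholder for a genome you expect to be present):

# Read-based estimate: writes the insert-size histogram to ihist.txt, no reference needed
bbmerge.sh in1=R1.fastq.gz in2=R2.fastq.gz ihist=ihist.txt reads=1m
# Mapping-based estimate against a reference
bbmap.sh in1=R1.fastq.gz in2=R2.fastq.gz ref=reference.fa ihist=ihist.txt reads=1m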
This is the result of running the memory-intensive analysis:
That looks like a good merge to me; am I missing something?
This does look reasonable and in line with what one would expect. Then why are so many reads missing from the merged file in the stats you posted above?
It appears my initial merge used a less stringent setting. I've just redone the merge using that same script and got 512,908 lines in the fastq file, i.e. 128,227 reads.
I have now rerun the new merge2 file through Kraken, and it classified only 1.93% of the merged reads. I'm really confused.
Have you trimmed the data after the merge to remove adapter sequences that may be present? Run bbduk.sh on the merged data, e.g. along the lines of the sketch below.
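A minimal sketch, reusing the adapters.fa captured by bbmerge-auto.sh above; the ktrim/k/mink/hdist values are the stock recommendations from the BBDuk guide, not tuned for this dataset:

# Right-trim adapter sequence; drop anything shorter than 50 bp afterwards
bbduk.sh in=merged.fastq.gz out=trimmed.fastq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 minlen=50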
I followed the standard adapter trimming code and it trimmed very few reads; after trimming, Kraken still classified only 1.93% of reads.
Based on the metrics above, bbmerge.sh appears to be doing its job right. Have you taken a few merged reads and then BLASTed them at NCBI to see what you get? Are they mapping properly? The BBMap suite does have some tools to assign taxonomy; let me see if I can find the guide for that.

Edit: Take a look at /bbmap/docs/guides/TaxonomyGuide.txt, included in the BBMap distribution.
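To grab a handful of merged reads for a web BLAST, something like this should work (reformat.sh ships with BBTools; reads=10 just takes the first ten records):

# Convert the first 10 merged reads to FASTA for pasting into NCBI BLAST
reformat.sh in=merged.fastq.gz out=sample.fasta reads=10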
I've redone Kraken with trimmed and untrimmed reads, and the mapping percentage is still very low if I trim. However, when I analyze the results, 86% of the untrimmed reads map to "root", which is not even domain-specific, while 100% of the trimmed reads map to Bacteria. Although that is only 1.93% of the total reads, it seems better than the untrimmed result.
It may be that many of the unmapped reads are shorter after trimming, so Kraken cannot find good matches; they may still be useful for an eventual assembly and variant checking.
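One quick way to check that hypothesis (readlength.sh ships with BBTools and writes a read-length histogram):

# A pile of very short reads in lhist.txt would support the trimming explanation
readlength.sh in=trimmed.fastq.gz out=lhist.txt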
You could also try DIAMOND against NCBI nr, if you have enough compute resources available; a rough sketch is below. Are you only analyzing a subset of reads, or did you not get a lot of data?
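A rough sketch of the DIAMOND route, where nr.gz is a placeholder for the NCBI nr protein FASTA (a very large download):

# Build the DIAMOND database once
diamond makedb --in nr.gz --db nr
# Translated (blastx) search of the merged reads; --outfmt 6 gives tabular output,
# while --outfmt 100 would write a .daa file that MEGAN can open directly
diamond blastx --db nr --query merged.fastq.gz --out merged_vs_nr.tsv --outfmt 6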
This is the full set of reads; we got between 150,000 and 250,000 demultiplexed reads. This was spike-in data, and it is mostly just a "what's there" type of thing. We are actually trying to do FACS on bacteria and wanted to test a computational method for making sure we have single bacteria, but the data I am working with were generated using a dilution method, so there are likely several bacteria present.
I will definitely try DIAMOND to see how it performs.
If these are from spike-ins of known references (e.g. ones with genomes present) I would think you should be seeing pretty high recovery w/ just R1, let alone assembled pairs.
It's definitely worth cross-checking w/ DIAMOND though; this plus taxonomic annotation (and maybe using MEGAN to get a quick visualization of the results) gives a pretty good overall picture of what is present.
A few questions:
1) What are the read lengths?
2) What is the expected insert size?
You are getting a pretty substantial drop in sequence count when merging, which makes me wonder whether the reads that do merge lie in poor-quality regions.
Read lengths are 150 bp and the expected insert is 250 bp, so the pairs should overlap by roughly 50 bp (2 × 150 − 250).