Hello !
I am currently working on transferring the metabarcoding analysis I used to do on a miseq 2x250 to a novaseq 2x150 to take advantage of the much better prices we get per reads on the latter platform.
My goal is to identify plant, bacterial or fungi species (depending on the used primers) present in my samples. As I am used to work with amplicons that are 300 to 500pb, I will encounter a merging problem as I won't have overlaps anymore.
In order to deal with this I have thought of doing that : 1-reverse complement R2 2-trim the 3' end of R1 and R2 3- concatenate both reads using fastq_join from usearch. Then I would Blast my resulting reads using Usearch with the -lopen
parameter (to cancel gap penalty) in order to identify species present in my sample.
Do you think it might work?
I am now wondering if there might not be better options to do such things, that's why I am asking for help. Here is what I thought of :
Alternative 1 : Are there any function that can use read pairs (R1 + R2 files) instead of merged paired reads to associate reads to their corresponding taxonomy?
Alternative 2 : map the R1 and R2 reads to a reference sequence in order to concatenate them with the right number of Ns in between before using usearch for example.
Thank you very much in advance for your help!
Adrien
You could have a look at BBSplit, which is part of the BBTools (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/), for taxonomy identification of the paired-end reads.
It performs a binning of the reads, according to the reference a read maps best to.
With something like
you will get 2 fastq files for each of the provided references, containing the reads that matched best.
I am not sure I understood what more does this program than comparing R1 and R2 separately...
It does not consider R1 and R2 separately, paired reads are kept together during binning. If you provide it with paired-end reads it will map them as paired reads to all of the provided references and determine for each, which of the reference they map best to. According to this mapping to multiple references, the paired reads are split into bins with the best-matching reference (whereby paired reads are always kept together). I was referring to your Alternative 1, because it associates the reads individually with one of the provided reference. Of course, this is only helpful if you have some candidates with reference genomes you can provide. In case you need the associated reads for each species this one is preferable.
Otherwise, if your goal is to identify species that are present in your sample from a database of taxa I can recommend kraken2 (https://github.com/DerrickWood/kraken2/wiki). It uses databases to perform taxonomic classification to assign labels to sequencing reads, designed for metagenomics analyses. You can provide it directly with the paired-end reads, without the concatenation with Ns between the reads and the pairing information is still considered internally. In combination with Bracken (https://ccb.jhu.edu/software/bracken/) it can also be used to produce abundance estimates. Using the --classified-out and --unclassified-out flags you can also save the reads according to classification status, for later processing.
thank you Lily, I will try what you propose :)