Question

Mapping metagenomics samples to multiple reference genomes

9.2 years ago

Quak ▴ 520

I have two simple questions about mapping metagenomics samples to multiple references. Sorry if they are too basic.

1) The alignment tools I have looked at (bwa, soap, mosaik ...) take a single FASTA file as the reference argument. How should I feed these tools when the reference consists of multiple organisms? (By the way, which aligner do you recommend for bacterial genomes?)

2) For each sample, I have a set of paired-end reads as well as single-end reads from different sequencing runs. Since the above-mentioned tools take either one file of single-end reads or two files of mates, what should my input be? Should I a) pool all reads into one huge FASTA file, b) pool all forward and all reverse reads into two big files, forward.fq and reverse.fq, and then map (and what about the single reads?), or c) map each pair of read files separately and combine the BAM files afterward?

Thanks, and sorry for the trivial questions.

metagenomics alignment sequencing • 4.6k views

updated 2.4 years ago by Ram 44k • written 9.2 years ago by Quak ▴ 520

Ram · Answer 1 · 2015-10-14

9.2 years ago

Brian Bushnell 20k

You can concatenate references together into a big file for mapping. Alternatively, if you want to separate the reads based on which organism they map to, you can keep the references separate and map using BBSplit or Seal, which accept multiple references as arguments.
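A minimal sketch of the concatenation route, assuming plain uncompressed FASTA files. The function name `merge_references` and the `|` separator are made up for illustration and are not part of any of the tools mentioned here. Each header gets the file stem as a prefix, so sequence names stay unique and the organism can be read back from the reference name after mapping.

```python
from pathlib import Path


def merge_references(fasta_paths, merged_path, sep="|"):
    """Concatenate FASTA files, prefixing every header with its file stem.

    Returns the sorted list of sequence IDs written to merged_path.
    Raises ValueError if two sequences would end up with the same ID.
    """
    seen = set()
    with open(merged_path, "w") as out:
        for path in map(Path, fasta_paths):
            label = path.stem  # e.g. "orgA" for "orgA.fa" (hypothetical file name)
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        name = label + sep + line[1:].strip()
                        seq_id = name.split()[0]  # mappers use the first word as the ID
                        if seq_id in seen:
                            raise ValueError(f"duplicate sequence name: {seq_id}")
                        seen.add(seq_id)
                        out.write(">" + name + "\n")
                    else:
                        # Make sure a file without a trailing newline does not
                        # run into the next file's header.
                        out.write(line if line.endswith("\n") else line + "\n")
    return sorted(seen)
```

The merged file can then be indexed and used like any single-genome reference, for example with `bwa index merged.fa`.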

You should map in two phases - one with all the paired reads, and one with all the single-ended reads. For example, you can concatenate all the forward reads into a single file, and all of the reverse reads into another single file (in the same order), then use those two files as input for the paired mapping.
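Here is a rough sketch of the "same order" concatenation for the paired phase. It assumes uncompressed four-line FASTQ records and that every input file ends with a newline; `concat_paired_runs` is a hypothetical helper, not part of any mapper. It checks that each run has equally many forward and reverse records before writing anything, since a mismatch would shift every later pair.

```python
import shutil


def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file with 4 lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def concat_paired_runs(runs, fwd_out, rev_out):
    """Append each (forward, reverse) run, in the same order, to two files.

    runs is a list of (forward_path, reverse_path) tuples.
    Returns the total number of read pairs written.
    """
    # First pass: validate every run so nothing is written on a mismatch.
    total = 0
    for fwd_path, rev_path in runs:
        n_fwd = count_fastq_records(fwd_path)
        n_rev = count_fastq_records(rev_path)
        if n_fwd != n_rev:
            raise ValueError(f"{fwd_path} has {n_fwd} reads, {rev_path} has {n_rev}")
        total += n_fwd
    # Second pass: copy the files, keeping the run order identical on both sides.
    with open(fwd_out, "wb") as fwd, open(rev_out, "wb") as rev:
        for fwd_path, rev_path in runs:
            for src, dst in ((fwd_path, fwd), (rev_path, rev)):
                with open(src, "rb") as handle:
                    shutil.copyfileobj(handle, dst)
    return total
```

The single-ended reads from all runs can be concatenated into a third file for the second mapping phase; no ordering constraint applies there.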

updated 2.4 years ago by Ram 44k • written 9.2 years ago by Brian Bushnell 20k

If I concatenate the references into one big file, will I still have this information (which organism each read mapped to) in my final alignment? Can you also point me to the BBSplit publication? Thanks.

9.2 years ago by Quak ▴ 520

BBSplit is not published, but the usage is described in this thread.

If you merge references together, no information is lost as long as all of the sequences have unique names.
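To illustrate why nothing is lost: every alignment record in SAM output carries the name of the reference sequence it hit (the RNAME column), so reads can be tallied per organism afterwards. The sketch below assumes, purely for illustration, that each reference name was given an organism prefix followed by `|` before merging; with other naming schemes you would need a lookup table from sequence name to organism instead. `reads_per_organism` is a made-up helper name.

```python
from collections import Counter


def reads_per_organism(sam_lines, sep="|"):
    """Count primary mapped alignments per organism prefix in SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            continue  # read is unmapped
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) record
        counts[rname.split(sep, 1)[0]] += 1
    return counts
```

For a BAM file, the same lines can be obtained by piping `samtools view` output into the function.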

9.2 years ago by Brian Bushnell 20k

Thanks. Does it use a global alignment (Needleman-Wunsch) or a local alignment? Do you think it would make a difference for bacterial genomes?

9.2 years ago by Quak ▴ 520

BBSplit uses global alignments, and yes, it does make a difference - but as for which is better, that depends on the specifics of the situation. I generally favor global alignments but neither is universally better.
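For readers unfamiliar with the distinction, here is a toy sketch of the textbook definitions with a simple linear gap penalty. It is not how BBSplit or any other mapper is implemented, and the scoring values are arbitrary assumptions. A global (Needleman-Wunsch) score charges for every unaligned base, while a local (Smith-Waterman) score keeps only the best-matching stretch.

```python
def align_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Return the optimal alignment score of strings a and b.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of substrings.
    """
    # Row 0 of the DP table: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# The read matches perfectly inside a longer sequence:
print(align_score("ACGT", "TTACGTTT", local=True))   # 4
print(align_score("ACGT", "TTACGTTT", local=False))  # 0 (4 matches, 4 gaps)
```

In practice the difference shows up mostly at read ends: a local aligner can clip a poorly matching end, whereas a global one has to account for it in the score.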

updated 5.1 years ago by Ram 44k • written 9.2 years ago by Brian Bushnell 20k

score 1 · Answer 2 · 2015-10-15

You might want to consider the RTG metagenomics tools (they are free for non-commercial use as part of RTG Core Non-Commercial). RTG's approach is to use a reference database containing all the species you are interested in; that database can also include the taxonomic relationships between the reference species. RTG provides a standard pre-built species reference and also includes tools for subsetting by taxon, extracting individual species genomes, building your own species reference from scratch, etc. Alignment and composition analysis are then performed with respect to that species reference and can incorporate the taxonomic information into the analysis. In your case, you would align in two passes: once for the paired-end reads and once for the single-end reads.