I have Illumina paired-end whole-genome sequencing reads which I have to map to around 400 reference plastid genomes. After getting mapped reads, I have to assemble a de novo plastid genome.
1. Do I have to map reads to individual reference genomes one by one, or can I download all genomes in one go and index them as one reference genome? Do bwa or bowtie have enough memory to index 400 genomes as one reference index?
2. Which method do you think is best: mapping to each genome individually, or to all genomes indexed as one?
3. If I have to map individually, can I combine all the BAM files and convert them to FASTQ using the bam2fastq tool (in Picard) for de novo assembly?
I have paired-end genomic reads obtained from a HiSeq 2000. My aim is to extract plastid reads from the genomic reads by mapping to plastid genomes. Since I don't know which reference is closest to my sample, I am planning to map my reads to all plastid genomes available in the NCBI organelle genome resource (https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?opt=plastid&taxid=33090). That way I can extract the plastid reads from my genomic reads. These extracted plastid reads, obtained in BAM format, will be converted to FASTQ for a separate de novo plastid assembly.
I don't fully understand your introductory sentence as stated, but I can answer your questions to the best of my understanding:
You can concatenate all of your genomes into one large file and then index that composite genome. Just make sure your naming convention for each component in the FASTA is logical so that you can understand your results downstream. bwa and bowtie don't "have their own memory," but if you're using a 64-bit system you shouldn't run into any size limitation troubles, especially with plastid sequences.
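A minimal sketch of the concatenation step, in case it helps. It assumes plain, uncompressed FASTA inputs; the function name and the choice to keep only the first whitespace-delimited token of each header (usually the accession) are my own illustration, not something bwa or bowtie require. It stops if two sequences end up with the same name, since that would make the results ambiguous downstream.

```python
def build_composite_reference(fasta_paths, out_path):
    """Concatenate FASTA files into one reference with short, unique names.

    Hypothetical helper: keeps the first token of each header as the
    sequence name and raises ValueError on an empty or duplicate name.
    Returns the sequence names in the order they were written.
    """
    names = []
    seen = set()
    with open(out_path, "w") as out:
        for path in fasta_paths:
            with open(path) as handle:
                for raw in handle:
                    line = raw.strip()
                    if not line:
                        continue  # skip blank lines between records
                    if line.startswith(">"):
                        fields = line[1:].split()
                        if not fields:
                            raise ValueError(f"empty FASTA header in {path}")
                        name = fields[0]
                        if name in seen:
                            raise ValueError(f"duplicate sequence name: {name}")
                        seen.add(name)
                        names.append(name)
                        out.write(f">{name}\n")
                    else:
                        out.write(line + "\n")
    return names
```

The resulting file (say plastid.fa) can then be indexed as usual, for example with "bwa index plastid.fa" or "bowtie-build plastid.fa plastid".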
EDIT: I say map to a single composite reference. You can pass parameters into bowtie to limit your mappings on the front end, thus saving computational time and making it easier to isolate the most accurate mappings for each read.
You can merge BAM files using samtools merge, but of course you wouldn't need to if you proceed according to my recommendation. There are a number of tools for converting from BAM back to FASTQ, and the Picard tool should work just fine. It does have trouble with paired-end mappings in certain circumstances though, and if you run into trouble using Picard I'd suggest bedtools bamtofastq.
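Here is a rough sketch of the extraction and conversion steps as a list of commands, assuming samtools and bedtools are installed. The file names and the helper function are hypothetical. The flag filter is one possible choice: it keeps only pairs in which both mates mapped and drops secondary and supplementary records, so pairs with a single mapped mate are lost. bedtools bamtofastq expects paired reads to be adjacent in the BAM, hence the name sort.

```python
import shlex

# SAM flag bits (from the SAM specification)
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def plastid_reads_to_fastq_commands(in_bam, prefix):
    """Return the commands (as argv lists) to go from a BAM to paired FASTQ.

    Hypothetical helper; it only builds the commands and does not run them.
    """
    # samtools view -F drops any record with at least one of these bits set
    exclude = UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY
    paired_bam = f"{prefix}.paired.bam"
    sorted_bam = f"{prefix}.namesorted.bam"
    return [
        # 1. keep records where the read and its mate are both mapped
        ["samtools", "view", "-b", "-F", str(exclude), "-o", paired_bam, in_bam],
        # 2. sort by read name so that mates sit next to each other
        ["samtools", "sort", "-n", "-o", sorted_bam, paired_bam],
        # 3. write mate 1 and mate 2 to separate FASTQ files
        ["bedtools", "bamtofastq", "-i", sorted_bam,
         "-fq", f"{prefix}_1.fastq", "-fq2", f"{prefix}_2.fastq"],
    ]


for cmd in plastid_reads_to_fastq_commands("mapped.bam", "plastid"):
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once you are happy with the file names.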
Is the idea here that you're going to map to a bunch of plastid reference sequences from various organisms, and then convert the aggregate mappings back to FASTQ and perform an assembly from them? If so, I say concatenate the reference sequences into one file and map against that. That way it will be easier to retain the best mappings up front, especially if you don't care about which reference you're mapping to.
Thanks for your answer. I can download all the genomes as one file through Batch Entrez. Can I use this as one single reference genome (plastid.fa)? Then I can map the reads to plastid.fa and get the mapped reads using the samtools -F option. Sorry, I don't understand "You can pass parameters into bowtie to limit your mappings on the front end, thus saving computational time". How can I do that?
What I mean is that you can supply specific instructions to bowtie about how it should handle the reads. I hate to sound trite, but definitely take a look at the manual. Bowtie is an incredibly powerful piece of software, and if you understand the variety of options you have at your disposal, you can save yourself a lot of time during downstream filtering and processing of the BAMs. For example, you can use the -k parameter to report at most k valid alignments per read. Or you can use -M so that a read with more than M reportable alignments gets one of them reported at random. Or you can use --best together with --strata to report only the alignments that fall into the best stratum for each read.
If you run bowtie with the default parameters, you might end up with a bunch of mappings that you will want to discard later. It's much easier to remove them up front via bowtie parameters, and in some cases it will take substantially less time to create the BAMs.
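To make that concrete, here is a small sketch that builds a bowtie command line with those reporting options. It assumes the original bowtie (bowtie 1), not bowtie2, whose options differ. The function, its argument names, and the file names are hypothetical, and accepting only one of -k or -M at a time is a simplification of this sketch rather than a bowtie rule. Check the manual for how --best and --strata interact with paired-end input before relying on them.

```python
import shlex


def bowtie_command(index, reads_1, reads_2, out_sam,
                   k=None, big_m=None, best_strata=False, threads=1):
    """Build a bowtie 1 paired-end command as an argv list (not executed)."""
    if k is not None and big_m is not None:
        raise ValueError("this sketch accepts either k or big_m, not both")
    cmd = ["bowtie", "-S", "-p", str(threads)]  # -S: SAM output, -p: threads
    if k is not None:
        cmd += ["-k", str(k)]      # report at most k valid alignments per read
    if big_m is not None:
        cmd += ["-M", str(big_m)]  # above M alignments, report one at random
    if best_strata:
        cmd += ["--best", "--strata"]  # --strata requires --best
    cmd += [index, "-1", reads_1, "-2", reads_2, out_sam]
    return cmd


print(shlex.join(bowtie_command("plastid", "reads_1.fastq", "reads_2.fastq",
                                "mapped.sam", k=1, threads=4)))
```

Since you only want to know whether a read hits any plastid genome, one reported alignment per read is usually enough, which keeps the output small.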
Written 10.7 years ago by Dan D; updated 5.2 years ago by Ram.
I'm a little confused by your strategy. I'm not sure why you are mapping and then, "after getting mapped reads", you "have to assemble" a de novo plastid genome. What are you trying to do exactly: map to existing resources, or assemble a plastid genome? It sounds like you are confused.
You can indeed use bwa-mem or bowtie to map to a large reference, but I am not sure if this is what you want to do with your data. I'm just confused by your strategy. You want to assemble the BAM files into what?
I have paired-end genomic reads obtained from a HiSeq 2000. My aim is to extract plastid reads from the genomic reads by mapping to plastid genomes. Since I don't know which reference is closest to my sample, I am planning to map my reads to all plastid genomes available in the NCBI organelle genome resource (https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?opt=plastid&taxid=33090). That way I can extract the plastid reads from my genomic reads. These extracted plastid reads, obtained in BAM format, will be converted to FASTQ for a separate de novo plastid assembly.
Thanks! Much more clear now.
I would pick your "close reference" by other means -- such as phylogenetics -- but let us know how your strategy goes.