Hello everyone! I am performing a de novo genome assembly of a Prunus spp. chloroplast, starting from SRA datasets.
My sequencing data are Illumina HiSeq 2500 paired-end reads and ONT long reads from the same species.
The final goal is to evaluate the performance of different assembly strategies and to get the best assembly. Our instructor told us to extract the chloroplast reads from the sequencing data by mapping them to the chloroplast genome of a single species belonging to the same genus. However, I realized that in some cases there is structural variation even between species of the same genus. Therefore, to avoid biasing the read extraction, I decided to map the FASTQ data against more than one reference. I mapped my data with Bowtie2 using an index built from 10 chloroplast genomes of Prunus spp. (I chose the most closely related ones based on phylogenetic studies and data availability). After this procedure, I got a good number of mapped reads, approx. 3,200,000, which corresponds to an estimated coverage of 4800x.
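For anyone checking the coverage figure: average depth is usually estimated as (number of reads x mean read length) / genome length. Below is a minimal Python sketch of that arithmetic. The mean read length (240 bp) and the genome length (160,000 bp) are assumptions made only for illustration; neither is stated in the post.

```python
def estimated_coverage(n_reads: int, mean_read_length: float, genome_length: int) -> float:
    """Average depth = total sequenced bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return n_reads * mean_read_length / genome_length


if __name__ == "__main__":
    # ASSUMED values, not taken from the post: 240 bp mean read length
    # (e.g. after trimming) and a 160,000 bp chloroplast genome.
    depth = estimated_coverage(3_200_000, 240, 160_000)
    print(f"Estimated coverage: {depth:.0f}x")
```

With a different read length or genome size the estimate scales linearly, so it is worth plugging in the real values from your own data.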
Here's the big question: I would like to use these mapped PE Illumina reads to perform scaffolding (using ABySS) and error correction of the long reads (ONT). I am not sure how to deal with read pairs whose two mates mapped to different reference sequences (chromosomes). They account for 10% of the total mapped PE reads. Do you think they might interfere with the downstream processes?
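One way to double-check a figure like this is to count, directly from the SAM output, the pairs whose mates sit on different references. The sketch below is only an illustration under some assumptions: it reads plain-text SAM lines, counts each pair once through its first mate (FLAG bit 0x40), skips secondary and supplementary records, and relies on the SAM convention that RNEXT is "=" when the mate is on the same reference. The function name and the example records are made up.

```python
def count_cross_reference_pairs(sam_lines):
    """Return (pairs_with_both_mates_mapped, pairs_on_different_references)."""
    total = 0
    split = 0
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        rname, rnext = fields[2], fields[6]
        # Keep only paired reads, and count each pair once via its first mate.
        if not flag & 0x1 or not flag & 0x40:
            continue
        # Skip unmapped reads, unmapped mates, secondary and supplementary records.
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        total += 1
        if rnext != "=" and rnext != rname:
            split += 1
    return total, split


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = [
        "@SQ\tSN:refA\tLN:1000\n",
        "r1\t99\trefA\t10\t42\t5M\t=\t100\t95\tACGTA\tIIIII\n",
        "r1\t147\trefA\t100\t42\t5M\t=\t10\t-95\tACGTA\tIIIII\n",
        "r2\t65\trefA\t20\t42\t5M\trefB\t50\t0\tACGTA\tIIIII\n",
        "r2\t129\trefB\t50\t42\t5M\trefA\t20\t0\tACGTA\tIIIII\n",
    ]
    total, split = count_cross_reference_pairs(demo)
    print(f"{split} of {total} mapped pairs span two references")
```

On a real file you would pass an open file handle instead of the list, for example the output of your mapping step saved as uncompressed SAM.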
Thank you for your attention,
Eisuan
Thanks a lot for your suggestions! I'll definitely give NOVOPlasty a try.
Regarding the mapping with Bowtie2: do you think I should map the reads again using a single species and the local mode? Or should I just switch to the local mode and keep my multiple references?
I recommend combining all references into one FASTA file and mapping the reads to it. I also recommend using BWA-MEM instead of Bowtie2, because BWA-MEM is slightly more accurate.
Perfect! My references are already concatenated in a single FASTA file. Now I am going to re-run Bowtie2 using the local mode. Then, with my output, should I keep all the mapped pairs or just the ones mapped concordantly to the same reference (i.e. the same chr)?
Which SAM flags would you use for filtering the reads from the output? I previously used -f 1 -F 12. Should I change -f 1 to -f 3?
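To make the difference between the two filters concrete, here is a small Python sketch that mimics the bit tests, assuming the -f/-F options are those of samtools view (-f keeps records with all the given bits set, -F drops records with any of them set). The three FLAG values are common textbook examples, not values from this dataset. As far as I know, Bowtie2 sets the "proper pair" bit (0x2) for pairs it considers concordant.

```python
PAIRED = 0x1          # read is part of a pair
PROPER_PAIR = 0x2     # aligner considers the pair properly (concordantly) aligned
READ_UNMAPPED = 0x4   # this read is unmapped
MATE_UNMAPPED = 0x8   # its mate is unmapped


def passes(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Mimic `-f require -F exclude`: all required bits set, no excluded bit set."""
    return (flag & require) == require and (flag & exclude) == 0


if __name__ == "__main__":
    examples = {
        99: "first mate of a proper pair",
        65: "first mate, both mates mapped, not a proper pair",
        73: "first mate, mate unmapped",
    }
    exclude = READ_UNMAPPED | MATE_UNMAPPED            # 12
    for flag, description in examples.items():
        keep_f1 = passes(flag, PAIRED, exclude)                  # -f 1 -F 12
        keep_f3 = passes(flag, PAIRED | PROPER_PAIR, exclude)    # -f 3 -F 12
        print(f"{flag:>3}  {description:<50} -f1: {keep_f1}  -f3: {keep_f3}")
```

In other words, -f 1 -F 12 keeps every pair with both mates mapped, while -f 3 -F 12 additionally drops the pairs that are not flagged as proper, which includes the ones split across two references.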
(I am sorry for all these trivial questions. I am a Master's student with no prior experience in the matter and I would like to better understand the topic)
I think it's better to keep the reads that align discordantly too, because if your species has a structural variant that none of the reference species has, reads around this variant may always align discordantly.
Instead of using SAM flags, I think it's simpler to use the following procedure: map only the first reads of the read pairs (as unpaired input) and save all the reads that align using Bowtie2's --al option. Then retrieve the second reads that correspond to these first reads using a custom script.
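The custom script could look roughly like the sketch below. It is only one possible version and makes several assumptions: the files are plain (uncompressed) FASTQ with exactly four lines per read, mates share the same name up to an optional /1 or /2 suffix, and the set of names fits in memory. All function names and file names are hypothetical.

```python
def read_name(header: str) -> str:
    """Normalise a FASTQ header: drop '@', any comment, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped FASTQ)."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0].strip():   # end of file or trailing blank line
            return
        yield record


def extract_mates(aligned_r1_path: str, all_r2_path: str, out_r2_path: str) -> int:
    """Write the R2 reads whose names occur in the aligned R1 file; return the count."""
    with open(aligned_r1_path) as r1:
        wanted = {read_name(record[0]) for record in fastq_records(r1)}
    kept = 0
    with open(all_r2_path) as r2, open(out_r2_path, "w") as out:
        for record in fastq_records(r2):
            if read_name(record[0]) in wanted:
                out.writelines(record)
                kept += 1
    return kept


if __name__ == "__main__":
    # Hypothetical file names: the --al output for R1, the full R2 file, and the result.
    n = extract_mates("aligned_R1.fastq", "all_R2.fastq", "aligned_R2.fastq")
    print(f"Wrote {n} second reads")
```

The second reads are written in the order of the original R2 file, so before using the two files as pairs it is worth checking that the aligned R1 file is in the same order.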
Anyway, for a reference-based assembly it would be even simpler to use NOVOPlasty or GetOrganelle.