I have thousands of bacterial sequences (fastq format) from many different projects that I am trying to analyze together. This includes a mix of 454, single and paired Illumina data. In order to compare amongst them I have identified numerous overlapping regions by mapping them onto a single reference sequence (in Geneious). However, I am now looking for software that can help me easily align and trim these sequences down to these exact regions. I need to be able to output the trimmed sequences as fastq files for downstream analyses (DADA2).
Does anybody know of a program that can help? It needs to run on a Mac.
Thanks very much!
Take a look at programs mentioned in this answer: A: How to clean multiple protein sequences alignement in order to make a phylogenic See if they work on a mac.
Thanks for the reply. The three programs listed there (MEGA, trimAl and Gblocks) all seem to accept only FASTA files, whereas all of my files are in FASTQ format.
For that to work you are going to need to convert your data to aligned fasta format, unless Geneious can export a consensus based on the alignments you already have (not sure if you want to do that).
Thanks again for the reply. The problem is that I need to process the sequences through DADA2, which requires the quality scores that only the FASTQ format retains. If I export to FASTA I lose them, and with them my ability to process the sequences. I can align, edit and trim the sequences within Geneious, but unfortunately it only allows me to export FASTA files.
If you are planning to use DADA2 then that is a separate application. I don't know if you added this information afterwards to the original question.
I don't think DADA2 requires trimmed sequences. The only requirements noted are:
Samples have been demultiplexed, i.e. split into individual per-sample fastq files.
Non-biological nucleotides have been removed, e.g. primers, adapters, linkers, etc.
If paired-end sequencing data, the forward and reverse fastq files contain reads in matched order.
You're correct, I added it in after you commented, to be more specific. I have already processed the untrimmed sequences through DADA2 for each project separately. What I was hoping to do was combine the data from multiple studies that used different primers, with the goal of finding bacteria common to multiple projects. However, in order to identify any, the FASTQ sequences need to start from the exact same location/base pair, which is why I mapped them onto a reference sequence in Geneious to determine potential overlapping regions.
Thanks for your help.
I doubt there is anything off the shelf that would allow you to trim the reads based on your alignment. You may need to parse each read alignment from the file and then use the CIGAR info to trim the original reads. This will not be a trivial undertaking.
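To make the CIGAR idea a bit more concrete, here is a rough sketch in plain Python. It assumes the mapping can be exported as SAM (so each record carries POS, CIGAR, SEQ and QUAL, with QUAL actually stored rather than `*`) and that the target region is given as 1-based inclusive reference coordinates. The function names `trim_to_region` and `to_fastq` are made up for illustration and are not part of any existing tool. Note that SAM stores SEQ and QUAL in reference orientation, so the trimmed reads all come out on the same strand.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def trim_to_region(pos, cigar, seq, qual, region_start, region_end):
    """Hypothetical helper: cut a mapped read down to a reference interval.

    pos is the 1-based leftmost aligned reference position (SAM POS field);
    region_start and region_end are 1-based and inclusive.
    Returns (seq, qual) for the part of the read aligned inside the region,
    or None if the read does not span the whole region.
    """
    ref = pos            # reference coordinate of the next aligned base
    read = 0             # 0-based index into seq/qual
    first = last = None  # read indices of the first/last base kept
    for count, op in CIGAR_OP.findall(cigar):
        length = int(count)
        if op in "M=X":      # consumes both reference and read
            lo = max(ref, region_start)
            hi = min(ref + length - 1, region_end)
            if lo <= hi:     # this block overlaps the region
                if first is None:
                    first = read + (lo - ref)
                last = read + (hi - ref)
            ref += length
            read += length
        elif op in "IS":     # insertion / soft clip: read only
            read += length
        elif op in "DN":     # deletion / skip: reference only
            ref += length
        # H and P consume neither the read nor the reference
    aligned_end = ref - 1
    if pos > region_start or aligned_end < region_end or first is None:
        return None
    # Slicing by read index keeps insertions that fall inside the region.
    return seq[first:last + 1], qual[first:last + 1]

def to_fastq(name, seq, qual):
    """Format one four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"

# Example: a read with 2 soft-clipped bases mapped at reference position 5,
# trimmed to reference positions 7-10.
trimmed = trim_to_region(5, "2S10M", "AACGTACGTACG", "!!ABCDEFGHIJ", 7, 10)
if trimmed is not None:
    print(to_fastq("read1", *trimmed), end="")
```

Reads that do not cover the whole region are dropped here rather than kept partially, since the goal is for every sequence to start and end at the same base. Paired reads would need extra care so that the forward and reverse files stay in matched order.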
I guess DADA2 is not allowing you to mix data from different sequencers? Otherwise as long as it meets the requirements above you could try that?
Take a look at http://guidance.tau.ac.il/ to see if it works for what you need.
Thanks very much for your reply, but this website only takes FASTA sequences, and all of mine are in FASTQ format.