I'm building a de novo transcriptome assembly pipeline. One of the features I'm trying to implement is targeted assembly, where a user can provide reference FASTAs, and the pipeline will iteratively extract reads that map to it for separate assembly. For example, if a user is assembling a plant transcriptome and they provide reference chloroplast and mitochondrion genomes, the pipeline will map the RNA short-reads to the chloroplast and separate reads that align to their own file, and then do the same with the mitochondrion.
A problem I am having is that the mapper appears to not be specific enough. That is, after extracting reads which map to one organellar genome and then attempting to map the remaining reads to the other one, the mapper only returns a few (< 20) hits.
Is this low specificity something I can fix by tuning alignment parameters? Obviously there's some genetic overlap between chloroplast and mitochondrial DNA, but surely they can't be as similar as would be indicated by the mapping results.
For reference
- I am using STAR to index, map reads. Mostly default configuration.
- Reads are single or paired-end, and are thoroughly cleaned prior to mapping
- After mapping, 'extraction' occurs by parsing the resulting SAM file for unique hits
So you say "transcriptome" but then say users can provide reference "genomes". What exactly are you using for STAR; a genomic fasta + GTF subset to specific contigs? The whole genomic fasta with a subset GTF?
The devil's in the details here. STAR will attempt to align reads to the genome (including intergenic reads) if the transcriptome from the GTF does not contain a hit; so you could (and I suspect you do) lose reads in this fashion.
I would consider sticking all "references" together (lit: concatenate fasta and GTF); and then use Mapping Quality to determine 'unique' hits to one reference or another.
Sorry, realizing the ambiguity in the phrasing of my question. To be clear, mapping occurs prior to contig assembly, before anything with transcripts is done. When STAR is employed, I have a FASTQ of cleaned RNA short-reads/frags (100bp - 300bp) and the reference 'genome' FASTA to which I'm mapping them. I put 'genomic' in quotes because the FASTA is a big concatenated file containing all reference chloroplast genomes from Ensembl, similar to how your suggestion describes.
So I'm using STAR to try and identify short-reads that are, for example, from a chloroplast genome. Using the SAM it spits out, my program then moves the short-reads with a sufficient mapping score to a new FASTQ file containing only 'chloroplast' reads. This is my crude solution to targeted assembly.
The problem is that once I extract all 'chloroplast' reads from the short-read FASTQ, and then attempt to do another extraction using reference mitochondrion genomes, STAR appears to detect close to zero reads on the second go-around.
It's more than conceivable that at least one chloroplast genome in Ensembl has a mitochondrial contig in it, or mitochondrial contamination.
My recommendation above is to concatenate both chloroplast and mitochondrial fasta; map in multi-mapping mode, and assign reads based on hits + edit distance to chloroplast or mitochondrial sequence. This approach will force the aligner to choose competitively. Right now you could have low-homology chloroplast alignments pulling down what might be high-homology mitochondrial alignments.