This is my first time working with RNA data. So far, I've run the dataset through trimmomatic, sortmeRNA, megahit and BWA. I'm trying to run Salmon using the SAM file output from BWA and the .fa resulting from megahit as the transcriptome.
I run this line:
./salmon-1.8.0_linux_x86_64/bin/salmon quant -p 12 -t Sample1_megahit.contigs.fa -l A -a Sample1_megahit.annotation_bwa.sam -o Sample1_salmon
I get this error at the end of a stream of lines that all say a variation of "this transcript not found in reference" and I'm not sure what reference it's referring to: Please provide a reference FASTA file that includes all targets present in the BAM header.
Should I be passing the unassembled transcriptome from before megahit or something? The megahit file filtered to have only transcripts that had successful BWA alignments? I'm not sure how to do that. The data was originally paired end, if that is relevant.
Thanks in advance!
Which fasta did you use for BWA alignment?
For BWA, I ran:
bwa index -a bwtsw microbial_all_cds.fasta
Followed by:
bwa mem -t 32 microbial_all_cds.fasta Sample1_megahit.contigs.fa > Sample1_megahit.annotation_bwa.sam
Salmon is telling you that the names of the contigs in
Sample1_megahit.contigs.fa
are not the same as the names of the contifg inSample1_megahit.annotation_bwa.sam
What could be causing this? Could it be because there are many contigs in Sample1_megahit.contigs.fa that didn't get annotated, and therefore aren't present in Sample1_megahit.annotation_bwa.sam?
In your comment to @Shred you say that you aligned to
microbial_all_cds.fasta
. If that's the case, then you must passmicrobial_all_cds.fasta
to salmon.Where? If I run
Then I encounter errors that read " Transcript appears twice in the transcript FASTA file" and "Transcript appears in the reference but did not appear in the BAM."
This means that you have multiple entires in your FASTA file that have the same name, which isn't allowed.
How do I resolve this? Is there some way to remove duplicates? I don't think I had a specific step to dereplicate sequences in my pipeline, actually. Could that have caused this problem?
Are you sure
megahit
is appropriate to use with RNAseq data? It appears to be a genome assembler.I have seen it used in a few metatranscriptomics studies, but if it's likely to be the cause of this issue, I can try a different assembler.