Hello,
I have a conceptual question that I think I may have the answer to, but I would appreciate feedback. There are many ORF-calling tools out there, and many of them (such as orfipy) take a FASTA file as input. My question is: how do these tools use FASTA files as input when those files often contain sequencing reads that have not yet been aligned?
Since aligners such as STAR do not output FASTA files, why do these tools take a FASTA file and not a BAM file? I assume they need the aligned sequence (not just individual reads) so that they can call ORFs across the entire length of a gene.
Thus, if I want to use these tools, should I figure out how to take my aligned output (probably the BAM file) and convert it into a FASTA file where each entry is no longer a read but a transcript? I believe the FASTA file has to be a multi-FASTA file, but when I search for this format, it is not clear whether it is meant for storing aligned sequences or just sequences combined from multiple FASTA files.
Thank you!
Thanks for your response. So when we align sequencing reads, are we creating an assembly? If so, I assume there is a way to turn the alignment output back into FASTA format?
Sequencing reads are typically aligned to either genome or transcriptome assemblies. Assemblies are already FASTA files, so there is no need for conversion. Instead, we predict ORFs directly from those assembly files, and the alignment step is unnecessary for that purpose.
Genomic assemblies are representations of genomic DNA sequences. The assemblies can be complete or not. Aligning sequencing reads to assemblies is done for different reasons, but most of them have nothing to do with ORF prediction.
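To make the point about predicting ORFs directly from an assembly FASTA more concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not how orfipy or any other tool is implemented: it scans the forward strand only, treats an ORF as ATG through the first in-frame stop codon, and the function names (`read_fasta`, `find_orfs`), the `min_len` parameter, and the example records are all made up for this post. It also shows that a multi-FASTA file is simply several `>header` plus sequence records in one file, with no alignment information involved.

```python
from io import StringIO

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(handle):
    """Yield (name, sequence) pairs from a (multi-)FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks).upper()
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()


def find_orfs(seq, min_len=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based with an exclusive end, and the stop codon
    is included in the reported span. Hypothetical helper, forward
    strand only.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Made-up two-record multi-FASTA; in practice you would open the
# transcriptome or genome assembly file instead of using StringIO.
example = StringIO(
    ">tx1 hypothetical transcript\n"
    "CCATGAAATT\n"
    "TGGGTAACC\n"
    ">tx2\n"
    "ATGCCCTGA\n"
)

for name, seq in read_fasta(example):
    for start, end, frame in find_orfs(seq, min_len=9):
        print(name, start, end, frame, seq[start:end])
```

Note that the input here is the reference-style sequence itself (one record per transcript or contig), not reads and not a BAM file. For real work a dedicated tool such as orfipy is the better choice; this sketch is only meant to show why a FASTA file is all such a tool needs.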
I see. My rationale for wanting to use my aligned reads was that I wanted to predict ORFs on the real reads from my data (in case there are mutations), not on the assemblies they were mapped to. But given your answer, it seems like there'd be an insignificant difference. Thank you!