Clarification on conceptual question regarding ORF-calling
19 months ago
Daniel ▴ 30

Hello,

I have a conceptual question that I think I may know the answer to, but I would appreciate feedback. There are many ORF-calling tools, and many of them take a FASTA file as input (orfipy, for example). My question is: how do these tools use FASTA files as input when those files often contain sequencing reads that have not yet been aligned?

Aligners such as STAR do not output FASTA files, so why do these tools take a FASTA file rather than a BAM file? I assume they need the aligned sequence (not just a read) so that they can call ORFs across the entire length of a gene.

Thus, if I want to use these tools, should I figure out how to take my aligned output (probably the BAM file) and convert it into a FASTA file where each record is no longer a read but a transcript? I believe the FASTA file has to be a multi-FASTA file, but when I search for this format, it is not clear whether it is meant for storing aligned sequences or just sequences combined from multiple FASTA files.

Thank you!

ORFIPY ORF • 1.0k views
19 months ago
Mensur Dlakic ★ 28k

Most ORF tools work on assemblies, in which individual sequencing reads are joined by their overlaps into larger contigs. Short reads are not long enough to predict ORFs from. As for transcripts, they may be missing start or stop codons, or lacking introns in eukaryotes.
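
On the multi-FASTA question: a multi-FASTA file is simply one file holding several records, each a `>` header line followed by its sequence lines. It says nothing about alignment. A minimal sketch of reading such a file in plain Python follows; the contig names and sequences are made up for illustration, and a real pipeline would more likely use an established parser.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence may be wrapped over several lines; collect them all.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record multi-FASTA (names and sequences are invented).
example = StringIO(
    ">contig_1 made-up example\n"
    "ATGAAACCC\n"
    "GGGTTTTAA\n"
    ">contig_2 made-up example\n"
    "ATGCCCTAG\n"
)

records = list(read_fasta(example))
lengths = {header.split()[0]: len(seq) for header, seq in records}
print(lengths)
```

Each record here stands for one contig or transcript, which is the unit an ORF caller scans.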


Thanks for your response. So when we are aligning sequencing reads, are we creating an assembly? If yes, I assume then there's a way to turn alignment output back into fasta format?


Sequencing reads are typically aligned to either genome or transcriptome assemblies. Assemblies are already FASTA files, so no conversion is needed. Instead, we predict ORFs directly from those assembly files, and the alignment step is unnecessary.

"So when we are aligning sequencing reads, are we creating an assembly?"

Genomic assemblies are representations of genomic DNA sequences. The assemblies can be complete or not. Aligning sequencing reads to assemblies is done for different reasons, but most of them have nothing to do with ORF prediction.
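
To make "predict ORFs directly from the assembly" concrete, here is a deliberately simplified sketch of what an ORF scan over one contig looks like. It is not how orfipy or any particular tool is implemented. The ORF definition (ATG through the first in-frame stop), the `min_len` cutoff, and the function names are all assumptions for illustration; real tools offer alternative start codons, partial ORFs, and other options.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """Scan all six frames for ATG...stop ORFs of at least min_len nt.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    half-open, and relative to the strand that was scanned.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


# Made-up contig: 2 nt of padding, ATG, ten GCT codons, a stop, 2 nt padding.
contig = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
contig_orfs = find_orfs(contig)

# A made-up 12 nt fragment from inside the same ORF: no start, no stop.
fragment_orfs = find_orfs("GCTGCTGCTGCT")
print(contig_orfs, fragment_orfs)
```

The fragment case shows why short reads on their own are a poor input: a read taken from the middle of a gene usually contains neither the start nor the stop codon.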


I see. My rationale for wanting to use my aligned reads was that I wanted to predict ORFs on the real reads from my data (in case there are mutations), not on the assemblies they were mapped to. But given your answer, it seems the difference would be insignificant. Thank you!
