Hi,
I'm interested in gene copy number and normally I calculate read coverage after aligning the reads to a reference, assembling the genome (using said reference), and compare coverage across gene A, versus coverage across gene B.
But I wondered if the process can be speeded up: Find the number of reads containing parts of gene Ain the contigs, without having to align to a reference and/or calculate coverage. If I only cared if genes A and B are in the data, Is there any software out there to help do this at all? (haven't found any so far). I'm in a case where someone sequenced the whole genome using illumina only to find in genes* A and B are present in the genome and their copy number. Going back to sequencing is out of the question (and out of my hands).
Cheers
- Obviously is not just two genes, but a list of genes of interest I need to find across some humongous number of genomes.
It's unclear what the question is that's you're attempting to answer. Some more clarity around why you want to know read abundance across many genomes, and what data you're using for this.
Are you using an existing public dataset? How related are the genomes? Are you trying to detect the presence of particular genes across a number of species? How are you planning to handle the fact that the sequence of genes in your sample species are different from your canonical reference? The align-to-nearby-reference->assemble->identify genes pipeline is designed to. If you want skip doing that, what level of gene homology do you require between samples/species?
If you already know your genes are essentially identical, then you can just align directly to your known genes or even do something like using Kraken2 where instead of identify which taxa a read belongs to, you use a custom database of all genes of interest in all the species you know the gene sequence for and classify reads directly to the genes.
If the genes are possibly more evolutionarily distant, then you need a way to identify them (hence the assembly pipeline you mentioned).