I want to know the copy number of all ORFs in my genome. Does it make sense to map genomic reads to these ORFs using either BWA or Bowtie, and then quantify FPKM values with cufflinks?
I expect most ORFs to be present in one copy, so they should have about the same FPKM. I can then normalize all FPKM values to the value corresponding to a single-copy ORF and see how many copies the other ORFs have.
Does this make sense? Any pitfalls? Should I use BWA or Bowtie for this?
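The normalization described above can be sketched roughly as follows. This is only an illustration, assuming the median FPKM across ORFs is a fair stand-in for the single-copy level (reasonable only if most ORFs really are single copy). The function name `estimate_copy_numbers` and the FPKM values are hypothetical, not output from any real run.

```python
from statistics import median

def estimate_copy_numbers(fpkm_by_orf):
    """Divide each ORF's FPKM by the median positive FPKM.

    Assumption: most ORFs are single copy, so the median approximates
    the FPKM of a single-copy ORF.
    """
    positive = [v for v in fpkm_by_orf.values() if v > 0]
    if not positive:
        raise ValueError("no ORF has a positive FPKM")
    baseline = median(positive)
    return {orf: v / baseline for orf, v in fpkm_by_orf.items()}

# Hypothetical FPKM values, for illustration only
fpkm = {"orf1": 10.0, "orf2": 9.5, "orf3": 10.5, "orf4": 20.4, "orf5": 0.0}
copies = estimate_copy_numbers(fpkm)
for orf, ratio in copies.items():
    # Print the raw ratio and the nearest whole copy number
    print(orf, round(ratio, 2), round(ratio))
```

The ratios are only as good as the baseline: uneven coverage would shift them, which is the kind of bias discussed in the answers.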
I think your strategy sounds similar to the one used in CoNIFER (on a BWA alignment), except that CoNIFER has an extra (important) step that uses SVD to correct for biases in coverage.
Either way, CoNIFER has an RPKM function, so that can potentially make your life easier. However, I'm not sure how comfortable I'd be with copy number calls on a single sample (where you can't apply the SVD step).
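For reference, RPKM itself is simple arithmetic: reads per kilobase of region per million mapped reads. The sketch below is a generic version of that formula, not CoNIFER's own code; the function name `rpkm` and the numbers in the example are made up for illustration.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)

# Hypothetical ORF: 500 reads on a 1000 bp region, 10 million mapped reads in total
value = rpkm(500, 1000, 10_000_000)
print(value)
```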
CoNIFER is the right tool to account for sequence alignment biases. Some regions yield reads more readily than others and will appear multicopy if you don't correct for this.
Firstly, unless you're working in a prokaryote, only looking at chromosome X or Y (or the equivalent for your organism), or working on single-cell sequencing, you should normally expect 2 copies of an allele.
Secondly, why would you want to shoe-horn cufflinks into an analysis for which there are already numerous pre-made programs? CNVnator is the first example that comes to mind, but there are a LOT of packages out there. Have a read through this paper for a relatively recent overview of what's out there.
You only expect 2 copies of an allele when the organism is diploid. Even if you are working on single cell sequencing, unless the cell is a gamete, you expect 2 copies if the organism is diploid.
However, the good news is that the 2 alleles are usually not very divergent, so in most cases, if you relax the stringency settings in read mapping, reads from both alleles will map to either one of them, and thus that gene would be considered single copy.
Why do I want to use cufflinks? Because I already know what the input needs to be (the .bam file) and what the output will look like: my FASTA headers, which are the ORFs, each with an FPKM value beside it.
The solution you proposed is poorly documented and seems to measure copy number across specified regions. I am only interested in feeding it the FASTA file containing all ORFs and getting back the copy numbers.
Yeah, the single-cell thing was a silly mistake on my part (I had RNA-seq on the brain, perhaps because of the mention of ORFs), mea culpa.
CNVnator was just an example of one solution, which happens to be geared more toward whole-genome analyses. Mapping reads that arise from the whole genome (or even the whole exome or similar targeted regions) to only the ORFs is not a great idea, if that's what you're proposing, because it will lead to biased mappings. Normally you would just map genomic reads to the whole genome and call CNVs based on that (whether you end up using a genomic-window method like CNVnator or another is up to you).
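If you map to the whole genome and still want a per-ORF number, one rough option is to average the per-base depth over each ORF's coordinates. The sketch below assumes three-column, tab-separated depth lines (chromosome, 1-based position, depth), which is the layout `samtools depth` writes, and ORF coordinates given as 1-based inclusive intervals. The function name `mean_depth_per_orf` and all values are hypothetical, and this is not a substitute for a proper CNV caller.

```python
from collections import defaultdict

def mean_depth_per_orf(depth_lines, orfs):
    """Average per-base depth over each ORF.

    depth_lines: iterable of "chrom<TAB>pos<TAB>depth" strings (pos is 1-based).
    orfs: dict mapping ORF name -> (chrom, start, end), 1-based inclusive.
    Positions missing from the input count as zero depth, because the
    sum is divided by the full ORF length.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue
        chrom, pos, depth = line.rstrip("\n").split("\t")[:3]
        pos, depth = int(pos), int(depth)
        # Naive scan over all ORFs; fine for a sketch, slow for a real genome
        for name, (orf_chrom, start, end) in orfs.items():
            if orf_chrom == chrom and start <= pos <= end:
                totals[name] += depth
    return {
        name: totals[name] / (end - start + 1)
        for name, (orf_chrom, start, end) in orfs.items()
    }

# Hypothetical depth lines and ORF coordinates
lines = ["chr1\t1\t10", "chr1\t2\t10", "chr1\t3\t20", "chr1\t4\t20"]
orfs = {"orfA": ("chr1", 1, 2), "orfB": ("chr1", 3, 4), "orfC": ("chr1", 5, 6)}
depths = mean_depth_per_orf(lines, orfs)
print(depths)
```

The resulting mean depths could then be divided by the depth of a presumed single-copy ORF, with the same caveats about coverage bias raised above.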