Hi all,
Thank you so much for the suggestions!
I initially considered assembling the genome, but since I only have short-read data, I thought it might be tricky. Instead, I used a mapping-based approach where I created a "custom reference genome" by adding my three heterologous genes as separate contigs. My workflow looks like this:
# Step 1: Index the custom reference genome
bwa index custom_reference.fasta
# Step 2: Align paired-end reads to the reference genome
bwa mem custom_reference.fasta R1.fq R2.fq > aligned.sam
# Step 3: Convert SAM to sorted/indexed BAM
samtools view -Sb aligned.sam > aligned.bam
samtools sort aligned.bam -o aligned.sorted.bam
samtools index aligned.sorted.bam
# Step 4: Calculate per-base coverage
samtools depth aligned.sorted.bam > depth.txt
# Step 5: Calculate coverage for specific regions
bedtools coverage -a heterologous_genes.bed -b aligned.sorted.bam > heterologous_coverage.tsv
bedtools coverage -a single_copy_genes.bed -b aligned.sorted.bam > single_copy_coverage.tsv
Finally, I used a small Python script to estimate copy number by taking the average coverage of each heterologous gene and dividing it by the mean coverage of known single-copy endogenous genes:
heterologous_file, single_copy_file = sys.argv[1], sys.argv[2]
colnames = ["chr", "start", "end", "gene", "bases_covered", "length", "region_length", "coverage_fraction"]
hetero_df = pd.read_csv(heterologous_file, sep="\t", header=None, names=colnames)
single_df = pd.read_csv(single_copy_file, sep="\t", header=None, names=colnames)
hetero_mean_cov = hetero_df.groupby("gene")["bases_covered"].mean()
single_mean_cov = single_df.groupby("gene")["bases_covered"].mean()
mean_single_cov = single_mean_cov.mean()
copy_number_df = pd.DataFrame({
"heterologous_gene": hetero_mean_cov.index,
"avg_coverage": hetero_mean_cov.values,
"estimated_copy_number": hetero_mean_cov.values / mean_single_cov
})
copy_number_df.to_csv("copy_number_estimates.tsv", sep="\t", index=False)
This works well for two genes, but for the third one I get no reads mapped and no coverage:
- heterologous_gene, avg_coverage, estimated_copy_number
- gene1, 0.0, 0.0
- gene2, 803.0, 1.5689722547870262
- gene3, 564.0, 1.1019929660023446
I also tried:
- Relaxing alignment parameters (lower identity thresholds) but I still get 0 reads for gene1
- Searching with k-mers and I get partial matches to the gene sequence, but the full gene does not map
- Repeating with another sample, but getting the same result
The strain still grows on the substrate that the missing gene should metabolise, so functionally the gene seems to be present.
Could this gene be located on a plasmid or in mitochondrial DNA? And if so, why might I not be seeing any reads mapping, given it is still DNA?
Any idea would be greatly appreciated and thank you so much again!
You will have to identify where the gene got inserted in the genome (assuming it is integrated and not being maintained extra-chromosomally): identifying transgene insertion site in WGS
As for the number of copies you may be able to find the insertion sites (if there is more than one) but finding if there is more than one copy inserted at each location may be difficult without access to long read data.