Question

Estimating copy number of heterologous genes from WGS data in Yarrowia lipolytica

0

Entering edit mode

18 days ago

Giorgia ▴ 10

Hello,

I have paired-end Illumina WGS data (R1.fq / R2.fq) for a Yarrowia lipolytica strain that contains some heterologous genes. These genes are absent from the reference genome and were integrated into the strain. My goal is to determine the copy number of each heterologous gene (expected: copy number in the reference genome is 0) in the genome.

What is the best workflow to estimate copy number for genes not in the reference genome?

Any steps, best practices, or example workflows would be very useful as I am very new to this!

Thank you!

wgs CNVs copy-number • 2.8k views

ADD COMMENT • link updated 14 days ago by colindaven 7.8k • written 18 days ago by Giorgia ▴ 10

2

Entering edit mode

You will have to identify where the gene got inserted in the genome (assuming it is integrated and not being maintained extra-chromosomally): identifying transgene insertion site in WGS

As for the number of copies you may be able to find the insertion sites (if there is more than one) but finding if there is more than one copy inserted at each location may be difficult without access to long read data.

ADD REPLY • link 18 days ago by GenoMax 153k

score 1 · Answer 1 · 2025-08-14

I agree with what Genomax said.

Also try an assembly with Spades which should be easy enough for this small genome. It will not be perfect - hint - long reads are far better for assembly - but might give you some hints.

Then compare the contigs to the ref genome with a tool like RagTag.

It could be your (assuming multi copy) extra genes are assembled together if they are highly related, so have a look for false "heterozygous" SNPs and other fun which may indicate "overassembly" of related copies. If you have or can get long reads too it will be far easier and more definitive with an assembler like hifiasm.

score 1 · Answer 2 · 2025-08-14

I am not sure if the terminology is exactly the same but i am going to assume heterologous gene and transgene is a similar concept. there may be some specialized tools to help find transgene insertion sites

here is a tool called TC-hunter for paired end reads https://pmc.ncbi.nlm.nih.gov/articles/PMC8859905/ https://github.com/vborjesson/TC_hunter

here is one for nanopore https://github.com/jhuapl-bio/TgIF

searching the literature for similar transgene insertion site finders has many papers. I couldn't find anything specifically for just copy number estimation. it is possible that maybe you don't need to find the exact insertion site to estimate copy number, and could instead do it using read coverage or k-mer spectrum, but I didn't find references to back that up yet

score 1 · Answer 3 · 2025-08-15

Hi all,

Thank you so much for the suggestions!

I initially considered assembling the genome, but since I only have short-read data, I thought it might be tricky. Instead, I used a mapping-based approach where I created a "custom reference genome" by adding my three heterologous genes as separate contigs. My workflow looks like this:

# Step 1: Index the custom reference genome
bwa index custom_reference.fasta

# Step 2: Align paired-end reads to the reference genome
bwa mem custom_reference.fasta R1.fq R2.fq > aligned.sam

# Step 3: Convert SAM to sorted/indexed BAM
samtools view -Sb aligned.sam > aligned.bam
samtools sort aligned.bam -o aligned.sorted.bam
samtools index aligned.sorted.bam

# Step 4: Calculate per-base coverage
samtools depth aligned.sorted.bam > depth.txt

# Step 5: Calculate coverage for specific regions
bedtools coverage -a heterologous_genes.bed -b aligned.sorted.bam > heterologous_coverage.tsv
bedtools coverage -a single_copy_genes.bed -b aligned.sorted.bam > single_copy_coverage.tsv

Finally, I used a small Python script to estimate copy number by taking the average coverage of each heterologous gene and dividing it by the mean coverage of known single-copy endogenous genes:

heterologous_file, single_copy_file = sys.argv[1], sys.argv[2]
colnames = ["chr", "start", "end", "gene", "bases_covered", "length", "region_length", "coverage_fraction"]

hetero_df = pd.read_csv(heterologous_file, sep="\t", header=None, names=colnames)
single_df = pd.read_csv(single_copy_file, sep="\t", header=None, names=colnames)

hetero_mean_cov = hetero_df.groupby("gene")["bases_covered"].mean()
single_mean_cov = single_df.groupby("gene")["bases_covered"].mean()
mean_single_cov = single_mean_cov.mean()

copy_number_df = pd.DataFrame({
    "heterologous_gene": hetero_mean_cov.index,
    "avg_coverage": hetero_mean_cov.values,
    "estimated_copy_number": hetero_mean_cov.values / mean_single_cov
})

copy_number_df.to_csv("copy_number_estimates.tsv", sep="\t", index=False)

This works well for two genes, but for the third one I get no reads mapped and no coverage:

heterologous_gene, avg_coverage, estimated_copy_number
gene1, 0.0, 0.0
gene2, 803.0, 1.5689722547870262
gene3, 564.0, 1.1019929660023446

I also tried:

Relaxing alignment parameters (lower identity thresholds) but I still get 0 reads for gene1
Searching with k-mers and I get partial matches to the gene sequence, but the full gene does not map
Repeating with another sample, but getting the same result

The strain still grows on the substrate that the missing gene should metabolise, so functionally the gene seems to be present.

Could this gene be located on a plasmid or in mitochondrial DNA? And if so, why might I not be seeing any reads mapping, given it is still DNA?

Any idea would be greatly appreciated and thank you so much again!