Estimating copy number of heterologous genes from WGS data in Yarrowia lipolytica
3
0
Entering edit mode
18 days ago
Giorgia ▴ 10

Hello,

I have paired-end Illumina WGS data (R1.fq / R2.fq) for a Yarrowia lipolytica strain that contains some heterologous genes. These genes are absent from the reference genome and were integrated into the strain. My goal is to determine the copy number of each heterologous gene (expected: copy number in the reference genome is 0) in the genome.

What is the best workflow to estimate copy number for genes not in the reference genome?

Any steps, best practices, or example workflows would be very useful as I am very new to this!

Thank you!

wgs CNVs copy-number • 2.8k views
ADD COMMENT
2
Entering edit mode

You will have to identify where the gene got inserted in the genome (assuming it is integrated and not being maintained extra-chromosomally): identifying transgene insertion site in WGS

As for the number of copies you may be able to find the insertion sites (if there is more than one) but finding if there is more than one copy inserted at each location may be difficult without access to long read data.

ADD REPLY
1
Entering edit mode
17 days ago

I agree with what Genomax said.

Also try an assembly with Spades which should be easy enough for this small genome. It will not be perfect - hint - long reads are far better for assembly - but might give you some hints.

Then compare the contigs to the ref genome with a tool like RagTag.

It could be your (assuming multi copy) extra genes are assembled together if they are highly related, so have a look for false "heterozygous" SNPs and other fun which may indicate "overassembly" of related copies. If you have or can get long reads too it will be far easier and more definitive with an assembler like hifiasm.

ADD COMMENT
1
Entering edit mode
17 days ago
cmdcolin ★ 4.3k

I am not sure if the terminology is exactly the same but i am going to assume heterologous gene and transgene is a similar concept. there may be some specialized tools to help find transgene insertion sites

here is a tool called TC-hunter for paired end reads https://pmc.ncbi.nlm.nih.gov/articles/PMC8859905/ https://github.com/vborjesson/TC_hunter

here is one for nanopore https://github.com/jhuapl-bio/TgIF

searching the literature for similar transgene insertion site finders has many papers. I couldn't find anything specifically for just copy number estimation. it is possible that maybe you don't need to find the exact insertion site to estimate copy number, and could instead do it using read coverage or k-mer spectrum, but I didn't find references to back that up yet

ADD COMMENT
1
Entering edit mode
16 days ago
Giorgia ▴ 10

Hi all,

Thank you so much for the suggestions!

I initially considered assembling the genome, but since I only have short-read data, I thought it might be tricky. Instead, I used a mapping-based approach where I created a "custom reference genome" by adding my three heterologous genes as separate contigs. My workflow looks like this:

# Step 1: Index the custom reference genome
bwa index custom_reference.fasta

# Step 2: Align paired-end reads to the reference genome
bwa mem custom_reference.fasta R1.fq R2.fq > aligned.sam

# Step 3: Convert SAM to sorted/indexed BAM
samtools view -Sb aligned.sam > aligned.bam
samtools sort aligned.bam -o aligned.sorted.bam
samtools index aligned.sorted.bam

# Step 4: Calculate per-base coverage
samtools depth aligned.sorted.bam > depth.txt

# Step 5: Calculate coverage for specific regions
bedtools coverage -a heterologous_genes.bed -b aligned.sorted.bam > heterologous_coverage.tsv
bedtools coverage -a single_copy_genes.bed -b aligned.sorted.bam > single_copy_coverage.tsv

Finally, I used a small Python script to estimate copy number by taking the average coverage of each heterologous gene and dividing it by the mean coverage of known single-copy endogenous genes:

heterologous_file, single_copy_file = sys.argv[1], sys.argv[2]
colnames = ["chr", "start", "end", "gene", "bases_covered", "length", "region_length", "coverage_fraction"]

hetero_df = pd.read_csv(heterologous_file, sep="\t", header=None, names=colnames)
single_df = pd.read_csv(single_copy_file, sep="\t", header=None, names=colnames)

hetero_mean_cov = hetero_df.groupby("gene")["bases_covered"].mean()
single_mean_cov = single_df.groupby("gene")["bases_covered"].mean()
mean_single_cov = single_mean_cov.mean()

copy_number_df = pd.DataFrame({
    "heterologous_gene": hetero_mean_cov.index,
    "avg_coverage": hetero_mean_cov.values,
    "estimated_copy_number": hetero_mean_cov.values / mean_single_cov
})

copy_number_df.to_csv("copy_number_estimates.tsv", sep="\t", index=False)

This works well for two genes, but for the third one I get no reads mapped and no coverage:

  • heterologous_gene, avg_coverage, estimated_copy_number
  • gene1, 0.0, 0.0
  • gene2, 803.0, 1.5689722547870262
  • gene3, 564.0, 1.1019929660023446

I also tried:

  • Relaxing alignment parameters (lower identity thresholds) but I still get 0 reads for gene1
  • Searching with k-mers and I get partial matches to the gene sequence, but the full gene does not map
  • Repeating with another sample, but getting the same result

The strain still grows on the substrate that the missing gene should metabolise, so functionally the gene seems to be present.

Could this gene be located on a plasmid or in mitochondrial DNA? And if so, why might I not be seeing any reads mapping, given it is still DNA?

Any idea would be greatly appreciated and thank you so much again!

ADD COMMENT
1
Entering edit mode

Have you checked to see where the genes are actually integrated in the genome and if there are multiple locations? How much sequence data are you using in this analysis. What is the theoretical fold coverage? e.g. If genome was 1 mb and you had 10 mb of sequence in total that would be 10x coverage (not guaranteed to be uniform).

but for the third one I get no reads mapped and no coverage

If the third gene is essential for growth on the media being used then this is a puzzling result. Are trans genes truly heterologous i.e. there are no other sequences in original genome that can cross map. Perhaps the reads are being preferentially mapped to genomic regions.

Could this gene be located on a plasmid or in mitochondrial DNA?

Unless something was done to remove plasmids/mitochondria they should still be present in the original DNA. You have reads that are otherwise mapping to these regions.

ADD REPLY
0
Entering edit mode

Strange but interesting results! It really looks like gene1 is not present, the reads should be present assuming you have managed to include plasmid or mito DNA in your prep.

You can check your control strains without gene1 to see if they really cannot metabolise the substrate.

In terms of bioinformatics, you can look for functional homologues to gene1 using Psi-blast, HMMER etc. Or take a modern approach - predict a 3D model of gene1 using Colabfold etc, then use foldseek to check for highly similar structural homologues which might compensate for the lack of gene1.

ADD REPLY

Login before adding your answer.

Traffic: 1813 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6