Large intron or bad read polishing of tandemly duplicated genes?
1
0
Entering edit mode
2.4 years ago
Ben Nestor ▴ 10

Hi, I've assembled a genome for a non-model plant species using nanopore reads. As part of my process I polished it with medaka (nanopore reads) and nextpolish (illumina reads). It seemed fine at the time based on busco scores and mapping rates, but further down the track I've been looking at a specific gene family and realised that two of these genes close to each other on a chromosome have large predicted introns (~10,000bp).

At each end of the introns there are half gene-like sequences and there is also a full gene like sequence in the middle...they have inframe stop codons in the assembly though. I aligned just the illumina reads and realised that some of the stop codons in my assembly in this region are not supported by the illumina reads - meaning there are potentially 6 genes (from tandem duplication) instead of the two I identified!

I'm not able to reassemble the whole genome as I am nearing the end of my PhD, but my supervisors want me to figure out a proper answer before submitting the paper...

I'm not sure what to do. Would it be acceptable to reassemble the whole chromosome or just this region? Or what about re-polishing with nextpolish?

polishing illumina genome nanopore • 898 views
ADD COMMENT
1
Entering edit mode
2.4 years ago

If you have access to RNA-seq or public RNA-seq data for the species I would definitely align that to your assembly to check for the presence of genes. Also try multiple gene prediction tools, eg Maker and gmap with gff3 output option.

De novo assembly of this RNA-seq might also lead to complete transcripts to map to the assembly. Illumina assemblies would be frameshift free.

An image or two would also be helpful.

Nanopore is definitely the way to go for plant assemblies, but yes, some regions can have errors. Illumina assemblies are unworkably fragmented though.

You could try an assembly of just the illumina reads as well, to demonstrate the increased base accuracy but massively decreased contiguity.

ADD COMMENT
0
Entering edit mode

Thanks for the answer!

Here's a diagram I made previously which explains it. Only the first half of the gene is shown, but there is another full-length BLAST hit in the middle of the ~10,000 bp predicted intron and then a similar situation at the end of the region - a full length BLAST hit, but the predicted CDS only covers half of it.

Start of HpPHT1.11 gene

I've aligned RNAseq also and it gives basically the same result, although it's difficult to know if the reads are correctly aligned (hisat2 splice aware aligning makes it really messy whereas forcing insert size makes it neat but assumes that the large intron is fake - plus all these sequences have high similarity).

I've been thinking about making an illumina alignment, but I'm unsure how I would approach that. Should I reassemble the whole chromosome? Or just the reads that map to this region? And what aligner - reference based or de novo?

ADD REPLY
0
Entering edit mode

I don't know if you have time to manually edit the genome in your genes of interest to remove the ONT homopolymer errors (apparently the new 10.3 or 10.4 ONT kits are far better.

  • you already have the evidence here from the alignments in the screenshots
  • use a powerful text editor such as vim etc to insert the missing bases - illumina is by far the correct technology with homopolymers (alternative - biopython etc)

If you want to look at the introns further

  • reannotate the corrected genome
  • assemble the Illumina RNA-seq (trinity)
  • map the RNA contigs to the corrected genome (gmap, gff3 output)
  • If the introns still show up, they are probably correct

The STAR tool might also be useful as a second opinion to Hisat with reads as input.

I've worked a lot on plants, and don't think 10kbp is too huge an intron at all. I've seen bigger in maize etc (you didn't say anything about genome size etc). Of course, most introns are smaller. You should have the data to resolve this.

ADD REPLY
0
Entering edit mode

In terms of their being potentially six genes instead of two, you should check this by a) distinct RNA-seq expression or non-overlapping genes b) domains c) functional assignments of putative transcripts (interproscan or even blastx etc).

ADD REPLY

Login before adding your answer.

Traffic: 1224 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6