Hi, I've assembled a genome for a non-model plant species using nanopore reads. As part of my process I polished it with medaka (nanopore reads) and nextpolish (illumina reads). It seemed fine at the time based on busco scores and mapping rates, but further down the track I've been looking at a specific gene family and realised that two of these genes close to each other on a chromosome have large predicted introns (~10,000bp).
At each end of the introns there are half gene-like sequences and there is also a full gene like sequence in the middle...they have inframe stop codons in the assembly though. I aligned just the illumina reads and realised that some of the stop codons in my assembly in this region are not supported by the illumina reads - meaning there are potentially 6 genes (from tandem duplication) instead of the two I identified!
I'm not able to reassemble the whole genome as I am nearing the end of my PhD, but my supervisors want me to figure out a proper answer before submitting the paper...
I'm not sure what to do. Would it be acceptable to reassemble the whole chromosome or just this region? Or what about re-polishing with nextpolish?
Thanks for the answer!
Here's a diagram I made previously which explains it. Only the first half of the gene is shown, but there is another full-length BLAST hit in the middle of the ~10,000 bp predicted intron and then a similar situation at the end of the region - a full length BLAST hit, but the predicted CDS only covers half of it.
I've aligned RNAseq also and it gives basically the same result, although it's difficult to know if the reads are correctly aligned (hisat2 splice aware aligning makes it really messy whereas forcing insert size makes it neat but assumes that the large intron is fake - plus all these sequences have high similarity).
I've been thinking about making an illumina alignment, but I'm unsure how I would approach that. Should I reassemble the whole chromosome? Or just the reads that map to this region? And what aligner - reference based or de novo?
I don't know if you have time to manually edit the genome in your genes of interest to remove the ONT homopolymer errors (apparently the new 10.3 or 10.4 ONT kits are far better.
If you want to look at the introns further
The STAR tool might also be useful as a second opinion to Hisat with reads as input.
I've worked a lot on plants, and don't think 10kbp is too huge an intron at all. I've seen bigger in maize etc (you didn't say anything about genome size etc). Of course, most introns are smaller. You should have the data to resolve this.
In terms of their being potentially six genes instead of two, you should check this by a) distinct RNA-seq expression or non-overlapping genes b) domains c) functional assignments of putative transcripts (interproscan or even blastx etc).