Question

RNA-seq Nanopore and Illumina hybrid assembly to improve gene annotation

5.3 years ago

nlehmann ▴ 150

Hello,

I am working with chicken data, and we would like to improve the gene annotation by combining short-read (Illumina) and long-read (Nanopore) data. We therefore decided to build a de novo transcriptome assembly, guided by the available chicken genome.

I tried different approaches:

StringTie2, which gives a lot of artifacts (we end up with 60,000 genes!)
Scallop-LR, which does not work for us (it only handles PacBio data)
Scallop, which works fine but also gives a lot of artifacts

In each case, I tried a) running the two datasets together (in one run) and b) running the two datasets separately and then merging the results. The caveat of a) is that the parameters suited to long reads are very different from those suited to short reads, so I have to choose something "in between" that is optimal for neither. The caveat of b) is that it increases the number of genes detected, because we keep lots of artifactual transcripts.

Of course, I could use more stringent parameters for the merging, but I am wondering whether any of you have experience integrating short and long reads. How would you reduce the number of false positives? I know I could also use Mikado, Trinity, or IDP-denovo for this kind of problem. Any feedback on using these tools (or any others) in this context would be welcome!

Thanks

RNA-Seq Nanopore Illumina assembly • 1.9k views

updated 5.2 years ago by colindaven 7.7k • written 5.3 years ago by nlehmann ▴ 150

Have you used the available chicken transcripts?

5.3 years ago by GenoMax 152k

Yes, but we found a lot of signal outside the annotated genes and wanted to investigate further.

5.3 years ago by nlehmann ▴ 150

Do you get a lot of full-length RNAs in the Nanopore data? Try mapping the reads to the known transcriptome. I would take the approach of assembling with the long reads and correcting with the short reads. It's black magic, though; there is no single protocol that works for every dataset.

5.3 years ago by Asaf 10k

The Nanopore coverage I have is very low, so I'm not sure how relevant this would be, but I could try.

5.3 years ago by nlehmann ▴ 150

score 0 · Answer 1 · 2020-04-29

Why not use GMAP with the GFF3 output setting to map a) the generated transcripts and b) the Nanopore reads to the genome? Then you can compare their relative quality.
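
One way to compare the two GMAP outputs is to summarize per-alignment coverage and identity. The sketch below is only an illustration: it assumes GMAP was run with its gene-style GFF3 output and that each `mRNA` row carries `coverage` and `identity` attributes (worth checking against your own files). The function names and the 90/95 thresholds are arbitrary choices, not anything prescribed by GMAP.

```python
from statistics import median


def parse_attributes(field):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def summarize_alignments(gff3_lines, min_coverage=90.0, min_identity=95.0):
    """Summarize coverage/identity over the mRNA rows of a GFF3 file.

    Assumes (not guaranteed) that mRNA rows have 'coverage' and 'identity'
    attributes expressed as percentages; rows without them are skipped.
    """
    coverages, identities = [], []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        try:
            cov = float(attrs["coverage"])
            ident = float(attrs["identity"])
        except (KeyError, ValueError):
            continue
        coverages.append(cov)
        identities.append(ident)

    n = len(coverages)
    passing = sum(
        1 for c, i in zip(coverages, identities)
        if c >= min_coverage and i >= min_identity
    )
    return {
        "n_alignments": n,
        "median_coverage": median(coverages) if n else None,
        "median_identity": median(identities) if n else None,
        "fraction_passing": passing / n if n else None,
    }
```

Running `summarize_alignments(open(path))` once on the transcript alignments and once on the read alignments gives two small dictionaries that can be compared side by side.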

I can't think of a way you'll gain much by trying to correct very low-coverage Nanopore data. Maybe more data is needed?

Also, you could filter the StringTie transcripts, keeping only those that contain an ORF of a suitable size, e.g. using TransDecoder.
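
To make the ORF-size idea concrete, here is a rough Python sketch. It is not TransDecoder and does not reproduce its scoring; it only finds the longest complete ATG-to-stop ORF in each transcript sequence and keeps transcripts above a length cutoff. The helper names and the 100-amino-acid default are assumptions chosen for illustration. Note that any filter of this kind also discards genuine non-coding transcripts.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_aa(seq, both_strands=True):
    """Length (in amino acids, stop excluded) of the longest complete ORF.

    Only ORFs that start with ATG and end with an in-frame stop are counted;
    partial ORFs running off the end of the transcript are ignored.
    """
    seq = seq.upper().replace("U", "T")
    strands = [seq, reverse_complement(seq)] if both_strands else [seq]
    best = 0
    for strand in strands:
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None:
                    if codon == "ATG":
                        start = i
                elif codon in STOP_CODONS:
                    best = max(best, (i - start) // 3)
                    start = None
    return best


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_orf(records, min_aa=100):
    """Return names of transcripts whose longest complete ORF is >= min_aa."""
    return [name for name, seq in records if longest_orf_aa(seq) >= min_aa]
```

For example, `filter_by_orf(read_fasta(open("transcripts.fa")))` (with a hypothetical FASTA of the assembled transcripts) returns the IDs to keep. For stranded multi-exon transcripts, `both_strands=False` may be the more appropriate setting.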

This sounds a bit like a major data integration project, which will take a lot of time. Generating new data (or finding more public data) might save you much of that time and give much better results.