RNA-seq Nanopore and Illumina hybrid assembly to improve gene annotation
4.7 years ago
nlehmann ▴ 150

Hello,

I am working with chicken data, and we would like to improve the gene annotation by combining short-read (Illumina) and long-read (Nanopore) data. We therefore decided to build a genome-guided transcriptome assembly using the available chicken genome.

I tried different approaches:

  • StringTie2, which gives a lot of artifacts (we end up with 60,000 genes!)
  • Scallop-LR, which does not work on our data (it only supports PacBio data)
  • Scallop, which works fine but also gives a lot of artifacts

In each case, I tried to a) run the two datasets together (in one run) and b) run the two datasets separately and then merge the results. The caveat of a) is that the parameters suited to long reads are very different from those suited to short reads, so I have to choose something "in-between" that is optimal for neither. The caveat of b) is that it increases the number of genes detected, because we keep lots of artifactual transcripts.

Of course, I could use more stringent parameters for the merging, but I am wondering whether any of you have experience with integrating short and long reads. How would you reduce the number of false positives? I know I could also use Mikado, Trinity, or IDP-denovo for this kind of issue. Any feedback on using these tools (or any other) in this context would be welcome!

Thanks

RNA-Seq Nanopore Illumina assembly • 1.5k views

Have you used the available chicken transcripts?


Yes, but we found lots of signal outside of the annotated genes and wanted to investigate further


Do you get a lot of full-length RNAs in the Nanopore data? Try mapping the reads to the known transcriptome. I would assemble with the long reads and correct with the short reads. It's black magic though; there is no one protocol that works for all.


The coverage I have in the Nanopore data is very low, so I'm not sure this would be very relevant, but I could try.

4.7 years ago

Why not use GMAP with the gff3 output setting to map a) the generated transcripts and b) the Nanopore reads to the genome? Then you can check their relative quality.
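As a rough illustration of what "checking the relative quality" could look like, here is a minimal Python sketch that counts how many aligned sequences pass coverage and identity cut-offs. It assumes the GFF3 has `mRNA` rows whose attribute column carries `coverage=` and `identity=` keys, as GMAP's gene-style GFF3 output does as far as I recall; verify this against your own files. The function names, the thresholds and the example rows are all made up for illustration.

```python
def parse_attributes(field):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def summarize_alignments(gff3_lines, min_coverage=90.0, min_identity=95.0):
    """Return (total, good) counts of mRNA rows.

    'good' rows meet both cut-offs. Assumption: each mRNA row has
    'coverage' and 'identity' attributes; rows lacking them are skipped.
    """
    total = good = 0
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        try:
            coverage = float(attrs["coverage"])
            identity = float(attrs["identity"])
        except (KeyError, ValueError):
            continue
        total += 1
        if coverage >= min_coverage and identity >= min_identity:
            good += 1
    return total, good


# Hypothetical rows, only to show the expected shape of the input.
example = [
    "##gff-version 3",
    "chr1\tdb\tmRNA\t100\t900\t.\t+\t.\tID=t1.mrna1;coverage=99.5;identity=99.0",
    "chr1\tdb\texon\t100\t900\t99\t+\t.\tID=t1.mrna1.exon1",
    "chr2\tdb\tmRNA\t50\t400\t.\t-\t.\tID=t2.mrna1;coverage=45.0;identity=88.2",
]
total, good = summarize_alignments(example)
print(f"{good}/{total} alignments pass the cut-offs")
```

Running it once on the GFF3 from the assembled transcripts and once on the GFF3 from the raw Nanopore reads gives two pass rates to compare. Sequences that fail to align at all never appear in the GFF3, so they would need to be counted separately from the input FASTA.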

I can't think of a way you'll gain much by trying to correct very low coverage Nanopore data. Maybe more data is needed?

Also, you could filter the StringTie transcripts, keeping only those that contain an ORF of a suitable size, e.g. using TransDecoder.
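To make the ORF idea concrete, here is a minimal Python sketch of such a filter. It is not TransDecoder and not a replacement for it: it only looks for complete ATG-to-stop ORFs on both strands and keeps transcripts whose longest ORF reaches a minimum length. The function names and the 100-codon default are my own choices for illustration, and transcripts with only partial ORFs (which dedicated tools can report) would be dropped by this naive version.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_codons(seq):
    """Length in codons (ATG included, stop excluded) of the longest
    complete ATG..stop ORF over the six reading frames."""
    seq = seq.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    best = max(best, (i - start) // 3)
                    start = None
    return best


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_orf(records, min_codons=100):
    """Names of transcripts whose longest complete ORF has >= min_codons."""
    return [name for name, seq in records
            if longest_orf_codons(seq) >= min_codons]


# Toy input: 'tx1' has a 6-codon ORF, 'tx2' has none.
toy_fasta = [
    ">tx1 toy transcript",
    "ATGGCTGCTGCTGCTGCTTAA",
    ">tx2",
    "ACGTACGT",
]
kept = filter_by_orf(read_fasta(toy_fasta), min_codons=5)
print(kept)
```

On real data you would first extract the transcript sequences from the assembly GTF into a FASTA file, pass an open file handle to `read_fasta`, and use the returned names to subset the GTF.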

This sounds a bit like a major data integration project, which will take a lot of time. Generating new data (or finding more public data) might save you much of that time and give much better results.
