Question

Convert Long Nanopore reads to Illumina Paired end reads

10 weeks ago

Mark ▴ 30

Is there a way to convert long nanopore reads into Illumina paired-end reads (not actual Illumina reads, just simulated ones)? That is, only the ends of each nanopore read are retained, and these are split into paired-end files. I need this for a niche subworkflow that takes distant Illumina paired-end reads as input, but I only have nanopore reads of this section.

illumina sequencing reads nanopore • 467 views

updated 10 weeks ago by Pierre Lindenbaum 166k • written 10 weeks ago by Mark ▴ 30

score 2 · Answer 1 · 2025-01-30

I assume it is possible to extract two sequences of, e.g., 150 bp from each read, separated by a distance corresponding to the insert size you want; one of them would need to be reverse complemented. The question is how much that would help: the error profile, quality scores, and adapter sequences would not resemble those of real Illumina data. If you just want to test some tools, it might be better to make an assembly from the long reads and then use a read simulator to simulate Illumina reads.
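The end-extraction idea can be sketched in a few lines of Python. This is a minimal illustration, not a tested tool: the function names (`read_fastq`, `ends_to_pairs`) and the `read_len` parameter are made up for this example, the input is assumed to be plain 4-line FASTQ, and reads shorter than twice the requested length are simply skipped. The "insert size" of each pair is whatever the length of the original nanopore read was, and the bases and qualities keep their nanopore error profile.

```python
import io

# Complement table for reverse-complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Keep only the read ID, dropping the leading '@' and any description.
        yield header[1:].split()[0], seq, qual


def ends_to_pairs(long_reads, r1_out, r2_out, read_len=150):
    """Write the first read_len bases of each long read as R1 and the
    reverse-complemented last read_len bases as R2.

    Returns the number of pairs written.
    """
    written = 0
    for name, seq, qual in read_fastq(long_reads):
        if len(seq) < 2 * read_len:
            continue  # too short for two non-overlapping ends (assumption: skip)
        r1_seq, r1_qual = seq[:read_len], qual[:read_len]
        r2_seq = reverse_complement(seq[-read_len:])
        r2_qual = qual[-read_len:][::-1]  # qualities are reversed, not complemented
        r1_out.write(f"@{name}/1\n{r1_seq}\n+\n{r1_qual}\n")
        r2_out.write(f"@{name}/2\n{r2_seq}\n+\n{r2_qual}\n")
        written += 1
    return written
```

With real files this would be called with three open text handles, for example `ends_to_pairs(open("nanopore.fastq"), open("R1.fastq", "w"), open("R2.fastq", "w"))`; the file names here are placeholders.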

score 1 · Answer 2 · 2025-01-30

I just wrote jvarkit/biostar9608448 for fun. It takes a single-end BAM and outputs a paired-end BAM with reads of length 'x' (-L 10 in the example below): https://jvarkit.readthedocs.io/en/latest/Biostar9608448/

$ java -jar dist/jvarkit.jar  biostar9608448 src/test/resources/FAB23716.nanopore.bam -L 10 | samtools view | head
44767a9a-a0b9-4d7e-a324-d0d3ea113d8c_Basecall_Alignment_template    67  chr1    17123   38  6M2D4M  =   31047   13925   GTGCGCCGCT  -2.1.,(')-  MC:Z:10M
44767a9a-a0b9-4d7e-a324-d0d3ea113d8c_Basecall_Alignment_template    147 chr1    31047   38  10M =   17123   -13925  CACCTTGAAC  &#&$$$%$$%  MC:Z:6M2D4M
d324a4bc-aa2c-4ee8-be69-934cc58c0003_Basecall_Alignment_template    67  chr1    38469   1   10M =   43735   5267    ATGCTGCCTG  2-,.314443  MC:Z:10M
d324a4bc-aa2c-4ee8-be69-934cc58c0003_Basecall_Alignment_template    147 chr1    43735   1   10M =   38469   -5267   AGCAAACTTT  -',12()./.  MC:Z:10M
76862e2e-98eb-4ad3-a523-6a8709c0b56a_Basecall_Alignment_template    67  chr1    44403   0   10M =   44481   79  TCAACAACAA  &&&%&)'''&  MC:Z:10M
76862e2e-98eb-4ad3-a523-6a8709c0b56a_Basecall_Alignment_template    147 chr1    44481   0   10M =   44403   -79 GGTAGCCGAA  ''&$%(((%*  MC:Z:10M
3330d9a6-d2a9-423b-accc-92a6d1fe646e_Basecall_Alignment_template    67  chr1    52105   1   8M2D2M  =   53738   1634    ATTCCTACGA  %).,.%+$))  MC:Z:10M
3330d9a6-d2a9-423b-accc-92a6d1fe646e_Basecall_Alignment_template    147 chr1    53738   1   10M =   52105   -1634   ACTTAGGCAA  ,)((%%''((  MC:Z:8M2D2M
c6055e6a-9a1c-4126-84ec-64549fd4d264_Basecall_Alignment_template    67  chr1    63945   5   10M =   67887   3943    TCACCATGAT  *+'*-.111-  MC:Z:10M
c6055e6a-9a1c-4126-84ec-64549fd4d264_Basecall_Alignment_template    147 chr1    67887   5   10M =   63945   -3943   AGTATTATCA  +$+*+/((*&  MC:Z:10M
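To read the second column of this output, it can help to decode the SAM flags. The snippet below is a small sketch using the bit values defined in the SAM specification; the `decode_flag` helper and the short labels are made up for this example.

```python
# Bit values of the SAM FLAG field, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(67))   # ['paired', 'proper_pair', 'read1']
print(decode_flag(147))  # ['paired', 'proper_pair', 'reverse', 'read2']
```

So each nanopore read yields a forward first read and a reverse-strand second read, which is the usual inward-facing paired-end layout. Flag 67 does not set the mate-reverse bit (0x20), which aligners typically do set for such pairs (giving 99); this may or may not matter, depending on what the downstream tool checks.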

score 0 · Answer 3 · 2025-01-30

Depending on how long your nanopore reads are (think 1 to several kb) and how many of them you have, you may be able to generate a set of Illumina reads using something like randomreads.sh from the BBMap suite. You may need to first convert the reads to FASTA format using reformat.sh and then use them as input for the Illumina read generation.

As Michael points out, this assumes that you are interested in just the sequence and not the error profile of the data you have.

"In the sense that only the ends of the nanopore reads are retained and these are split into paired end files."

If that is an absolute requirement, you may need to write something custom to extract the ends.