Question

How to see how many times a sequence of ~7000 bp occurs in long reads?

1

Entering edit mode

6.3 years ago

bas1993 ▴ 60

So I have pacbio and MinION reads and I want to know if a genetically modified region is 1 time present or multiple times. Normal mapping doesn't give me the answers that I want as the reads are soft-clipped on the known sequence so I have no information about what else is in the read. So instead of using the known sequence as reference and the reads as query I was thinking about using the reads in a multi-fasta file as a reference and use the known sequence as query. Which tool is the best for this? a local BLAST? Or am I thinking in a wrong way.

alignment next-gen sequencing • 1.6k views

ADD COMMENT • link updated 6.2 years ago by Biostar 20 • written 6.3 years ago by bas1993 ▴ 60

1

Entering edit mode

minimap2 should be the aligner of choice in this case. Is that what you already used?

You could do it with BLAST but depending on what kind of local alignments you see (if there are smaller repeat regions) you may have to play with BLAST settings to get the result you need. If you expect the 7kb sequence (or parts of it) to be present as is in the reads then even blat may be useful to find them.

ADD REPLY • link 6.3 years ago by GenoMax 152k

0

Entering edit mode

But will I not get the same problem as with other mappers. I know the genetic modification is present in the reads, but I would like to know if it is present multiple times (or 1 time completely and then half of the sequence).

ADD REPLY • link 6.3 years ago by bas1993 ▴ 60

0

Entering edit mode

Do you not need to assemble first to figure this out? How will you (easily) detect multiple genetically modified versions when you will already get many, many reads mapping?

ADD REPLY • link 6.3 years ago by Joe 22k

0

Entering edit mode

I already did an assembly and in the .gfa files it looked like circular contig that is connected to the chromosome. So I hypothised that maybe the assembly was influenced by some of the reads that are shorter than the modification. That is why I extracted all the reads that are longer than the modification and wanted to some kind of alignment.

ADD REPLY • link 6.3 years ago by bas1993 ▴ 60

0

Entering edit mode

I did not do this with a 7 kb sequence but with a test sequence (there are two copies in second read) I see the following.

$ more query.fa 
>test
TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATT

$ less -S new.fq 
@a0ab7a02-b816-4921-ae1a-61fe262cb994 runid=a4bcea1a5f9dd19359394 read=6 ch=366 start_time=2018-06-28T00:49:12Z
TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGCATTGAAAAAATCTTTAAAAAAATCTTTAAAAAAATGACACGCACGCAATAAATAGGGGAAAATCTTCCCCCCCAATTCGGGCACTAAAAGGAATTTTTTTTCCTTCCCTGTCATCTAGAGGAATTTA
+
$%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-01/9118;;<2.02<2135;<<-+,0<:059:;73,./301(*&)$&/3149;2))24*,37/)-467259;6)/.,.*'/-.76/7;6'-(*++29;<<6-7751:7(+89967637**+...4<.+
@a0ab7a02-b816-4921-bc1a-61fe262cb994 runid=a4bcea1a5f9dd19359394 read=7 ch=366 start_time=2018-06-28T00:49:12Z
TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGCATTGAAAAAATCTTTAAAAAAATCTTTAAAAAAATGACACGCACGCAATAAATAGGGGAAAATCTTCCCCCCCAATTCGGGCACTAAAAGGAATTTTTTTTCCTTCCCTGTCATCTAGAGGAATTTATAGAGATCGGTGGCTGGTTCAGTTACGTA
+
$%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-01/9118;;<2.02<2135;<<-+,0<:059:;73,./301(*&)$&/3149;2))24*,37/)-467259;6)/.,.*'/-.76/7;6'-(*++29;<<6-7751:7(+89967637**+...4<.+$%''&'''&''('%',3.32./6/%$%('

Using the following, I see the two copies of query sequence

$ minimap2 -ax map-ont query.fa new.fq

@SQ     SN:test LN:77
@PG     ID:minimap2     PN:minimap2     VN:2.15-r905    CL:minimap2 -ax map-ont query.fa new.fq
a0ab7a02-b816-4921-ae1a-61fe262cb994    0       test    1       53      77M134S *       0       0       TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGCATTGAAAAAATCTTTAAAAAAATCTTTAAAAAAATGACACGCACGCAATAAATAGGGGAAAATCTTCCCCCCCAATTCGGGCACTAAAAGGAATTTTTTTTCCTTCCCTGTCATCTAGAGGAATTTA     $%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-01/9118;;<2.02<2135;<<-+,0<:059:;73,./301(*&)$&/3149;2))24*,37/)-467259;6)/.,.*'/-.76/7;6'-(*++29;<<6-7751:7(+89967637**+...4<.+     NM:i:0  ms:i:154        AS:i:154        nn:i:0  tp:A:P  cm:i:10 s1:i:71 s2:i:0  de:f:0
a0ab7a02-b816-4921-bc1a-61fe262cb994    0       test    1       53      211S77M7S       *       0       0       TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGCATTGAAAAAATCTTTAAAAAAATCTTTAAAAAAATGACACGCACGCAATAAATAGGGGAAAATCTTCCCCCCCAATTCGGGCACTAAAAGGAATTTTTTTTCCTTCCCTGTCATCTAGAGGAATTTATAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATTCTCAGGC $%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-01/9118;;<2.02<2135;<<-+,0<:059:;73,./301(*&)$&/3149;2))24*,37/)-467259;6)/.,.*'/-.76/7;6'-(*++29;<<6-7751:7(+89967637**+...4<.+$%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&))).$'-0 NM:i:0  ms:i:154        AS:i:154        nn:i:0  tp:A:P  cm:i:10s1:i:71  s2:i:0  de:f:0  SA:Z:test,1,+,77M218S,53,0;
a0ab7a02-b816-4921-bc1a-61fe262cb994    2048    test    1       53      77M218H *       0       0       TAGAGATCGGTGGCTGGTTCAGTTACGTATTGCTTTTTTCCCGCTCGGGCACCAAACACCAAAGAATCATTTGCATT   $%''&'''&''('%',3.32./6/%$%('++&,0199;-.8*'122-6683829;;6/5*//'&&0,1,84+&)%&)   NM:i:0  ms:i:154        AS:i:154        nn:i:0  tp:A:P  cm:i:10 s1:i:71 s2:i:0  de:f:0  SA:Z:test,1,+,211S77M7S,53,0;

ADD REPLY • link 6.3 years ago by GenoMax 152k

0

Entering edit mode

So you essentially knocked in a certain sequence which is not part of the genome, you don't know its location and want to know how many integration locations you have?

ADD REPLY • link 6.3 years ago by WouterDeCoster 48k

0

Entering edit mode

To give some more background information, I didn't knock in the gene myself but a whole plasmid is integrated in the chromosome by an external company in a food bacteria. The documentation of this modification is not correct and that is why we sequenced it. The plasmid is around 7000 basepairs and I know the location. So far I extracted some reads manually and BLASTed them. For some reads I could see 2 plasmids but other times I got some conflicting results and it is also not feasible to do this manually for thousands of reads.

ADD REPLY • link 6.3 years ago by bas1993 ▴ 60