Question

Identification of transgene insertion site using Illumina DNA Sequencing

20 months ago

ccstaats ▴ 40

I would like to request some help. While trying to generate mutants by targeted integration of a deletion cassette in a yeast, we serendipitously isolated some mutants with interesting phenotypes. We confirmed by PCR that the deletion cassette had integrated ectopically and not in the gene of interest. To identify the chromosomal location of the deletion cassette, we used Illumina DNA sequencing for each of the ectopic insertion mutants (40 million reads, 100 nt, paired-end). Now I am trying to identify the insertion site. My first approach was de novo assembly. However, I could not link the selection marker (transgene) to homologous DNA sequences, even with relatively good assembly statistics (the genome size is about 18 Mb).

So I was wondering whether I could use a structural variant (SV) predictor to identify it. Any advice? I have tried BASIL-ANISE, but I cannot find the selection marker among the predicted insertion sequences. Because the selection marker is flanked by homologous sequences, these may be duplicated. If so, the insertion site could be tracked by the presence of duplicated sequences (about 1 kb). Is my rationale OK? Any advice on finding such large insertions (the deletion cassette is about 9 kb) or duplicated sequences (the homologous sequences are about 1 kb)? Thank you for any advice. Best, Charley

Strain 1 quast report

    # contigs (>= 1000 bp)      98          
    # contigs (>= 5000 bp)      83          
    # contigs (>= 10000 bp)     76          
    # contigs (>= 25000 bp)     72          
    # contigs (>= 50000 bp)     61          
    Total length (>= 0 bp)      17460253    
    Total length (>= 1000 bp)   17329730    
    Total length (>= 5000 bp)   17296753    
    Total length (>= 10000 bp)  17237993    
    Total length (>= 25000 bp)  17174558    
    Total length (>= 50000 bp)  16764377    
    # contigs                   116         
    Largest contig              716379      
    Total length                17342266    
    GC (%)                      47.84       
    N50                         394827      
    N90                         114158      
    auN                         379622.1    
    L50                         17          
    L90                         50          
    # N's per 100 kbp           0.00

Strain 2 quast report

    # contigs (>= 1000 bp)      111
    # contigs (>= 5000 bp)      92
    # contigs (>= 10000 bp)     84
    # contigs (>= 25000 bp)     75
    # contigs (>= 50000 bp)     62
    Total length (>= 0 bp)      17585912
    Total length (>= 1000 bp)   17386693
    Total length (>= 5000 bp)   17343986
    Total length (>= 10000 bp)  17285446
    Total length (>= 25000 bp)  17126139
    Total length (>= 50000 bp)  16614740
    # contigs                   149
    Largest contig              778826
    Total length                17411279
    GC (%)                      47.82
    N50                         360176
    N90                         115089
    auN                         382490.0
    L50                         17
    L90                         51
    # N's per 100 kbp           0.00

Transgene • 933 views

Just an update: I will try to identify the insertion site using SoftSV and perSVade. As soon as I get the results, I will post them here. Best, Charley

20 months ago by ccstaats ▴ 40

score 0 · Answer 1 · 2023-09-05

20 months ago

harold.smith.tarheel ★ 5.0k

Your reads of interest will have one end mapped to the insertion/duplication sequence, and the other mapped to the genome. If you include the transgene as part of your reference genome for alignment, identification of those reads will be straightforward (i.e., filter for reads that align to the transgene with discordant mapping).
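As a rough illustration of that filter (a sketch only, not the commands from the linked thread), the Python below scans SAM-format alignment records and counts read pairs in which one mate aligns to the transgene and the other to a genomic contig. It assumes the reads were aligned to a reference with the transgene appended as an extra contig; the contig name `transgene`, the MAPQ cutoff of 20, and the 1 kb bin size are all hypothetical choices for the example. Note that any cassette sequence that also occurs in the genome (such as the ~1 kb homologous flanks) can attract pairs to its native locus, so not every cluster is necessarily an insertion site.

```python
from collections import Counter

# Flag bits defined in the SAM specification
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def candidate_sites(sam_lines, transgene="transgene", min_mapq=20, bin_size=1000):
    """Count genomic reads whose mate aligns to the transgene contig.

    `transgene` is the (assumed) name given to the transgene sequence in the
    reference. Returns a Counter keyed by (contig, bin_start), where bin_start
    is the 0-based start of the bin holding the genomic read's position.
    """
    hits = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        rname, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        rnext = fields[6]
        # Ignore unmapped reads, reads with unmapped mates, and
        # secondary/supplementary alignments
        if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        # Keep only the genomic mate of a transgene/genome pair
        if rname == transgene or rnext != transgene:
            continue
        # Drop ambiguously placed genomic reads
        if mapq < min_mapq:
            continue
        bin_start = (pos - 1) // bin_size * bin_size
        hits[(rname, bin_start)] += 1
    return hits


if __name__ == "__main__":
    # Tiny made-up records, just to show the call; real input would be the
    # lines of a SAM file (for example, the text output of `samtools view`).
    demo = [
        "r1\t65\tchr1\t150200\t60\t100M\ttransgene\t300\t0\t*\t*",
        "r1\t129\ttransgene\t300\t60\t100M\tchr1\t150200\t0\t*\t*",
        "r2\t65\tchr1\t150950\t60\t100M\ttransgene\t8000\t0\t*\t*",
    ]
    for (contig, start), n in candidate_sites(demo).most_common():
        print(contig, start, n)
```

Bins with many supporting pairs are the places to inspect in a genome browser; reads spanning the exact junction would additionally show up as soft-clipped alignments there.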

Note: a more detailed explanation including software/commands can be found in this thread.