Question

How do I summarise the structure of the oligo library I made

0

Entering edit mode

10 months ago

Mar • 0

I have a concatenated oligomer dataset that went wrong. It was supposed to have this structure:

adapter1-leftHandle-random6-mer-G-random6-mer-rightHandle-leftHandle-random6-mer-G-random6-mer-rightHandle-leftHandle-random6-mer-G-random6-mer-rightHandle-leftHandle-random6-mer-G-random6-mer-rightHandle..

But I was too naive to think it would work. And something went wrong during library prep and the structure is all messed-up and chimeric:

adapter1-leftHandle-randomStuff-adapter1-leftHandle-adapter1-leftHandle-random6-mer-G-random6-mer-rightHandle-leftHandle-leftHandle-randomStuff-adapter1-adapter2-adapter1 etc

For the next step in my project, I have to align these reads to a reference so that each read has the -MD tag in the alignment.bam file. I can then use these reads for the further step.

I have about 55 million reads. And their sizes range between 45 nucleotides to 4000 nucleotides. I can't just use a simple reference like

leftHandle-random6-mer-G-random6-mer-rightHandle

Because this will map only once to my read which might have more than one occurrence of this construct. And even when I did this, I couldn't figure out what kind of local alignment score matrix to use to ensure at-least aligning with one occurrence per read.

I'm not sure how to capture every occurrence of this construct through alignment.

What kind of reference file do I make?
What scoring matrix should I use?

I have tried using seqkit (with a maximum mismatch of 1 base) to locate the constituent parts of the oligo construct and that function gives me results in a .gtf file.

Example:

(base) bash-4.4$ cat fffb80be-82e6-4519-824d-470e02b06ce8.gtf
fffb80be-82e6-4519-824d-470e02b06ce8    SeqKit  location    1   22  0   +   .   gene_id "dsAdapterTop";
fffb80be-82e6-4519-824d-470e02b06ce8    SeqKit  location    23  32  0   +   .   gene_id "leftHandle";
fffb80be-82e6-4519-824d-470e02b06ce8    SeqKit  location    60  69  0   +   .   gene_id "leftHandle";
fffb80be-82e6-4519-824d-470e02b06ce8    SeqKit  location    83  94  0   +   .   gene_id "rightHandle";

Each read has it's own .gtf file to locate the presence of constituent components of the oligo construct. I was wondering if these files could be of use to try and find every occurrence of the desired construct with their location.

PS: I'm using these .gtf files along with the read .fasta to look at the "features" on IGV to understand the general structure of the reads I've sequenced.

Because nanopore reads are known to be erroneous, and because the library was ssDNA sequenced without any second-strand synthesis, the secondary structures formed by the ssDNA while being sequenced might have introduced mismatches/deletions/insertions. But using seqkit with an arbitary mismatch value will take a lot of time and memory because of the size of the dataset.

I'm confident that there are chimeric reads and secondary structures because of some reads forming structures like this:

2-D structure

Any advice on understanding the structure of the reads, to help re-assess the library prep steps is welcome.

Edit 1: I have solved summarising this information:

Step 1. collapse the columns in the gtf file to a single string containing the coordinates and the feature name only. Each feature following it's coordinates will be delimited by ">". Example:

1-33:dsAdapterBottom>34-43:leftHandle>57-70:rightHandle

Step 2. Randomly sample 1000 lines from the resulting file Step 3. Extract the fasta sequences from the master fasta file for these reads Step 4. Put these 1000 sequences through the zipfold algorithm on the UNAfold webserver to obtain Gibbs Free Energy values for the secondary structures formed by these reads Step 5. Get the secondary structures for a randomly samples 100 reads and visually analyse them on slides Step 6. Create an excel with the following format:

SeqID   FASTA.Squence   Zipfold dG (Kcal/mol) 20C   dG.Numeric  Seq.Len Features
001aa3f0-5809-43ee-a3af-65dca5266564    GTTTGTGGTACTTGGGTGTGTAGAAGAAGCCTTACCACAAACTTGTACGAATGATACAAGTTTGTGGTAAGGCTTCTTCTACACCCAAGTACCACAAACAGCAATAATTT  -62.802 kcal/mol    -62.802 110 1-33:dsAdapterBottom>33-42:leftHandle>90-99:leftHandle
0096b23b-fa24-4713-b228-58f635d15090    GTTTGTGGTACTTGGGTGTGTAGAAGAAGCCAGGCTTCTTCTACACGCCCAAGTGCCGCAAACAGCAATATTTTA -37.879 kcal/mol    -37.879 75  1-33:dsAdapterBottom>32-53:dsAdapterTop
00a2065d-4b40-4219-bd4a-f7a3aede74ee    GTTTGTGGTACTTGGGTGTGTAGAAGAAGCCTTACCACAAACTGGCTAGGGTGTG -10.958 kcal/mol    -10.958 55  1-33:dsAdapterBottom>33-42:leftHandle
00d9cd7d-e545-425f-862c-b3c9d47faf2c    AGGCTTTTCTACACACCCAAGTACCACAAACTACCACAAACCGTTTGTGGTAGTTTGTGGTACTTGGGTGTGTAGAAGAAGCCTGATAATGT    -55.421 kcal/mol    -55.421 92  22-31:leftHandle>32-41:leftHandle>53-85:dsAdapterBottom
010be89c-ce49-4fa2-9112-9b282de5f16a    GTTTGTGGTACTTGGGTGTGTAGAAGAAGCCTAGCAATACGTTTTGTTGACGTATGCTGTTTGTGGTACTTGGGTGTGTAGAAAGAGCCTAGGCTTCTTCTAGCCCCCGCAAACGATGATT   -29.724 kcal/mol    -29.724 121 1-33:dsAdapterBottom>59-91:dsAdapterBottom

nanopore-sequencing ssDNA alignment gtf • 367 views

ADD COMMENT • link 9 months ago by Mar • 0

score 0 · Answer 1 · 2024-01-25

Without fully understanding every detail of what you've done or are trying to do, I have the following advice; 1) If the library is the same length and you don't expect indels, just process the reads without aligning to a reference. 2) If structure is important, convert the pairing to a binary NxN matrix and then you can compare matrices easily to detect consensus or divergent structures and their associated energies.