I have a concatenated oligomer dataset that went wrong. It was supposed to have this structure:
adapter1-leftHandle-random6-mer-G-random6-mer-rightHandle-leftHandle-random6-mer-G-random6-mer-rightHandle-leftHandle-random6-mer-G-random6-mer-rightHandle-leftHandle-random6-mer-G-random6-mer-rightHandle..
But I was too naive to think it would work. And something went wrong during library prep and the structure is all messed-up and chimeric:
adapter1-leftHandle-randomStuff-adapter1-leftHandle-adapter1-leftHandle-random6-mer-G-random6-mer-rightHandle-leftHandle-leftHandle-randomStuff-adapter1-adapter2-adapter1 etc
For the next step in my project, I have to align these reads to a reference so that each read has the -MD tag in the alignment.bam file. I can then use these reads for the further step.
I have about 55 million reads. And their sizes range between 45 nucleotides to 4000 nucleotides. I can't just use a simple reference like
leftHandle-random6-mer-G-random6-mer-rightHandle
Because this will map only once to my read which might have more than one occurrence of this construct. And even when I did this, I couldn't figure out what kind of local alignment score matrix to use to ensure at-least aligning with one occurrence per read.
I'm not sure how to capture every occurrence of this construct through alignment.
- What kind of reference file do I make?
- What scoring matrix should I use?
I have tried using seqkit (with a maximum mismatch of 1 base) to locate the constituent parts of the oligo construct and that function gives me results in a .gtf file.
Example:
(base) bash-4.4$ cat fffb80be-82e6-4519-824d-470e02b06ce8.gtf
fffb80be-82e6-4519-824d-470e02b06ce8 SeqKit location 1 22 0 + . gene_id "dsAdapterTop";
fffb80be-82e6-4519-824d-470e02b06ce8 SeqKit location 23 32 0 + . gene_id "leftHandle";
fffb80be-82e6-4519-824d-470e02b06ce8 SeqKit location 60 69 0 + . gene_id "leftHandle";
fffb80be-82e6-4519-824d-470e02b06ce8 SeqKit location 83 94 0 + . gene_id "rightHandle";
Each read has it's own .gtf file to locate the presence of constituent components of the oligo construct. I was wondering if these files could be of use to try and find every occurrence of the desired construct with their location.
PS: I'm using these .gtf files along with the read .fasta to look at the "features" on IGV to understand the general structure of the reads I've sequenced.
Because nanopore reads are known to be erroneous, and because the library was ssDNA sequenced without any second-strand synthesis, the secondary structures formed by the ssDNA while being sequenced might have introduced mismatches/deletions/insertions. But using seqkit with an arbitary mismatch value will take a lot of time and memory because of the size of the dataset.
I'm confident that there are chimeric reads and secondary structures because of some reads forming structures like this:
Any advice on understanding the structure of the reads, to help re-assess the library prep steps is welcome.
Edit 1: I have solved summarising this information:
Step 1. collapse the columns in the gtf file to a single string containing the coordinates and the feature name only. Each feature following it's coordinates will be delimited by ">". Example:
1-33:dsAdapterBottom>34-43:leftHandle>57-70:rightHandle
Step 2. Randomly sample 1000 lines from the resulting file Step 3. Extract the fasta sequences from the master fasta file for these reads Step 4. Put these 1000 sequences through the zipfold algorithm on the UNAfold webserver to obtain Gibbs Free Energy values for the secondary structures formed by these reads Step 5. Get the secondary structures for a randomly samples 100 reads and visually analyse them on slides Step 6. Create an excel with the following format:
SeqID FASTA.Squence Zipfold dG (Kcal/mol) 20C dG.Numeric Seq.Len Features
001aa3f0-5809-43ee-a3af-65dca5266564 GTTTGTGGTACTTGGGTGTGTAGAAGAAGCCTTACCACAAACTTGTACGAATGATACAAGTTTGTGGTAAGGCTTCTTCTACACCCAAGTACCACAAACAGCAATAATTT -62.802 kcal/mol -62.802 110 1-33:dsAdapterBottom>33-42:leftHandle>90-99:leftHandle
0096b23b-fa24-4713-b228-58f635d15090 GTTTGTGGTACTTGGGTGTGTAGAAGAAGCCAGGCTTCTTCTACACGCCCAAGTGCCGCAAACAGCAATATTTTA -37.879 kcal/mol -37.879 75 1-33:dsAdapterBottom>32-53:dsAdapterTop
00a2065d-4b40-4219-bd4a-f7a3aede74ee GTTTGTGGTACTTGGGTGTGTAGAAGAAGCCTTACCACAAACTGGCTAGGGTGTG -10.958 kcal/mol -10.958 55 1-33:dsAdapterBottom>33-42:leftHandle
00d9cd7d-e545-425f-862c-b3c9d47faf2c AGGCTTTTCTACACACCCAAGTACCACAAACTACCACAAACCGTTTGTGGTAGTTTGTGGTACTTGGGTGTGTAGAAGAAGCCTGATAATGT -55.421 kcal/mol -55.421 92 22-31:leftHandle>32-41:leftHandle>53-85:dsAdapterBottom
010be89c-ce49-4fa2-9112-9b282de5f16a GTTTGTGGTACTTGGGTGTGTAGAAGAAGCCTAGCAATACGTTTTGTTGACGTATGCTGTTTGTGGTACTTGGGTGTGTAGAAAGAGCCTAGGCTTCTTCTAGCCCCCGCAAACGATGATT -29.724 kcal/mol -29.724 121 1-33:dsAdapterBottom>59-91:dsAdapterBottom