Hi,
I am a new graduate student in biology and am relatively new to sequencing in general. I am planning a genome-wide CRISPR screen that runs over many days. I plan to extract genomic DNA from each condition and amplify the sgRNA region with primers that include the Illumina adaptors, with a barcode on the reverse primer. I will pool all of these barcoded samples together and run them on a NovaSeq with paired-end 100 bp sequencing. The barcode is most certainly within the first 100 bp of the reverse read, and with paired-end data I should be able to tell which forward read it corresponds to.
I recently sent the library off for sequencing, using the exact same sequencing parameters, to determine the representation of each sgRNA in the library. Unfortunately, most of the primer, including the barcode, was not present in the reverse read; only the part that annealed to the plasmid backbone was there. However, I told the core that sequenced my samples which index I used, and that index was in the information line of each record in the fastq file. The core informed me they did no preprocessing of the reads.
So my question: is there any way for the Illumina sequencing machine to know which index is present if it isn't present in the read? Also, my PCR product size corresponds to the whole P7 primer being present in the product, so why isn't some of the reverse primer present in the sequence reads? Do I need to increase the read length when I sequence my screen in order to demultiplex?
Thanks in advance for any help!
No, there isn't, not if that information is absent from both the main reads and the index reads. It sounds like you did not use the standard Illumina indexing scheme, in which the indexes are read as separate index reads (they are not part of the main sequence reads for any type of run).
Did you have your constructs validated by your sequencing core, or by someone else knowledgeable about Illumina sequencing, before you went ahead with the experiment?
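One quick way to see what the header lines actually carry is to tally the index field across a fastq file. Below is a minimal Python sketch, assuming the common Illumina (CASAVA 1.8 and later) header layout, where the part after the space looks like read:filtered:control:index. The example header in the comment is made up for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def index_from_header(header: str) -> Optional[str]:
    """Return the index field of an Illumina-style FASTQ header, or None.

    Assumes a header such as (made-up example):
    @A00123:45:HABC:1:1101:1000:1000 2:N:0:ACGTACGT
    """
    parts = header.strip().split(" ", 1)
    if len(parts) < 2:
        return None  # no comment field after the read name
    fields = parts[1].split(":")
    # Expected comment layout: read number, filter flag, control bits, index
    if len(fields) >= 4 and fields[3]:
        return fields[3]
    return None


def tally_indexes(fastq_lines: Iterable[str]) -> Counter:
    """Count index values over the header lines (every 4th line) of a FASTQ."""
    counts: Counter = Counter()
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 == 0:
            counts[index_from_header(line)] += 1
    return counts
```

If every record reports the same index, that would be consistent with the value coming from an index read or from the run setup rather than from anything in R1 or R2; the core should be able to confirm which.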
It may be possible to salvage this data but we would need to know specific details about your construct, locations of the indexes, sequencing primer and how the sequencing was done.
Yes, the constructs were validated: I got them from Addgene. The backbone is the commonly used lentiGuide-Puro plasmid, with sgRNAs inserted at known relative levels based on previous validation. I amplified the library and was trying to determine the new distribution of sgRNAs within it. I am pretty sure the primers and scheme we are using are widely used (we followed a protocol published on Addgene and from the Broad Institute). I only received the fastq files, no index sequence files. A screenshot of the P5 and P7 primers I used is below. The final PCR product is 354 bp, and the P5 and P7 primers are each about 100 bp.
Good to know that this is not a full custom design but a standard commercial protocol. Based on the info above, you know where the sequencing primer is, so your reads should start at the base immediately after the sequencing primer site. Once you confirm that is the case, you could then separate your reads using the stagger sequences (you should be able to use bbduk.sh from BBTools in filter mode). Show us a couple of sequence examples from R1 and R2. Also, a link to the Broad/Addgene protocol may be helpful.
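To illustrate the idea of separating reads by what they start with, here is a rough plain-Python sketch. It is not what bbduk.sh does internally, only the same concept: each read is assigned to a group by matching a known sequence at the start of the read. The prefixes in PREFIXES are placeholders, not the real stagger or barcode sequences from the Broad/Addgene primers, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Placeholder prefixes for illustration only; substitute the real
# stagger/barcode sequences from your own primer designs.
PREFIXES: Dict[str, str] = {
    "sample_A": "ACGTAC",
    "sample_B": "TGCATG",
}


def assign_by_prefix(seq: str, prefixes: Dict[str, str],
                     max_mismatch: int = 0) -> Optional[str]:
    """Return the label whose prefix matches the start of seq.

    Returns None if no prefix matches, or if more than one matches
    (an ambiguous read is safer left unassigned).
    """
    hit = None
    for label, prefix in prefixes.items():
        window = seq[:len(prefix)]
        if len(window) < len(prefix):
            continue  # read is shorter than this prefix
        mismatches = sum(a != b for a, b in zip(window, prefix))
        if mismatches <= max_mismatch:
            if hit is not None:
                return None  # ambiguous: two prefixes match
            hit = label
    return hit


def split_reads(records: Iterable[Tuple[str, str]],
                prefixes: Dict[str, str]) -> Dict[Optional[str], List[str]]:
    """Group (name, sequence) records by the label assigned to each sequence."""
    groups: Dict[Optional[str], List[str]] = {}
    for name, seq in records:
        groups.setdefault(assign_by_prefix(seq, prefixes), []).append(name)
    return groups
```

Reads that land in the None group are the ones to look at first, since they show whether the reads really begin where you expect them to.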