Hi! I'm a beginner in bioinformatics analysis. I want to analyze the RAWDATA from a paired-end sequencing run of a CRISPR screen.
The library structure used by the company is: 5' adapter1-Index2(i5)-primer1-insert fragment-primer2-Index1(i7)-adapter1'.
I used fastp and cutadapt to remove the known adapters and primers from R1 and R2 and obtained CLEANDATA. The reads are around 150 bp long, and most of the sgRNA sequences start at position 43 in R1 and are 20 bp long.
After that, I processed the CLEANDATA using MAGeCK with the code:
mageck count -l library.csv -n countA --fastq L1_R1.fq.gz --fastq-2 L1_R2.fq.gz
MAGeCK can automatically detect the position of the sgRNA and choose the 5' trimming length (trim-5).
My problem is that the highest mapping rate for CLEANDATA (6 samples) is only around 70%, and I'm unable to improve on it. If I use cutadapt to trim the sequences on both sides of the sgRNA in the 150 bp reads, the mapping rate might drop even further. Additionally, if I count directly from the RAWDATA, the mapped percentage is around 69%.
I can't find an answer to this question anywhere.
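To check the sgRNA position by hand instead of relying only on auto-detection, one option is to tally where a known library sequence first appears in each read. The following is a minimal sketch, not part of MAGeCK: the function name `sgrna_start_positions` and the toy guides and reads are made up for illustration, and it only looks for exact 20 bp matches.

```python
from collections import Counter

def sgrna_start_positions(reads, guides, guide_len=20):
    """Tally the 0-based offset at which a known guide first appears in each read."""
    guide_set = set(guides)
    offsets = Counter()
    for read in reads:
        hit = None
        for i in range(len(read) - guide_len + 1):
            if read[i:i + guide_len] in guide_set:
                hit = i
                break
        offsets[hit] += 1  # key None collects reads with no exact guide match
    return offsets

# Toy data (hypothetical sequences, not from the real library)
guides = ["ACGTACGTACGTACGTACGT", "TTTTCCCCGGGGAAAATTTT"]
reads = [
    "NN" + guides[0] + "GTTTTAGAGC",
    "NNN" + guides[1] + "GTTTTAGAGC",
    "NN" + guides[1] + "GTTTTAGAGC",
    "GATTACA" * 5,
]
offsets = sgrna_start_positions(reads, guides)
print(offsets)
```

On real data the reads would be the sequence lines of the R1 FASTQ (a subsample of a few thousand reads is usually enough). A tight peak at one offset suggests a fixed crop will work; a spread over several offsets suggests staggered reads, and a large `None` bin points to reads that carry no exact library sequence at all.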
Could this be a problem with the data provided by the company, or am I missing a crucial step in my processing? How can I improve it?
Thanks.
Have you tried the following to see if it helps?
a) Use just the R1 file.
b) Hard-trim the R1 reads so that the sgRNA starts within the first few bases.
R2 file is not going to add any additional information.
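The hard trim in (b) could be sketched roughly as below. This is only an illustration under assumptions: the function name `hard_crop_fastq` is made up, the FASTQ is assumed to use four lines per record, and `start=42` is the 0-based equivalent of "position 43" from the question.

```python
import io

def hard_crop_fastq(in_handle, out_handle, start, length=20):
    """Keep bases [start, start + length) of each record; drop reads that are too short."""
    kept = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline()
        qual = in_handle.readline().rstrip("\n")
        if len(seq) < start + length:
            continue  # read does not span the whole sgRNA window
        out_handle.write(header)
        out_handle.write(seq[start:start + length] + "\n")
        out_handle.write(plus)
        out_handle.write(qual[start:start + length] + "\n")
        kept += 1
    return kept

# Toy input: one read long enough to crop, one that is too short
demo = (
    "@r1\n" + "A" * 42 + "ACGTACGTACGTACGTACGT" + "GTTT\n+\n" + "I" * 66 + "\n"
    "@r2\nACGT\n+\nIIII\n"
)
out = io.StringIO()
kept = hard_crop_fastq(io.StringIO(demo), out, start=42)
print(kept, out.getvalue())
```

For the real files the handles could come from `gzip.open("L1_R1.fq.gz", "rt")` and `gzip.open(..., "wt")`. Note that a fixed crop only helps if nearly all reads share the same offset.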
MAGeCK also tosses reads with any mismatches from your known sequences, so you could consider aligning yourself and feeding in a count matrix as mentioned in their tutorial, which may bump the numbers a bit.
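As a rough illustration of counting yourself, the sketch below assigns already-cropped 20 bp reads to guides while tolerating a single mismatch, then writes a tab-delimited table with sgRNA, Gene and one column per sample. That layout is what I understand MAGeCK accepts as a count table (for example via `mageck test -k`), but check the tutorial before relying on it. The one-mismatch rule, the function names and the toy library are my assumptions, not MAGeCK's behaviour.

```python
import csv
import io
from collections import Counter

def build_lookup(library, alphabet="ACGTN"):
    """Map each guide and its unambiguous 1-mismatch variants to the guide id."""
    lookup = {seq: gid for seq, (gid, _gene) in library.items()}
    ambiguous = set()
    for seq, (gid, _gene) in library.items():
        for i, base in enumerate(seq):
            for alt in alphabet:
                if alt == base:
                    continue
                variant = seq[:i] + alt + seq[i + 1:]
                if variant in library:
                    continue  # an exact library sequence always wins
                if variant in lookup and lookup[variant] != gid:
                    ambiguous.add(variant)  # within 1 mismatch of two guides
                else:
                    lookup[variant] = gid
    for variant in ambiguous:
        lookup.pop(variant, None)
    return lookup

def count_guides(cropped_reads, lookup):
    """Count reads per guide id; reads absent from the lookup are ignored."""
    return Counter(lookup[r] for r in cropped_reads if r in lookup)

def write_count_table(out_handle, library, sample_counts):
    """Write sgRNA, Gene, then one count column per sample (tab-delimited)."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    samples = list(sample_counts)
    writer.writerow(["sgRNA", "Gene"] + samples)
    for gid, gene in library.values():
        writer.writerow([gid, gene] + [sample_counts[s][gid] for s in samples])

# Toy library: sequence -> (sgRNA id, gene); all values are hypothetical
library = {
    "ACGTACGTACGTACGTACGT": ("sg1", "GENE1"),
    "TTTTCCCCGGGGAAAATTTT": ("sg2", "GENE2"),
}
reads = [
    "ACGTACGTACGTACGTACGT",  # exact sg1
    "ACGTACGTACGTACGTACGA",  # sg1 with one mismatch
    "TTTTCCCCGGGGAAAATTTT",  # exact sg2
    "GGGGGGGGGGGGGGGGGGGG",  # matches nothing
]
counts = count_guides(reads, build_lookup(library))
table = io.StringIO()
write_count_table(table, library, {"L1": counts})
print(table.getvalue())
```

Comparing the exact-only and one-mismatch totals on a subsample gives a quick estimate of how much of the unmapped 30% is recoverable this way.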
From experience, we typically get 70-80% mapped reads via mageck count, and the downstream analyses are fine.
Thank you very much!
I previously used Bowtie2 with the code:
However, the mapping rate was very low, only about 2%. I suspect this might be due to incomplete removal of the 5' adapter, leaving 1-3 bp behind. I'm not sure how to set the trimming length properly in this case. If I use cutadapt with the -a option, the adapter sequences are not exactly the same.
If I just skip using Bowtie2 and directly proceed with the dataset that has around 70% mapped (as mentioned earlier), would this be a feasible method for downstream analysis?
Usually in CRISPR and shRNA-like screens you precisely know where your barcode starts, so you don't need to trim adapters blindly but can hard-crop your reads to only span the exact barcode sequence. For bowtie2, I use the options
--end-to-end --very-sensitive --rdg 10000,10000 --rfg 10000,10000 --mp 10000,10000
which means only perfect end-to-end matches get a high mapping score. Any gaps or mismatches are penalized heavily, and those reads will go unmapped.
Yes, I would not be overly concerned unless it's impacting your targeted representation and there's no remaining material for top-off sequencing.
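For the Bowtie2 route mentioned above, the reference is the sgRNA library itself, so the library file has to be turned into a FASTA first. Here is a minimal sketch assuming a comma-separated library whose first two columns are the sgRNA id and the sequence; the function name `library_to_fasta` and the toy rows are hypothetical.

```python
import csv
import io

def library_to_fasta(csv_handle, fasta_handle):
    """Write one FASTA record per library row (assumed columns: id, sequence, ...)."""
    n = 0
    for row in csv.reader(csv_handle):
        if len(row) < 2:
            continue
        guide_id, seq = row[0].strip(), row[1].strip().upper()
        if not seq or set(seq) - set("ACGTN"):
            continue  # skips a header row or a malformed sequence
        fasta_handle.write(f">{guide_id}\n{seq}\n")
        n += 1
    return n

# Toy library rows (hypothetical ids, sequences and genes)
demo_csv = "sg1,ACGTACGTACGTACGTACGT,GENE1\nsg2,TTTTCCCCGGGGAAAATTTT,GENE2\n"
fasta = io.StringIO()
n_written = library_to_fasta(io.StringIO(demo_csv), fasta)
print(n_written, fasta.getvalue())
```

The resulting FASTA can be indexed with bowtie2-build. End-to-end alignment against 20 bp references only makes sense for reads that have been cropped down to the sgRNA itself, which may be one reason an uncropped run maps poorly.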
The results using only R1 are also around 70%. I also tried trimming about 40 bp from the start of R1, since most sgRNAs are located in this region, so that the sgRNA starts at the 2nd, 3rd, or 4th base. However, the results didn't change much, and in fact the more I trimmed, the slightly lower the mapping rate became.
I guess this might be because a few sgRNAs are located earlier in the read, and trimming removes part of them so they can no longer be detected. So I don't know what else I can do.