Problem with mapping rate while using MAGeCK to process CRISPR screen data
9 weeks ago
Luwell • 0

Hi! I'm a beginner in bioinformatics analysis. I want to analyze the RAWDATA from a CRISPR screen paired-end sequencing experiment.

The library structure used by the company is: 5' adapter1-Index2(i5)-primer1-insert fragment-primer2-Index1(i7)-adapter1'.

I used fastp and cutadapt to remove the known adapters and primers in R1 and R2, and obtained CLEANDATA. The reads are around 150bp in length, and most of the sgRNA sequences start from the 43rd position in R1, with a length of 20bp.

After that, I processed the CLEANDATA using MAGeCK with the code:

mageck count -l library.csv -n countA --fastq L1_R1.fq.gz --fastq-2 L1_R2.fq.gz

MAGeCK can automatically detect the position of the sgRNA and set trim-5 accordingly.

My problem is that the highest mapping rate for the CLEANDATA (6 samples) is only around 70%, and I'm unable to improve it. If I use cutadapt to trim the sequences on both sides of the sgRNA in the 150bp reads, the mapping rate might drop even further. Additionally, if I count directly from the RAWDATA, the mapping rate is around 69%.

I can't find an answer to this question.

Could this be a problem with the data provided by the company, or am I missing a crucial step in my processing? How can I improve it?

Thanks.

crispr screen CRISPR-screen mageck mapping • 580 views
ADD COMMENT

Have you checked whether either of the following helps?

a) use just the R1 file
b) hard-trim the reads so that the sgRNA starts within a few bases of the read start, again using just the R1 file

The R2 file is not going to add any additional information.

ADD REPLY

MAGeCK also tosses reads with any mismatches from your known sequences, so you could consider aligning yourself and feeding in a count matrix as mentioned in their tutorial, which may bump the numbers a bit.

From experience, we typically get 70-80% mapping reads via mageck count and the downstream analyses are fine.
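
To illustrate the "count it yourself and feed in a count matrix" route, here is a minimal Python sketch. It is not MAGeCK's own code. It assumes the library CSV has three columns (sgRNA id, sequence, gene), that the guide sits at a fixed 0-based offset in R1, and that the table MAGeCK reads via `mageck test -k` is tab-delimited with sgRNA, gene, then one column per sample. The function names and the one-substitution rescue are hypothetical choices for illustration; a mismatch-tolerant aligner could fill the same table.

```python
import csv
from collections import Counter


def read_library(handle):
    """Parse rows of 'sgRNA id, sequence, gene' into {sequence: (id, gene)}."""
    library = {}
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        sgrna_id, sequence, gene = (field.strip() for field in row[:3])
        library[sequence.upper()] = (sgrna_id, gene)
    return library


def one_mismatch_hit(candidate, library):
    """Return the library entry within one substitution, if it is unique."""
    hits = set()
    for i, original in enumerate(candidate):
        for base in "ACGT":
            if base != original:
                variant = candidate[:i] + base + candidate[i + 1:]
                if variant in library:
                    hits.add(variant)
    return library[hits.pop()] if len(hits) == 1 else None


def count_guides(sequences, library, start, length=20, allow_one_mismatch=False):
    """Count reads per sgRNA id using the window [start, start + length)."""
    counts = Counter()
    total = 0
    for seq in sequences:
        total += 1
        candidate = seq[start:start + length].upper()
        hit = library.get(candidate)
        if hit is None and allow_one_mismatch and len(candidate) == length:
            hit = one_mismatch_hit(candidate, library)
        if hit is not None:
            counts[hit[0]] += 1
    return counts, total


def write_count_table(out, library, counts_by_sample):
    """Write a tab-delimited table: sgRNA, Gene, then one column per sample."""
    samples = list(counts_by_sample)
    out.write("\t".join(["sgRNA", "Gene"] + samples) + "\n")
    for sgrna_id, gene in library.values():
        row = [sgrna_id, gene]
        row += [str(counts_by_sample[s][sgrna_id]) for s in samples]
        out.write("\t".join(row) + "\n")
```

To run it on real data, pass the sequence line of each FASTQ record (every fourth line, starting from the second) read through `gzip.open(path, "rt")`. Comparing `sum(counts.values()) / total` with and without `allow_one_mismatch` gives a rough idea of how many reads are lost to single-base errors alone.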

ADD REPLY

Thank you very much!

I previously used Bowtie2 with the code:

bowtie2 -x index -U A_R1.fq.gz -S A_R1.sam --score-min L,-1,-0.5 -N 1 --norc

However, the mapping rate was very low, only about 2%. I suspect this might be due to incomplete removal of the 5' adapter, leaving 1-3bp behind. I'm not sure how to set the trimming length properly in this case. If I use cutadapt with -a, the problem is that the adapter sequences are not exactly the same in every read.

If I just skip using Bowtie2 and directly proceed with the dataset that has around 70% mapped (as mentioned earlier), would this be a feasible method for downstream analysis?

ADD REPLY

Usually in CRISPR and shRNA-like screens you know precisely where your barcode starts, so you don't need to trim adapters blindly but can hard-crop your reads to span only the exact barcode sequence. For bowtie2, I use the options --end-to-end --very-sensitive --rdg 10000,10000 --rfg 10000,10000 --mp 10000,10000 so that only perfect end-to-end matches get a high mapping score. Reads with any gap or mismatch are penalized so heavily that they go unmapped.
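
A minimal sketch of the hard-crop step in Python, assuming the barcode always occupies the same window in R1. The function name `crop_fastq` is made up for this example, and the offset is an assumption: the 43rd base mentioned in the question would be `start=42` here (0-based), but that should be checked against the actual reads, especially after adapter trimming has shifted positions.

```python
def crop_fastq(in_handle, out_handle, start, length=20):
    """Copy FASTQ records, keeping only bases [start, start + length).

    Reads too short to cover the whole window are dropped.
    Returns (kept, dropped).
    """
    kept = dropped = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        seq = in_handle.readline().rstrip("\n")
        in_handle.readline()  # '+' separator line, rewritten below
        qual = in_handle.readline().rstrip("\n")
        if len(seq) < start + length:
            dropped += 1
            continue
        out_handle.write(header.rstrip("\n") + "\n")
        out_handle.write(seq[start:start + length] + "\n")
        out_handle.write("+\n")
        out_handle.write(qual[start:start + length] + "\n")
        kept += 1
    return kept, dropped
```

With gzipped input, both handles can be opened in text mode via `gzip.open(path, "rt")` and `gzip.open(path, "wt")`. The cropped file then contains 20bp reads that can be aligned with the strict bowtie2 settings above.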

ADD REPLY

Yes, I would not be overly concerned unless it's impacting your targeted representation and there's no remaining material for top-off sequencing.

ADD REPLY

The results using only R1 are also around 70%. I also tried trimming about 40bp from R1, since most sgRNAs are located in this region, so that the sgRNA starts at the 2nd, 3rd, or 4th base. However, the results didn't change much; in fact, the more I trimmed, the slightly lower the mapping rate became.

I guess this might be because a small number of sgRNAs start earlier in the read, and trimming makes them undetectable. So, I don't know what else I can do.
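
One way to check this guess is to tally where library guides actually begin in the reads. The sketch below is a rough diagnostic in Python, not part of MAGeCK; the function names and the `max_start` search limit are assumptions for illustration. Offsets are 0-based, and only exact matches to the library are considered.

```python
from collections import Counter


def guide_start_positions(sequences, guides, length=20, max_start=60):
    """Tally the first 0-based offset at which each read contains a guide.

    Reads with no exact guide match within the searched range are
    tallied under the key None.
    """
    positions = Counter()
    for seq in sequences:
        seq = seq.upper()
        found = None
        last_start = min(max_start, len(seq) - length)
        for start in range(0, last_start + 1):
            if seq[start:start + length] in guides:
                found = start
                break
        positions[found] += 1
    return positions


def summarize(positions):
    """Return (label, read count, percent of reads), most common first."""
    total = sum(positions.values())
    if total == 0:
        return []
    rows = []
    for start, n in positions.most_common():
        label = "no guide found" if start is None else f"offset {start}"
        rows.append((label, n, 100.0 * n / total))
    return rows
```

Here `guides` is a set of upper-case sgRNA sequences from the library file. If the start offsets are spread over several positions, a single fixed trim will always lose some reads. If around 30% of reads fall under "no guide found" at any offset, the unmapped reads probably do not contain a perfect library guide at all, and no amount of trimming would recover them.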

ADD REPLY
