I want to analyse data from a CRISPR/Cas9 screen (control vs. treatment) using MAGeCK (https://sourceforge.net/projects/mageck/). MAGeCK calculates the trim length of reads automatically, but when the trim length is variable the program recommends using cutadapt.
I used cutadapt to remove the 5' sequence in front of the gRNA of interest (20 nt; there are 12 different guides), but some reads containing the complete gRNA are not trimmed because they carry large deletions. I tried increasing the maximum error rate of cutadapt, but the MAGeCK gRNA count obtained after trimming is still slightly smaller than the count obtained by searching for each gRNA with grep.
Is it possible to improve the trimming so that only the gRNA sequences remain (the 3' end does not need to be removed)?
This topic is similar to my problem: https://github.com/marcelm/cutadapt/issues/261
If I am understanding this right, you may be able to use bbduk.sh from the BBMap suite. Find the gRNA sequence using literal=gRNA_sequence ktrim=l. This will find the gRNA and then remove everything 5' of that hit.

The problem is that I have to analyse data generated with 60,000 gRNAs (all of them have to be searched for in each fastq). Is it possible to do that with so many gRNAs?
Should be possible. Put the sequences in a multi-FASTA file and feed it to bbduk.sh with ref=gRNA_file.fa. Bump the memory you assign by setting a higher number with -Xmx20g (for example).

This works (it finds the sequences), but it also removes the gRNA sequence itself... I would be grateful if you could answer again.
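For illustration, here is a minimal Python sketch (mine, not from the thread) of the behaviour wanted here: find a guide match in a read and trim only the bases 5' of it, keeping the guide itself. The guide and read sequences are made-up examples.

```python
# Sketch: trim everything 5' of a guide match but KEEP the guide itself
# (unlike bbduk ktrim=l, which removes the matched k-mer as well).
guides = {"ACGTACGTACGTACGTACGT", "TTTTAAAACCCCGGGGACGT"}  # toy guides

def trim_keep_guide(read, guides, k=20):
    """Return the read from the first guide match onward, or None."""
    for i in range(len(read) - k + 1):
        if read[i:i + k] in guides:
            return read[i:]          # keep the guide plus any 3' context
    return None                      # no guide found in this read

read = "GGGACGTACGTACGTACGTACGTNNN"  # 3 nt of junk, a guide, then 3' tail
print(trim_keep_guide(read, guides))  # -> ACGTACGTACGTACGTACGTNNN
```

This is quadratic per read and only a sketch of the logic; a set lookup per window is still fast enough for a quick sanity check against the bbduk output.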
Take a look at the BBDuk guide to see if you can make it work for your use case. It sounds like you want to keep the gRNA sequence, so this is not quite the intended use for bbduk in trim mode.

Yes, that's the problem, but I took a look at the guide and didn't find any option to do what I want.
This topic is almost 5 years old, but I am facing the same (or at least a very similar) issue.
I have NGS files from a CRISPR screen that I want to analyse for guide abundance. I tried three options in general; two are shown here:
1) I used cutadapt with different options.

2) I also checked bbduk.sh. This seems to be the most straightforward way, but for some reason it never searches for all guides. The guides are provided in a FASTA file, but I always get the output "X kmers added", and X is always lower than it should be (e.g. 77362 instead of 77441 for the Brunello library), even though py_fasta_validator returns 0.

I would not need the trimmed reads, because the stats file already has all the counts in it. bbduk finds the most guides, but it does not search for all of them. Also, spaces in the path do not work for bbduk.sh. I tried escaping them with \ and quoting the path, but neither works.

Is there a way for cutadapt to check whether only one or the other adapter is present? Is there a way to check why bbduk.sh does not recognize all kmers from the FASTA file? Is there a gold standard for counting guides from NGS files when the guide starts at different positions? Letting MAGeCK decide for itself does not work well in this case.
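One common approach to position-independent counting (a sketch under my own assumptions, not something from the thread): put all guide sequences in a dict and slide a guide-length window across each read. The library and reads below are toy examples, not real Brunello data.

```python
from collections import Counter

K = 20  # guide length
library = {                       # hypothetical guide names and sequences
    "g1": "AAAACCCCGGGGTTTTAAAA",
    "g2": "ACGTACGTACGTACGTACGT",
}
lookup = {seq: name for name, seq in library.items()}

def count_guides(reads, lookup, k=K):
    """Count one guide hit per read, wherever the guide starts."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):   # slide a k-wide window
            name = lookup.get(read[i:i + k])
            if name is not None:
                counts[name] += 1
                break                        # at most one guide per read
    return counts

reads = ["TTACGTACGTACGTACGTACGTGG", "AAAACCCCGGGGTTTTAAAA", "GGGG"]
print(count_guides(reads, lookup))
```

Because the dict lookup is O(1), this scales to tens of thousands of guides; only the read length, not the library size, drives the per-read cost.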
Can you try lowering the k to k=9?

You don't have any duplicate sequences in your library file? That may be one reason why bbduk is reporting fewer kmers.
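The duplicate hypothesis is easy to check with a small Python sketch (the FASTA parser below is minimal but handles wrapped sequence lines; the records are made up):

```python
from collections import Counter

def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:], []
        else:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)

def duplicated_seqs(records):
    """Return sequences that occur more than once, with their counts."""
    counts = Counter(seq.upper() for _, seq in records)
    return {s: n for s, n in counts.items() if n > 1}

fasta = [">g1", "ACGTACGTACGTACGTACGT",
         ">g2", "TTTTAAAACCCCGGGGACGT",
         ">g3", "ACGTACGTACGTACGTACGT"]  # g3 duplicates g1's sequence
print(duplicated_seqs(read_fasta(fasta)))  # -> {'ACGTACGTACGTACGTACGT': 2}
```

If duplicates (or reverse-complement pairs) exist, a k-mer-based tool would store fewer unique k-mers than there are records, which would match the "77362 instead of 77441" observation.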
Are you using a scheme where there is a variable-length spacer on the 5' side of the gRNA? If so, you could trim that sequence off using ktrim=l and then use MAGeCK on the left-trimmed reads. Have you tried it?
Wouldn't that mess with the results, since all my kmers are 20 bp?
No, I don't. I double-checked that. I also checked the names, and they are all unique.
That's basically what I try with cutadapt, but yes, maybe I could give bbduk.sh a chance for that if I can figure out the FASTA issue.

Yes, the results are okay-ish, but some guides are not found this way.
In general, all approaches are okay-ish and show similar overall results, but I feel there must be a good way to capture all guides that are in the fastq file.
I don't think every guide is supposed to show up in these experiments, as far as I have seen. You will get some guides with 0 counts, as you do with RNA-seq data.
Probably not. You are looking to get that good initial match, which is then going to be extended.
I think you got me wrong here. Sure, you will lose some guides during the whole process from library cloning to the final analysis. I just wanted to say that if you look at a fastq file, you should be able to tell definitively for every read whether it has a perfect match for a guide in it, and which guide it is. So there should be a deterministic correct count for every guide.
Hmm. So what I am trying to do with bbduk is to search directly for my guides, not for an adapter. If I reduce k, it gives me tons of kmers, more than 10,000 for k=19. That's not really what I want with my strategy.

Not every read in your data is going to match a guide. There will be a set of odd reads that don't seem to match the expected pattern. Is your data single-end or paired-end?
MAGeCK will accept paired-end data (which should overlap).

Got it.
That is also true, for sure. Let me clarify what I mean. For every read, you should be able to decide whether it has a guide in it or not. If not, you just discard the whole read. If it does, you add 1 to the total count of that guide. That's what I mean by a "deterministic correct count for every guide". This could simply be done with two for loops in Python (or any other language), like:
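The code snippet from this post did not survive formatting; a minimal sketch of the two-for-loop idea described above (the guides and reads are toy examples, not real library sequences):

```python
# Outer loop over reads, inner loop over guides, exact substring match.
guides = {"ACGTACGTACGTACGTACGT": 0, "AAAACCCCGGGGTTTTAAAA": 0}  # guide -> count

reads = ["TTACGTACGTACGTACGTACGTGG", "CCAAAACCCCGGGGTTTTAAAA", "GGGGGG"]

for read in reads:                 # loop 1: every read
    for guide in guides:           # loop 2: every guide
        if guide in read:          # perfect match anywhere in the read
            guides[guide] += 1
            break                  # count at most one guide per read
print(guides)  # -> {'ACGTACGTACGTACGTACGT': 1, 'AAAACCCCGGGGTTTTAAAA': 1}
```

As written this is O(reads x guides) and would be slow for 77k guides over millions of reads, but it defines the "deterministic" count the poster describes.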
Of course, you should only find one guide per read, and it's not really optimized for efficiency and so on, but this seems so simple that I wonder why there is no implementation of it. And the chance of finding a guide by random in your NGS data is pretty low, I'd say.
No, I have single-end data :).
If you know what the boundaries of your construct look like, then trim the left end of the read using the tag on that end (ktrim=l). Then do the same for the right end in a separate bbduk.sh step (ktrim=r), using the reads remaining after step 1. At the second step, only keep reads that are exactly 20 bp (or however long your gRNAs are) after these two trimming operations.

You can then use clumpify.sh to get counts for the unique sequences that remain. Compare them with your guide reference to see what is present.
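The two-step trim plus length filter can be sketched in Python, with plain string matching standing in for the bbduk.sh steps and a Counter standing in for clumpify.sh. The flanking tag sequences and reads below are hypothetical examples, not from the thread.

```python
from collections import Counter

# Hypothetical 5' and 3' constant sequences flanking the 20 nt guide.
LEFT_TAG, RIGHT_TAG = "TTGTGGAAAGGACGAAACACCG", "GTTTTAGAGCTAGAAATAGCAA"
GUIDE_LEN = 20

def extract_guide(read):
    """Trim left tag, then right tag; keep the middle only if 20 bp."""
    i = read.find(LEFT_TAG)
    if i == -1:
        return None
    rest = read[i + len(LEFT_TAG):]          # step 1: ktrim=l analogue
    j = rest.find(RIGHT_TAG)
    if j == -1:
        return None
    guide = rest[:j]                         # step 2: ktrim=r analogue
    return guide if len(guide) == GUIDE_LEN else None  # length filter

reads = [
    "NN" + LEFT_TAG + "ACGTACGTACGTACGTACGT" + RIGHT_TAG + "NN",
    "NN" + LEFT_TAG + "ACGT" + RIGHT_TAG,    # too short after trimming
]
counts = Counter(filter(None, map(extract_guide, reads)))  # clumpify analogue
print(counts)  # -> Counter({'ACGTACGTACGTACGTACGT': 1})
```

Unlike the real tools, this allows no mismatches in the tags; the point is only to show how the length filter after double trimming rejects reads with deletions inside the guide.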