I want to analyse data from a CRISPR/Cas9 screen (control vs. treatment) using MAGeCK (https://sourceforge.net/projects/mageck/). MAGeCK calculates the trim length of reads automatically, but when the trim length is variable the program recommends using cutadapt.
I used cutadapt to remove the 5' sequence in front of the gRNA of interest (20 nt; there are 12 different guides), but some reads containing the complete gRNA are not trimmed because they carry large deletions. I tried increasing the maximum error rate of cutadapt, but the MAGeCK gRNA count obtained after trimming is still slightly smaller than the count obtained by searching for each gRNA with grep.
Is it possible to improve the trimming so that only the gRNA sequences remain (the 3' end does not need to be removed)?
This topic is similar to my problem: https://github.com/marcelm/cutadapt/issues/261
If I am understanding this right, you may be able to use bbduk.sh from the BBMap suite. Find the gRNA sequence using literal=gRNA_sequence ktrim=l. This will find the gRNA and then remove everything 5' of that hit.

The problem is that I have to analyse data generated with 60,000 gRNAs (all of them have to be searched for in each fastq). Is it possible to do that with so many gRNAs?
Should be possible. Put the sequences in a multi-FASTA file and feed it to bbduk.sh with ref=gRNA_file.fa. Bump the memory you assign by setting a higher number with -Xmx20g (for example).

This works (it finds the sequences), but it also removes the gRNA sequence itself... I would be grateful if you could answer again.
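For illustration, here is a minimal Python sketch (mine, not from the thread) of the behaviour wanted here: find a guide match in a read and trim only the bases 5' of it, keeping the guide itself. The guide and read sequences are made-up examples.

```python
# Sketch: trim everything 5' of a guide match but KEEP the guide itself
# (unlike bbduk ktrim=l, which removes the matched k-mer as well).
guides = {"ACGTACGTACGTACGTACGT", "TTTTAAAACCCCGGGGACGT"}  # toy guides

def trim_keep_guide(read, guides, k=20):
    """Return the read from the first guide match onward, or None."""
    for i in range(len(read) - k + 1):
        if read[i:i + k] in guides:
            return read[i:]          # keep the guide plus any 3' context
    return None                      # no guide found in this read

read = "GGGACGTACGTACGTACGTACGTNNN"  # 3 nt of junk, a guide, then 3' tail
print(trim_keep_guide(read, guides))  # -> ACGTACGTACGTACGTACGTNNN
```

This is quadratic per read and only a sketch of the logic; a set lookup per window is still fast enough for a quick sanity check against the bbduk output.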
Take a look at the BBDuk guide to see if you can make it work for your use case. It sounds like you want to keep the gRNA sequence, so this is not quite the intended use for bbduk in trim mode.

Yes, that's the problem, but I took a look at the guide and didn't find any option to do what I want.
This topic is almost 5 years old, but I am facing the same (or at least a very similar) issue.
I have NGS files from a CRISPR screen that I want to analyse for guide abundance. I tried three options in general; two are shown here:
1) I used cutadapt with different options.

2) I also checked bbduk.sh. This seems to be the most straightforward way, but for some reason it never searches for all guides. The guides are provided in a FASTA file, but I always get the output "X kmers added", and X is always lower than it should be (e.g. 77362 instead of 77441 for the Brunello library), even though py_fasta_validator returns 0.

I would not need the trimmed reads, because the stats file already has all the counts in it. bbduk finds the most guides, but it does not search for all of them. Also, spaces in the path do not work for bbduk.sh. I tried escaping them with \ and quoting the path, but neither works.

Is there a way for cutadapt to check whether only one or the other adapter is present? Is there a way to check why bbduk.sh does not recognize all kmers from the FASTA file? Is there a gold standard for counting guides from NGS files when the guide starts at different positions? Letting MAGeCK decide for itself does not work well in this case.
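One common approach to position-independent counting (a sketch under my own assumptions, not something from the thread): put all guide sequences in a dict and slide a guide-length window across each read. The library and reads below are toy examples, not real Brunello data.

```python
from collections import Counter

K = 20  # guide length
library = {                       # hypothetical guide names and sequences
    "g1": "AAAACCCCGGGGTTTTAAAA",
    "g2": "ACGTACGTACGTACGTACGT",
}
lookup = {seq: name for name, seq in library.items()}

def count_guides(reads, lookup, k=K):
    """Count one guide hit per read, wherever the guide starts."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):   # slide a k-wide window
            name = lookup.get(read[i:i + k])
            if name is not None:
                counts[name] += 1
                break                        # at most one guide per read
    return counts

reads = ["TTACGTACGTACGTACGTACGTGG", "AAAACCCCGGGGTTTTAAAA", "GGGG"]
print(count_guides(reads, lookup))
```

Because the dict lookup is O(1), this scales to tens of thousands of guides; only the read length, not the library size, drives the per-read cost.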
Can you try lowering the k to k=9?

You don't have any duplicate sequences in your library file? That may be one reason why bbduk is reporting fewer kmers.
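The duplicate hypothesis is easy to check with a small Python sketch (the FASTA parser below is minimal but handles wrapped sequence lines; the records are made up):

```python
from collections import Counter

def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:], []
        else:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)

def duplicated_seqs(records):
    """Return sequences that occur more than once, with their counts."""
    counts = Counter(seq.upper() for _, seq in records)
    return {s: n for s, n in counts.items() if n > 1}

fasta = [">g1", "ACGTACGTACGTACGTACGT",
         ">g2", "TTTTAAAACCCCGGGGACGT",
         ">g3", "ACGTACGTACGTACGTACGT"]  # g3 duplicates g1's sequence
print(duplicated_seqs(read_fasta(fasta)))  # -> {'ACGTACGTACGTACGTACGT': 2}
```

If duplicates (or reverse-complement pairs) exist, a k-mer-based tool would store fewer unique k-mers than there are records, which would match the "77362 instead of 77441" observation.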
Are you using a scheme where there is a variable-length spacer on the 5' side of the gRNA? If so, you could trim that sequence off using ktrim=l and then use MAGeCK on the left-trimmed reads. Have you tried it?
Wouldn't that mess with the results, since all my kmers are 20 bp?
No, I don't. I double-checked that. I also checked the names, and they are all unique.
That's basically what I try with cutadapt, but yes, maybe I could give bbduk.sh a chance for that if I can figure out the FASTA issue.

Yes, the results are okay-ish, but some guides are not found this way.
In general, all approaches are okay-ish and show similar overall results, but I feel there must be a good way to capture all guides that are in the fastq file.
I don't think every guide is supposed to show up in these experiments, as far as I have seen. You will get some guides with 0 counts, as you do with RNA-seq data.
Probably not. You are looking to get that good initial match, which is then going to be extended.
I think you got me wrong here. Sure, you will lose some guides during the whole process from library cloning to the final analysis. I just wanted to say that if you look at a fastq file, you should be able to tell definitively for every read whether it has a perfect match for a guide in it, and which guide it is. So there should be a deterministic correct count for every guide.
Hmm. So what I am trying to do with bbduk is to search directly for my guides, not for an adapter. If I reduce k, it gives me tons of kmers, more than 10,000 for k=19. That's not really what I want with my strategy.

Not every read in your data is going to match a guide. There will be a set of odd reads that don't seem to match the expected pattern. Is your data single-end or paired-end?
MAGeCK will accept paired-end data (which should overlap).

Got it.
That is also true, for sure. Let me clarify what I mean. For every read, you should be able to decide whether it has a guide in it or not. If not, you just discard the whole read. If it does, you add 1 to the total count of that guide. That's what I mean by a "deterministic correct count for every guide". This could simply be done with two for loops in Python (or any other language), like:
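The code snippet from this post did not survive formatting; a minimal sketch of the two-for-loop idea described above (the guides and reads are toy examples, not real library sequences):

```python
# Outer loop over reads, inner loop over guides, exact substring match.
guides = {"ACGTACGTACGTACGTACGT": 0, "AAAACCCCGGGGTTTTAAAA": 0}  # guide -> count

reads = ["TTACGTACGTACGTACGTACGTGG", "CCAAAACCCCGGGGTTTTAAAA", "GGGGGG"]

for read in reads:                 # loop 1: every read
    for guide in guides:           # loop 2: every guide
        if guide in read:          # perfect match anywhere in the read
            guides[guide] += 1
            break                  # count at most one guide per read
print(guides)  # -> {'ACGTACGTACGTACGTACGT': 1, 'AAAACCCCGGGGTTTTAAAA': 1}
```

As written this is O(reads x guides) and would be slow for 77k guides over millions of reads, but it defines the "deterministic" count the poster describes.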
Of course, you should only find one guide per read, and it's not really optimized for efficiency and so on, but this seems so simple that I wonder why there is no implementation of it. And the chance of finding a guide by random in your NGS data is pretty low, I'd say.
No, I have single-end data :).
If you know what the boundaries of your construct look like, then trim the left end of the read using the tag on that end (ktrim=l). Then do the same for the right end in a separate bbduk.sh step (ktrim=r), using the reads remaining after step 1. At the second step, only keep reads that are exactly 20 bp (or however long your gRNAs are) after these two trimming operations.

You can then use clumpify.sh to get counts for the unique sequences that remain. Compare them with your guide reference to see what is present.
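The two-step trim plus length filter can be sketched in Python, with plain string matching standing in for the bbduk.sh steps and a Counter standing in for clumpify.sh. The flanking tag sequences and reads below are hypothetical examples, not from the thread.

```python
from collections import Counter

# Hypothetical 5' and 3' constant sequences flanking the 20 nt guide.
LEFT_TAG, RIGHT_TAG = "TTGTGGAAAGGACGAAACACCG", "GTTTTAGAGCTAGAAATAGCAA"
GUIDE_LEN = 20

def extract_guide(read):
    """Trim left tag, then right tag; keep the middle only if 20 bp."""
    i = read.find(LEFT_TAG)
    if i == -1:
        return None
    rest = read[i + len(LEFT_TAG):]          # step 1: ktrim=l analogue
    j = rest.find(RIGHT_TAG)
    if j == -1:
        return None
    guide = rest[:j]                         # step 2: ktrim=r analogue
    return guide if len(guide) == GUIDE_LEN else None  # length filter

reads = [
    "NN" + LEFT_TAG + "ACGTACGTACGTACGTACGT" + RIGHT_TAG + "NN",
    "NN" + LEFT_TAG + "ACGT" + RIGHT_TAG,    # too short after trimming
]
counts = Counter(filter(None, map(extract_guide, reads)))  # clumpify analogue
print(counts)  # -> Counter({'ACGTACGTACGTACGTACGT': 1})
```

Unlike the real tools, this allows no mismatches in the tags; the point is only to show how the length filter after double trimming rejects reads with deletions inside the guide.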