look for sequences containing a specific motif
1
0
Entering edit mode
6.9 years ago

Hi everyone,

I am sure this is rather easy to do but I'm no bioinformatician and I've been struggling with this for the last few days so here I am finally asking for your help/advices.

I have generated RNA-seq data with the minion from nanopore and I'm now trying to analyze them. From the fastq file I obtained, i performed alignment of my reads with minimap2 onto my genome of reference and obtained BAM/SAM files. Now, I am trying to determine the number of reads aligned on the genome that contains a specific motif of 24bp (normally located at the 5' end of all of my RNAs). I was able to write a very basic python script to search for the EXACT sequence (or the reverse complement) but could only get something like 50k hits among 1.5millions reads (stored in a multi-fasta file).

    from Bio.Seq import Seq
from Bio import SeqIO

SEQ = Seq("XXXXXXXXXXXXXXXX") #Sequence to search for
SEQrev = SEQ.reverse_complement() 

reads = [] 

for rec in SeqIO.parse("reads.fasta","fasta"): 
    if rec.seq.count(SEQ) == 1:              
        reads.append(rec)             
    if rec.seq.count(SEQrev) == 1:
        reads.append(rec)

print("there is " + str(len(reads)) + " sequences containing SEQ")

Knowing the minion is prone to errors I would like to find a way to search for my sequence allowing mismatches. I've been looking at others posts and trying things like running a local blast (my multi-fasta file containing all my reads being my database) or using ClustalW but none seem to have worked so far (or I failed to use them correctly).

Could you give me any advice on how to do that easily or guide me toward a script that could allow me to do it ? If you tell me to insist with blast or clustal then I will really dig into it in order to fix my mistakes or find what is not working but I'd rather have your advice before starting spending time on something that could end up not being worth my time !

I am all learning by myself trough trials and errors (and a lot of internet posts) so a little push in the right direction would be very welcomed ! Thanks in advance.

RNA-Seq nanopore sequence blast clustalw • 3.0k views
ADD COMMENT
5
Entering edit mode
6.9 years ago

An easy method would be to use BBMap's kmer counting functionality, plus hamming distance to allow for errors:

    bbduk.sh in=YOUR.FASTQ outm=MATCHED.FASTQ \
literal=SEARCH_STRING k=STRING_LENGTH hdist=NUMBER_OF_ERRORS
ADD COMMENT
0
Entering edit mode

Thanks for your answer, seems like it's the perfect tool for that. I have one small problem though. I tried running this :

bbduk.sh in=reads.fasta outm=matches3.fasta literal=CTCAAACTTGGGTAATTAAACC k=22 hdist=3

And I also tried running this, with the reverse complement sequence :

bbduk.sh in=reads.fasta outm=matches1.fasta literal=GGTTTAATTACCCAAGTTTGAG k=22 hdist=3

And I also tried specifying both sequences at the same time with literal=GGTTTAATTACCCAAGTTTGAG,CTCAAACTTGGGTAATTAAACC but the 3 scripts gave me back the exact same number of sequences .. which I find very odd. Any suggestions or ideas as to why is that ?

ADD REPLY
0
Entering edit mode

From the user guide:

You can specify whether or not BBDuk looks for the reverse-complement of the reference sequences as well as the forward sequence with the flag “rcomp=t” or “rcomp=f”; by default it looks for both.

ADD REPLY
0
Entering edit mode

I eventually figured that's what it was doing but I couldn't get why ! Thanks a lot, I had miss that part. Great tool you provided me here, thanks again !

ADD REPLY

Login before adding your answer.

Traffic: 2132 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6