Find motif in reads and delete everything before / extract everything after

0

Entering edit mode

5.8 years ago

zack.saud ▴ 50

Hi all,

I have sequence some plasmid DNA using Nanopore sequencing (whole plasmid was sequenced). I am only interested in one region, therefore I would like to extract from the fasta file (around 10000 reads) just 425 bp from a specified location in each 4600 bp read (i.e. everything after atgacccg to be kept, or delete everything before this sequence).

Would someone be as kind as to help me do this? I've tried a few tools, but they fail with the long nanopore reads.

Many thanks in advance

next-gen sequencing • 1.4k views

ADD COMMENT • link updated 5.8 years ago by mbelmadani ★ 1.4k • written 5.8 years ago by zack.saud ▴ 50

2

Entering edit mode

5.8 years ago

mbelmadani ★ 1.4k

Any tool using regular expression supporting groups ( a part of a regex typically wrapped in at ( ) that can be referenced with \1 later) should work.

Example using sed:

$ cat my.fasta

>header1
TGTCATCGATCATCTGACTGATGCTGACTGACTGACTGATGCTGTGATGCATGCATGCTGACTGACTGACTGACTGatgacccgACCGGGTTTTAAAAACCCCCCGGGGGGGTTTTTTTTaaaaaaaaaaaaa

Extract the next 3 characters after atgacccg:

sed -e 's|.*atgacccg\(.\{3\}\).*|\1|g' my.fasta
>header1
ACC

Here the syntax of sed means substitute anything that has atgacccg, capture the group with is .{3}, meaning 3 of anything (you would use 425 in your case) and then followed by anything. The second part with \1 means replace the matched string with group #1. In the case where there is not match (like your header), there is not replacement.

One caveat: Make sure your fasta sequence is continuous and not broken on multiple lines. See here:

Linearize a fasta sequence

awk -f linearizefasta.awk < input.fa

or

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

Format back to fasta

tr "\t" "\n" < linearized.tsv

if you know your fasta header have a length < 60

tr "\t" "\n" < linearized.tsv | fold -w 60

view raw README.md hosted with ❤ by GitHub

	/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;}
	{printf("%s",$0);}
	END {printf("\n");}

view raw linearizefasta.awk hosted with ❤ by GitHub

, you could adapt something like that by adding a pipe to sed before formatting back to fasta.

ADD COMMENT • link 5.8 years ago by mbelmadani ★ 1.4k

1

Entering edit mode

Hi, Many thanks for this, it worked like a charm! If I could ask for your help just one more time, the data is a little bit noisy, and it appears that it is retaining reads that do not contain this sequence. Would you happen to know how to modify the above script to discard any reads that do not contain the specified sequence? Many thanks again zack

ADD REPLY • link 5.8 years ago by zack.saud ▴ 50

0

Entering edit mode

When you linearize, you could prefix the input to the sed command with a grep for atgacccg, i.e.

grep "atgacccg"  my_linearized.fasta  |  sed -e 's|.*atgacccg\(.\{3\}\).*|\1|g' my.fasta > only_matching_atgacccg.fasta

That way only sequences matching atgacccg are retained and processed by sed.

ADD REPLY • link 5.8 years ago by mbelmadani ★ 1.4k

Login before adding your answer.