filter DNA sequences of defined length for certain bases at exact positions
3
0
Entering edit mode
7.4 years ago

Dear all,

I was wondering if there is a possibility to filter DNA sequences of defined length (lets say about 30 bp) forthe presence of a specific base at a defined positin.

Something like: Give me only the sequences which contain a G at position 5 and an A at position 20.

Maybe this tool is so trivial that it was never implemented in galaxy or I am not seraching with the right key words.

does anybody have an idea?

Thank you very much,

Phil

sequence • 1.5k views
ADD COMMENT
0
Entering edit mode

What is an input file? FASTA?

ADD REPLY
0
Entering edit mode

dear all, thank you very much,

the input file would be a simple fasta and the length of the seuqences is always the same.

with my very limited understanding of programming (I have never done this before) I will try to get this working on a windows PC...

CHeers,

phil

ADD REPLY
0
Entering edit mode

Please if you can mark answer or up-vote any solution above.

ADD REPLY
0
Entering edit mode

philipp.rathert : Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

cpad0112 's solution may be the only one that will work natively on windows. Other two solution s will require python (@st.ph.n) and a virtual unix environment (@Pierre).

ADD REPLY
0
Entering edit mode
7.4 years ago

linearize and filter with awk;

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '(substr($2,5,1)=="G") && (substr($2,20,1)=="A")' |\
tr "\t" "\n"
ADD COMMENT
0
Entering edit mode
7.4 years ago
st.ph.n ★ 2.7k

Assuming single-line fasta input, or linearize with Pierre's awk statement. Each sequence has to have G @ 5 and A @ 20, and have a length of 30bp. If you want to know how many has G+A, G only, and A only you should find it easy to edit this code to do so.

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
    for line in f:
        if line.startswith('>'):
            h = line.strip()
            s = next(f).strip()
            if len(s) == 30:
                    if s[5] == 'G' and s[20] == 'A':
                         print h, '\n', s
ADD COMMENT
0
Entering edit mode
7.4 years ago

my fasta:

>seq1
ACGTACTGCACC
>seq2
CTGATTTAGTCTATATGTATACT
>seq3
ATGACTATATCGTACGATCTAGTCCC

Requirements:

filter a sequence by length (12 bases in this example) and sequence should have G at the 3d position and C at the 12 position

Command:

$ seqkit seq -M 12  test.fa | seqkit grep -s  -R 3:3 -p G | seqkit grep -s -R 12:12 -p C

output:

$ seqkit seq -M 12  test.fa | seqkit grep -s  -R 3:3 -p G | seqkit grep -s -R 12:12 -p C

 >seq1
 ACGTACTGCACC
ADD COMMENT
0
Entering edit mode

cpad0112 : Please include links to download seqkit in future answers. Everyone may not be aware of this tool.

ADD REPLY

Login before adding your answer.

Traffic: 984 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6