How do I use a regex to remove a line from a FASTA file if it has more than 6 'A' or 6 'T' sequentially present in a line?
4
1
Entering edit mode
7.5 years ago
inkprs ▴ 70

I am looking for a solution in both Bash/Linux and Python.

Also,

If more than 40%[Aa]’s or [Tt]’s present, remove the sequence and the description.

e.g. file.fasta

>seq1
GGCAGAGGCCCCCTAGCCCCGCCCGCGCCATGGTCAGGCACGCCCCTCCTCATCGCGGGGCACAGCCCGGCGGGTAGCCCCAGCGCTGGAGGCGGGCGGGGCCGGCCGGCGGAGGCCTGAGCAGCAGCCCAGCGCGGGCCGCCGAGACACCATGAGAGCCCCCACACTCCTCGCCCCACCGGCCCTGGCCGCACTGGGCACCGCTGGCCGGGCGGGTGGGTGCCCC

>seq2
CCACTGCACTCACCGCACCCGGCCAATTTTTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACCTAAGACGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGTCCCGGGATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGACCCCTG

After processing, output should be:

>seq1
GGCAGAGGCCCCCTAGCCCCGCCCGCGCCATGGTCAGGCACGCCCCTCCTCATCGCGGGGCACAGCCCGGCGGGTAGCCCCAGCGCTGGAGGCGGGCGGGGCCGGCCGGCGGAGGCCTGAGCAGCAGCCCAGCGCGGGCCGCCGAGACACCATGAGAGCCCCCACACTCCTCGCCCCACCGGCCCTGGCCGCACTGGGCACCGCTGGCCGGGCGGGTGGGTGCCCC

seq1 is fine but seq2 and its description have to be removed from a file.

e.g.

AGAGCTAGAAGGGG - ok

AGAAAAAAAAGAGG - remove -- more than 6 A sequentially

ATTTTTTTTATGATG - remove -- more than 6 T sequentially

AGTAGTTAGGGGGG - ok

ACAGAAACAGAATG - remove -- more than 40% of A

TCTGATTTATTATTG - remove -- more than 40% of T

Are there any tools available on the market for these purposes? I would like to get a solution via tool as well use python/Linux.

FASTA FASTQ sequencing • 2.5k views
ADD COMMENT
1
Entering edit mode
grep  -iE '(A{6,}|T{6,})'
ADD REPLY
2
Entering edit mode
7.5 years ago

A pythonic way to achieve the same:

from Bio import SeqIO
import getopt, sys, re

def usage():
    print "Usage: remove_seqs.py -i <input_fasta> -o <output_fasta>"

try:
    options, remainder=getopt.getopt(sys.argv[1:], 'i:o:h')

except getopt.GetoptError as err:
    print str(err)
    usage()
    sys.exit()

for opt, arg in options:
    if opt in ('-i'):
        input_file=arg
    if opt in ('-h'):
        usage()
    sys.exit()
    elif opt in ('-o'):
        output_file=arg

out=open(output_file, 'w')

for record in SeqIO.parse(input_file, "fasta"):
    # check for 6 or more A's or T's
    search_hexamers=re.search('A{6,}|T{6,}',str(record.seq))
    # remove those sequences
    if not search_hexamers:
        # Calculate percentage of A's or T's 
        a = (sum(record.seq.count(x) for x in ['A'])/float(len(str(record.seq)))) * 100
        t = (sum(record.seq.count(x) for x in ['T'])/float(len(str(record.seq)))) * 100
        # Check of percentage of A's or T's is more than 40 % and remove
        if not a > 40 or t > 40:
            out.write(">"+record.id+"\n"+str(record.seq))
out.close()

USAGE

remove_seqs.py -i <input_fasta> -o <output_fasta>

Dependency: Biopython

Input:

>ok
AGAGCTAGAAGG
>remove
AGAAAAAAAAGG 
>remove
ATTTTTTTTAGGA
>ok
AGTAGTTAGGGG

Output

>ok
AGTAGTTAGGGG

Note:

>ok
AGAGCTAGAAGG

above sequence was removed as the percent of A's crossed 40%

ADD COMMENT
0
Entering edit mode

Hi,

It is deleting some portion of id part in result FASTA file. All result fasta file has id portion being truncated.

e.g. 1:N:0:TACGCATT has been deleted from below id: (see last portion been deleted)

>ABCDEFG01:240:B43AWACQX:1:1301:1421:1128 1:N:0:TACGCATT
ADD REPLY
0
Entering edit mode

Kindly replace this

out.write(">"+record.id+"\n"+str(record.seq))

with this

out.write(">"+record.description+"\n"+str(record.seq))
ADD REPLY
0
Entering edit mode

That works perfectly, thanks.

What regex should I use if I want to search for a sequence of more than 6A's and 6T's only in beginning and end of the string?

 from: search_hexamers=re.search('A{6,}|T{6,}',str(record.seq))
 to: ?
ADD REPLY
1
Entering edit mode
7.5 years ago

BBDuk will handle hexamers in single-line or multi-line fasta:

bbduk.sh in=file.fa out=filtered.fa k=6 mm=f literal=AAAAAA

You can also add "mingc=0.6" to remove anything with more than 40% A or T combined. There is no flag for A or T individually. However, you could accomplish filtering based on 40% of the bases being A or 40% of the bases being T individually, like this:

reformat.sh in=file.fa out=AC.fa remap=AC
reformat.sh in=file.fa out=TC.fa remap=TC
reformat.sh in=AC.fa out=filteredAC.fa maxgc=0.6
reformat.sh in=TC.fa out=filteredTC.fa maxgc=0.6
cat filteredAC.fa filteredTC.fa > filteredACTC.fa
filterbyname.sh in=file.fa out=filtered.fa names=filteredACTC.fa exclude

That first changes all of the A's to C's, for example, so that only T will be counted as AT content, then does the gc-filtering at 60%, and does the same again for T to C. Then the names of the resulting files are used by filterbyname to exclude all the sequences with at least 40% of A or 40% of T.

ADD COMMENT
1
Entering edit mode
7.5 years ago

You could linearize the FASTA, then pipe to a second awk statement to test the regex and the overall A/T content:

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa | awk '{ n=split($2,s,""); a=0; t=0; for(l in s){ if(s[l]=="A" || s[l]=="a") {a+=1;} if (s[l]=="T" || s[l]=="t") {t+=1;}} if (!match($2, /(A{6,}|T{6,})/) && (a/n < 0.4) && (t/n < 0.4)) { print $0; }}' | tr "\t" "\n"
>seq1
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGTATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACCATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGTGCCCC
ADD COMMENT
1
Entering edit mode
7.5 years ago

An ugly but working one-line solution supporting both Windows and Linux using seqkit and csvtk:

$ cat seqs.fa 
>ok
AGAGCTAGAAGG
>remove
AGAAAAAAAAGG 
>remove
ATTTTTTTTAGGA
>ok
AGTAGTTAGGGG

$ seqkit grep -s -i -r -p "\"A{6,}\"" -p "\"T{6,}\"" -v  seqs.fa | seqkit fx2tab -B A,T | csvtk filter2 -H -t -f "! (\$4 > 40 || \$5 > 40)" | seqkit tab2fx
>ok
AGTAGTTAGGGG

The result is right as Vijay Lakhujani's.

The ugly "\"A{6,}\"" is for compatibility in Windows, the more readable version in Linux/OS X:

seqkit grep -s -i -r -p '"A{6,}"' -p '"T{6,}"' -v  seqs.fa \
    | seqkit fx2tab -B A,T \
    | csvtk filter2 -H -t -f '!($4 > 40 || $5 > 40)' \
    | seqkit tab2fx

Explain:

  1. excluding sequences with 6'A' or 6'T' by searching sequences with regular expression.
  2. converting FASTA format to tabular format, custom base content of A and T are outputted in 4th and 5th columns.
  3. excluding sequences with A/T of >40%. awk can do this too.
  4. converting tabular format back to FASTA format
ADD COMMENT

Login before adding your answer.

Traffic: 1988 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6