How to identify DNA sequences with ambiguous nucleotides such as N, Y, R, W.. in a multifasta file and then remove these sequences with Biopython
4
4
Entering edit mode
3.8 years ago
diego1530 ▴ 80

Dear Biostars,

My request is based on filtering and curing several multifastas. For instance, I have downloaded about 150 complete genomes from NCBI belonging to Influenza Virus that infect humans. Within these sequences, there are ambiguous nucleotides (W, S, K, M, Y, R, V, H, D, B, N), which are produced by sequencing process as following:

>H1N1_12
ATGCTTACTGGGTGATC
>H5N1
TTGCCRTCACCGNACTGC
>H1N1_9
CTGYNATTGCCATCGWAA
>H5N1_1
ATCTTACTCGGCGACTCC
>H5N1_5
ACTGYRATTCGCCTAKAA

With use of Biopython tools, I wish a script where it identifies these ambiguous nucleotides with its respective fasta header (in this case, >H5N1 has R and N, >H1N1_9 has Y, N and W, >H5N1_5 has Y, R and K) if finds whatever ambiguous nucleotide then it must remove the sequence with its ID in a new fasta file and else the process goes on until find it.

Cheers,

sequence biopython sequence analysis • 4.7k views
ADD COMMENT
0
Entering edit mode

Welcome to the forum, diegocastano182 .

While we are indeed here to help out, it is a good idea (think of it as a code of conduct) to at least show some effort yourself. Example by indicating what you may have tried or considered already before posting here.

Unfortunately (and please do not take this the wrong way) we are not here to do your work for you ;-) .

ADD REPLY
2
Entering edit mode

What about this? It's right?

from Bio import SeqIO
fastafile = "in.fasta"
outfastafile = "out.fasta"
records = []
for r in SeqIO.parse(fastafile, "fasta"):
if "N" in r.seq or "K" in r.seq or "Y" in r.seq or "R" in r.seq:
continue
else:
records.append(r)
SeqIO.write(records, outfastafile, "fasta")

Maybe, you could help me with edit it. Thanks

ADD REPLY
0
Entering edit mode

I don't know is there is a method to detect ambiguous bases with biopython. Though, you can write a script yourself if it is not an absolute necessary to use it.

Your code looks ok if it has indentations. If it is not working, could you send the error message, or the logically wrong result?

It is always better to send it as it is and with the result you got so that others do not try to guess what you are going through.

ADD REPLY
0
Entering edit mode

Try not to delete posts that have answers so that others may see them in the future if they have the same question.

ADD REPLY
3
Entering edit mode
3.8 years ago

non-python solution:

$ seqkit grep -srip '[WSKMYRVHDBN]' file.fa 
>H5N1
TTGCCRTCACCGNACTGC
>H1N1_9
CTGYNATTGCCATCGWAA
>H5N1_5
ACTGYRATTCGCCTAKAA

$ seqkit grep -vsrip '[WSKMYRVHDBN]' file.fa 
>H1N1_12
ATGCTTACTGGGTGATC
>H5N1_1
ATCTTACTCGGCGACTCC

with awk:

$ awk -v OFS="\n" -v RS=">" 'NF > 1 && $2 !~ /[WSKMYRVHDBN]/ {print ">"$1,$2}' file.fa 
>H1N1_12
ATGCTTACTGGGTGATC
>H5N1_1
ATCTTACTCGGCGACTCC

$ awk -v OFS="\n" -v RS=">" 'NF > 1 && $2 ~ /[WSKMYRVHDBN]/ {print ">"$1,$2}' file.fa 
>H5N1
TTGCCRTCACCGNACTGC
>H1N1_9
CTGYNATTGCCATCGWAA
>H5N1_5
ACTGYRATTCGCCTAKAA
ADD COMMENT
0
Entering edit mode

Hi cpad0112. Thanks for your help. Do you have a python script instead?

ADD REPLY
1
Entering edit mode
3.8 years ago
MSRS ▴ 590

Sequence cleaner will help in this regard.

usage: sequence_cleaner [-h] [-v] -q QUERY -o OUTPUT_DIRECTORY
                        [-ml MINIMUM_LENGTH] [-mn PERCENTAGE_N]
                        [--keep_all_duplicates] [--remove_ambiguous] [-l LOG]

Sequence Cleaner: Remove Duplicate Sequences, etc

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -q QUERY, --query QUERY
                        Path to directory with FAST(A/Q) files
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to output files
  -ml MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
                        Minimum length allowed (default=0 - allows all the
                        lengths)
  -mn PERCENTAGE_N, --percentage_n PERCENTAGE_N
                        Percentage of N is allowed (default=100)
  --keep_all_duplicates
                        Keep All Duplicate Sequences
  --remove_ambiguous    Remove any sequence with ambiguous bases
  -l LOG, --log LOG     Path to log file (Default: STDOUT).
ADD COMMENT
0
Entering edit mode
3.8 years ago
JC 13k

Perl:

#!/usr/bin/perl

use strict;
use warnings;

$/="\n>"; # read fasta in blocks
while (<>) {
    s/>//g;
    my ($id, @seq) = split(/\n/, $_);
    my $seq = join("", @seq);
    if ($seq =~ /[^ACGT]/i) {
        warn "$id ignored\n";
    } else {
        print join "\n", ">$id", @seq;
        print "\n";
    }
}

Testing:

$ perl skipFa.pl < test.fa > filtered.fa
H5N1 ignored
H1N1_9 ignored
H5N1_5 ignored
$ cat filtered.fa
>H1N1_12
ATGCTTACTGGGTGATC
>H5N1_1
ATCTTACTCGGCGACTCC
ADD COMMENT
0
Entering edit mode
2.3 years ago
Biodeer ▴ 40

Bioaider may help you.

Function: detect and correct the ambiguous base based on MSA. (https://github.com/virologist/ambiguous_base_v2) enter image description here

ADD COMMENT

Login before adding your answer.

Traffic: 1606 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6