Split multifasta-file into multiple files according to a pattern in header
5
1
Entering edit mode
3.9 years ago
6schulte ▴ 30

I am trying to split a multifasta file into several smaller mutlifasta files. This should be done according to a string in the header of each sequence included. As my sequences are the sequences from a annotated metagenom the headers are slightly different then they would normally be (see below).

Though I have seen several examples of a multifasta file being split into single fasta files for each entry in that "multi-file", I could not find a nice solution for my problem.

Particularly I'm trying to split a fasta file that looks something like this:

>EL0073as #10 ID=0.1;partial=00;size_aa=7890;start_type=none
TAGGCAGGCGTGGGGGTTTGT....
>EL00734864845r #570 ID=0.8;partial=01;size_aa=7890;start_type=none
CCTCTTCGGCCCTCA...
>EL0679495 #900 ID=0.9;partial=10;size_aa=7890;start_type=none
CAAGGACCGTTAGGGGC...
>EL0305fe #101 ID=0.4;partial=00;size_aa=7890;start_type=none
GCTGACGGCAACGTTAG...

And I want to have two files like this:

File 1: non_partial (partial=00)

>EL0073as #10 ID=0.1;partial=00;size_aa=7890;start_type=none
TAGGCAGGCGTGGGGGTTTGT....
>EL0305fe #101 ID=0.4;partial=00;size_aa=7890;start_type=none
GCTGACGGCAACGTTAG...

File 2: partial (partial=10, partial=01, partial=11)

>EL00734864845r #570 ID=0.8;partial=01;size_aa=7890;start_type=none
CCTCTTCGGCCCTCA...
>EL0679495 #900 ID=0.9;partial=10;size_aa=7890;start_type=none
CAAGGACCGTTAGGGGC...

My approach is to look if a header contains partial=00 I copy everything from that lines ">" (starting character) until the next ">" into a new file called "non_partial_sequences.fasta" the initial file is then left only with sequences that carry partial=10, partial=01 or partial=11 in the header. I then would rename it to "partial_sequences.fasta" . But this seems to be too complicated..

Hoping for good suggestions. Thanks in advance.

multifasta split pattern in header • 3.1k views
ADD COMMENT
1
Entering edit mode

My approach is to look if a header contains partial=00 I copy everything from that lines ">" (starting character) until the next ">" into a new file called "non_partial_sequences.fasta" the initial file is then left only with sequences that carry partial=10, partial=01 or partial=11 in the header. I then would rename it to "partial_sequences.fasta" . But this seems to be too complicated..

Far from being too complicated, this is the simplest and most efficient approach, and is how I would probably do it.

ADD REPLY
0
Entering edit mode

Are we to assume the sequence are in no particular order etc., and the partial=## string is the only differentiator?

ADD REPLY
2
Entering edit mode
3.9 years ago
GenoMax 148k

I will point you to Jim Kent's faSplit utility that can be found on this page (LINK). You could simply split every sequence (not sure how many there are) and then do some creative bash cat'ing to get the files you need.

That said sometimes for a very specific request like this there may be no alternative but to write some custom code.

ADD COMMENT
2
Entering edit mode
3.9 years ago
5heikki 11k

Assuming no linebreaks in sequences and the same header construct for all headers. This is likely the fastest way to do it as it requires just one pass. If you have mawk, use that over awk..

paste -d ";" - - <inputFile.fna | awk 'BEGIN{OFS=FS=";"}{print $1,$2,$3,$4"\n"$5>$2".fna"}'

Output:

partial=00.fna
partial=01.fna
partial=10.fna

If you don't want the "=" sign in your output files you can do e.g. this:

paste -d ";" - - <file | awk 'BEGIN{OFS=FS=";"}{a=$2;gsub("=","_",a);print $1,$2,$3,$4"\n"$5>a".fna"}'

Output:

partial_00.fna
partial_01.fna
partial_10.fna

Edit. I didn't read OP properly. You wanted this (still probably the fastest solution):

paste -d ";" - - <file | awk 'BEGIN{OFS=FS=";"}{if($2=="partial=00"){print $1,$2,$3,$4"\n"$5>"non_partial.fna"}else{print $1,$2,$3,$4"\n"$5>"partial.fna"}}'

Output:

partial.fna
non_partial.fna
ADD COMMENT
1
Entering edit mode
3.9 years ago
Fatima ▴ 1000

If your sequences are one and only one line you can use:

grep -A 1 ";partial=00;"  file > file1 

grep -A 1 -E ";partial=01;|;partial=10;|;partial=11;" file > file2

If your sequences occupy more that one line, you can use this command to convert your sequences:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' <  file > file.fasta

And then use file.fasta instead of file in the grep commands

ADD COMMENT
0
Entering edit mode
3.9 years ago

try seqkit:

$ seqkit grep -w 0 -rnip 'partial=00' file.fa  # for printing non-partial
$ seqkit grep -w 0 -vrnip 'partial=00' file.fa  # for printing partial

If fasta file is flattened:

$ awk -F '[=;]' -v OFS="\n" '/^>/ {getline seq} {if ($4==0) {print $0,seq > "np.fa"} else {print $0,seq > "p.fa"}}' input_file.fa

output:

$ cat np.fa 
>EL0073as #10 ID=0.1;partial=00;size_aa=7890;start_type=none
TAGGCAGGCGTGGGGGTTTGT....
>EL0305fe #101 ID=0.4;partial=00;size_aa=7890;start_type=none
GCTGACGGCAACGTTAG...

$ cat p.fa 
>EL00734864845r #570 ID=0.8;partial=01;size_aa=7890;start_type=none
CCTCTTCGGCCCTCA...
>EL0679495 #900 ID=0.9;partial=10;size_aa=7890;start_type=none
CAAGGACCGTTAGGGGC...

Thanks @ 5heikki

ADD COMMENT
0
Entering edit mode
3.9 years ago
Dave Carlson ★ 2.1k

Here's python 3 solution using Biopython with regular expressions:

#!/usr/bin/env python

import argparse
import re
import sys
from Bio import SeqIO

parser = argparse.ArgumentParser(description="Search for pattern in fasta header and return sequences whose header includes (or does not include) the match")

parser.add_argument('--fasta', required=True, help='input fasta file', action='store')
parser.add_argument('--pattern', required=True, help='pattern to match in header', action='store')
parser.add_argument('--complement', required=False, default=False, help='Select only sequences that do not match the pattern', action='store_true')

args=parser.parse_args()

# read in sequences
records = list(SeqIO.parse(args.fasta, "fasta"))

# create empty lists
found_sequences = []
missing_sequences = []

# If --complement is provided, return sequences that do not match the search string. Otherwise return matches.

for record in records:

        if re.search(args.pattern, record.description):
                found_sequences.append(record)
        else:
                missing_sequences.append(record)

# If search pattern is not found in any header, print error message only
# If search pattern is found, print sequences that match (or do not match, if --complement argument is provided) to screen

if args.complement:
        if len(missing_sequences) == len(records):
                print("ERROR: Could not find " + args.pattern + " in any fasta header!", file=sys.stderr)
        else:
                print("Writing out %i sequences that DO NOT match the search string!" % len(missing_sequences), file=sys.stderr)
                SeqIO.write(missing_sequences, sys.stdout, "fasta")
else:
        print("Writing out %i sequences that match the search string!" % len(found_sequences), file=sys.stderr)
        SeqIO.write(found_sequences, sys.stdout, "fasta")

To get the sequences with "partial=00" in the header from a fasta file called "multi.fa", you could do:

python choose_fasta_by_header.py --fasta multi.fa --pattern partial=00 > partial00.fa

To get the sequences that don't have partial=00 in the header, do:

python choose_fasta_by_header.py --fasta multi.fa --pattern partial=00 --complement > notpartial00.fa

Note that this is probably not very efficient for large files because I have it reading in the whole fasta file into memory. I should fix that.

ADD COMMENT

Login before adding your answer.

Traffic: 2508 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6