Question

Split multifasta-file into multiple files according to a pattern in header

1

Entering edit mode

3.9 years ago

6schulte ▴ 30

I am trying to split a multifasta file into several smaller mutlifasta files. This should be done according to a string in the header of each sequence included. As my sequences are the sequences from a annotated metagenom the headers are slightly different then they would normally be (see below).

Though I have seen several examples of a multifasta file being split into single fasta files for each entry in that "multi-file", I could not find a nice solution for my problem.

Particularly I'm trying to split a fasta file that looks something like this:

>EL0073as #10 ID=0.1;partial=00;size_aa=7890;start_type=none
TAGGCAGGCGTGGGGGTTTGT....
>EL00734864845r #570 ID=0.8;partial=01;size_aa=7890;start_type=none
CCTCTTCGGCCCTCA...
>EL0679495 #900 ID=0.9;partial=10;size_aa=7890;start_type=none
CAAGGACCGTTAGGGGC...
>EL0305fe #101 ID=0.4;partial=00;size_aa=7890;start_type=none
GCTGACGGCAACGTTAG...

And I want to have two files like this:

File 1: non_partial (partial=00)

>EL0073as #10 ID=0.1;partial=00;size_aa=7890;start_type=none
TAGGCAGGCGTGGGGGTTTGT....
>EL0305fe #101 ID=0.4;partial=00;size_aa=7890;start_type=none
GCTGACGGCAACGTTAG...

File 2: partial (partial=10, partial=01, partial=11)

>EL00734864845r #570 ID=0.8;partial=01;size_aa=7890;start_type=none
CCTCTTCGGCCCTCA...
>EL0679495 #900 ID=0.9;partial=10;size_aa=7890;start_type=none
CAAGGACCGTTAGGGGC...

My approach is to look if a header contains partial=00 I copy everything from that lines ">" (starting character) until the next ">" into a new file called "non_partial_sequences.fasta" the initial file is then left only with sequences that carry partial=10, partial=01 or partial=11 in the header. I then would rename it to "partial_sequences.fasta" . But this seems to be too complicated..

Hoping for good suggestions. Thanks in advance.

multifasta split pattern in header • 3.1k views

ADD COMMENT • link updated 3.9 years ago by 5heikki 11k • written 3.9 years ago by 6schulte ▴ 30

1

Entering edit mode

My approach is to look if a header contains partial=00 I copy everything from that lines ">" (starting character) until the next ">" into a new file called "non_partial_sequences.fasta" the initial file is then left only with sequences that carry partial=10, partial=01 or partial=11 in the header. I then would rename it to "partial_sequences.fasta" . But this seems to be too complicated..

Far from being too complicated, this is the simplest and most efficient approach, and is how I would probably do it.

ADD REPLY • link 3.9 years ago by Joe 22k

0

Entering edit mode

Are we to assume the sequence are in no particular order etc., and the partial=## string is the only differentiator?

ADD REPLY • link 3.9 years ago by Joe 22k

score 2 · Answer 1 · 2021-02-04

I will point you to Jim Kent's faSplit utility that can be found on this page (LINK). You could simply split every sequence (not sure how many there are) and then do some creative bash cat'ing to get the files you need.

That said sometimes for a very specific request like this there may be no alternative but to write some custom code.

score 2 · Answer 2 · 2021-02-04

Assuming no linebreaks in sequences and the same header construct for all headers. This is likely the fastest way to do it as it requires just one pass. If you have mawk, use that over awk..

paste -d ";" - - <inputFile.fna | awk 'BEGIN{OFS=FS=";"}{print $1,$2,$3,$4"\n"$5>$2".fna"}'

Output:

partial=00.fna
partial=01.fna
partial=10.fna

If you don't want the "=" sign in your output files you can do e.g. this:

paste -d ";" - - <file | awk 'BEGIN{OFS=FS=";"}{a=$2;gsub("=","_",a);print $1,$2,$3,$4"\n"$5>a".fna"}'

Output:

partial_00.fna
partial_01.fna
partial_10.fna

Edit. I didn't read OP properly. You wanted this (still probably the fastest solution):

paste -d ";" - - <file | awk 'BEGIN{OFS=FS=";"}{if($2=="partial=00"){print $1,$2,$3,$4"\n"$5>"non_partial.fna"}else{print $1,$2,$3,$4"\n"$5>"partial.fna"}}'

Output:

partial.fna
non_partial.fna

score 1 · Answer 3 · 2021-02-04

If your sequences are one and only one line you can use:

grep -A 1 ";partial=00;"  file > file1 

grep -A 1 -E ";partial=01;|;partial=10;|;partial=11;" file > file2

If your sequences occupy more that one line, you can use this command to convert your sequences:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' <  file > file.fasta

And then use file.fasta instead of file in the grep commands

score 0 · Answer 4 · 2021-02-04

try seqkit:

$ seqkit grep -w 0 -rnip 'partial=00' file.fa  # for printing non-partial
$ seqkit grep -w 0 -vrnip 'partial=00' file.fa  # for printing partial

If fasta file is flattened:

$ awk -F '[=;]' -v OFS="\n" '/^>/ {getline seq} {if ($4==0) {print $0,seq > "np.fa"} else {print $0,seq > "p.fa"}}' input_file.fa

output:

$ cat np.fa 
>EL0073as #10 ID=0.1;partial=00;size_aa=7890;start_type=none
TAGGCAGGCGTGGGGGTTTGT....
>EL0305fe #101 ID=0.4;partial=00;size_aa=7890;start_type=none
GCTGACGGCAACGTTAG...

$ cat p.fa 
>EL00734864845r #570 ID=0.8;partial=01;size_aa=7890;start_type=none
CCTCTTCGGCCCTCA...
>EL0679495 #900 ID=0.9;partial=10;size_aa=7890;start_type=none
CAAGGACCGTTAGGGGC...

Thanks @ 5heikki

score 0 · Answer 5 · 2021-02-04

Here's python 3 solution using Biopython with regular expressions:

#!/usr/bin/env python

import argparse
import re
import sys
from Bio import SeqIO

parser = argparse.ArgumentParser(description="Search for pattern in fasta header and return sequences whose header includes (or does not include) the match")

parser.add_argument('--fasta', required=True, help='input fasta file', action='store')
parser.add_argument('--pattern', required=True, help='pattern to match in header', action='store')
parser.add_argument('--complement', required=False, default=False, help='Select only sequences that do not match the pattern', action='store_true')

args=parser.parse_args()

# read in sequences
records = list(SeqIO.parse(args.fasta, "fasta"))

# create empty lists
found_sequences = []
missing_sequences = []

# If --complement is provided, return sequences that do not match the search string. Otherwise return matches.

for record in records:

        if re.search(args.pattern, record.description):
                found_sequences.append(record)
        else:
                missing_sequences.append(record)

# If search pattern is not found in any header, print error message only
# If search pattern is found, print sequences that match (or do not match, if --complement argument is provided) to screen

if args.complement:
        if len(missing_sequences) == len(records):
                print("ERROR: Could not find " + args.pattern + " in any fasta header!", file=sys.stderr)
        else:
                print("Writing out %i sequences that DO NOT match the search string!" % len(missing_sequences), file=sys.stderr)
                SeqIO.write(missing_sequences, sys.stdout, "fasta")
else:
        print("Writing out %i sequences that match the search string!" % len(found_sequences), file=sys.stderr)
        SeqIO.write(found_sequences, sys.stdout, "fasta")

To get the sequences with "partial=00" in the header from a fasta file called "multi.fa", you could do:

python choose_fasta_by_header.py --fasta multi.fa --pattern partial=00 > partial00.fa

To get the sequences that don't have partial=00 in the header, do:

python choose_fasta_by_header.py --fasta multi.fa --pattern partial=00 --complement > notpartial00.fa

Note that this is probably not very efficient for large files because I have it reading in the whole fasta file into memory. I should fix that.