How to concatenate FASTA files in a specific order based on sequence ID
4
1
Entering edit mode
3.7 years ago
Apex92 ▴ 300

I've got a problem that I don't have the scripting skills to solve (nor the time to gain them at the moment).

I have two fasta files and what I want to do is to merge them but in a way that similar sequence ids should be printed back to back - and any sequence ids in one of the files that do not exist in another file then those should not be printed.

cat f1.fa

>seq1
ATCGTCA
>seq2
AAAAACT
>seq3
AACATCA
>seq71
CCCGA

cat f2.fa

>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA

so the desired output should look like :

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG

I'd prefer to use python as that is the language I'm learning but any solution will suffice.

Thanks for helping.

sequence fasta awk python pipeline • 3.6k views
ADD COMMENT
3
Entering edit mode
3.7 years ago

using linearizefasta.awk

join -t $'\t' -1 1 -2 1 \
         <(awk -f linearizefasta.awk in1.fa | sort -t $'\t' -k1,1) \
         <(awk -f linearizefasta.awk in2.fa | sort -t $'\t' -k1,1) |\
awk '{printf("%s\n%s\n%s\n%s\n",$1,$2,$1,$3);}'
ADD COMMENT
0
Entering edit mode

nice one Pierre Lindenbaum , but for some reason it always gives a 'duplication' of the first entry of the first file ?

ADD REPLY
1
Entering edit mode

nice catch it's because a seq1 is present twice in the second file. So a uniqshould be added.

ADD REPLY
0
Entering edit mode

@Pierre thank you for your help - could you please also share how I can run it? I am very basic with this stuff, sorry

ADD REPLY
2
Entering edit mode
3.7 years ago
David Parry ▴ 150

The following script should do what you want as long as you don't have duplicate sequence names in the same file (as you do in f2.fa in your example but hopefully not in your real data). You will need to install the pyfaidx module (e.g. with pip3 install faidx --user).

#!/usr/bin/env python3
import sys
from pyfaidx import Fasta


def cat_fastas(*fasta_files):
    fas = list()
    all_contigs = set()
    for f in fasta_files:
        faidx = Fasta(f)
        fas.append(faidx)
        contigs = set(faidx.keys())
        all_contigs.update(contigs)
    for c in sorted(all_contigs):
        for f in fas:
            if c in f:
                print(">" + f[c].long_name)
                print(f[c])


if __name__ == '__main__':
    if len(sys.argv) < 3:
        sys.exit("Usage: {} f1.fa f2.fa ...fN.fa".format(sys.argv[0]))
    cat_fastas(*sys.argv[1:])

Because you technically shouldn't have two fasta records with the same sequence name in the same file you might want to consider checking for duplicates and adding a suffix to any duplicate sequences too.

EDIT:

I missed the part of your question

and any sequence ids in one of the files that do not exist in another file then those should not be printed

In which case:

#!/usr/bin/env python3
import sys
from pyfaidx import Fasta


def cat_fastas(f1, f2):
    fas1 = Fasta(f1)
    fas2 = Fasta(f2)
    overlapping_contigs = set(fas1.keys()).intersection(set(fas2.keys()))
    for c in sorted(overlapping_contigs):
        for f in (fas1, fas2):
            print(">" + f[c].long_name)
            print(f[c])


if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit("Usage: {} f1.fa f2.fa".format(sys.argv[0]))
    cat_fastas(*sys.argv[1:]

Should do what you want.

ADD COMMENT
1
Entering edit mode
3.7 years ago

if the order of the elements with the same names does not really matters you could do the following:

sort -V <(cat <yourFile1> |paste - -) <(cat <yourFile2> |paste - -) | sed 's/\t/\n/g'

[replace <yourFile> parts by your input fasta files]

the part between the brackets is a bash subprocess and will transform the fasta file to ID and seq on 1 line, for both files. Then sort will order the combination of both files and finally the sed part will transform the output back to fasta format (where ID on a line and next line is the seq)

OK, missed this part as well :

and any sequence ids in one of the files that do not exist in another file then those should not be printed

then it's all for Pierre Lindenbaum 's solution :)

ADD COMMENT
0
Entering edit mode

@lieven thank you for your help - indeed the order is important; meaning that the lines of in the f1 should appear first

ADD REPLY
1
Entering edit mode
3.7 years ago

with OP data

$ cat seq1.fa seq2.fa | paste - -  | sort -sk1,1 | tr "\t" "\n"                                                                                                                     

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq3
AACATCA
>seq71
CCCGA
>seq71
AACGCGCATG
>seq81
AAACCCAGCGCATGCA

If fasta files are not linearized, try with seqkit:

$ seqkit fx2tab seq1.fa seq2.fa | sort -sk1,1 | seqkit tab2fx                                                                                                                         

Round about is way due to limitations in seqkit sort. Seqkit doesn't allow duplicate headers while sorting.

ADD COMMENT
0
Entering edit mode

cpad0112 seq3 and seq 81 are not common between the two files and should not be printed - could you please help with that?

ADD REPLY
1
Entering edit mode
$ paste - - < seq1.fa < seq2.fa| sort -sk1,1 | awk -v OFS="\t" '{print $2,$1}' | uniq -D -f 1 | awk -v OFS="\n" '{print $2,$1}'

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG

seq 71 is common.

ADD REPLY
0
Entering edit mode

uniq: illegal option -- D usage: uniq [-c | -d | -u] [-i] [-f fields] [-s chars] [input [output]]

ADD REPLY
0
Entering edit mode

what OS are you using? You would need GNU coreutils 8.30 and I am on ubuntu 20.04 (focal)

ADD REPLY
0
Entering edit mode

my system is mac

ADD REPLY
0
Entering edit mode

I am not sure how I can fix this -D problem - is there another way of doing it? I am in a time shortage otherwise I would consider using GNU

ADD REPLY
0
Entering edit mode

see if this works (try it on example files in OP):

$ join -1 1 -2 1 <(paste - - <seq1.fa) <(paste - - <seq2.fa) -o 1.1 | uniq  | grep -A 1 --no-group-separator -hf - seq1.fa seq2.fa | paste - -  | sort -sk1,1 | awk -v OFS="\n" '{print $1,$2}'

>seq1
ATCGTCA
>seq1
AAAATCGCGCGCATG
>seq1
AAATAAAAACGCTCGGG
>seq2
AAAAACT
>seq2
TTAGCGCTAGCCCGCGCTCAGC
>seq71
CCCGA
>seq71
AACGCGCATG

join is due to common sequences in between seq1 and seq2.fa. If you want sequences from seq1.fa only,

$ grep ">" seq1.fa  | grep -A 1 --no-group-separator -hf - seq1.fa seq2.fa | paste - -  | sort -sk1,1 | awk -v OFS="\n" '{print $1,$2}'
ADD REPLY
0
Entering edit mode

No, not unfortunately it is not working - I get errors with options of join and grep

ADD REPLY
0
Entering edit mode

error with --no-group-separator

ADD REPLY

Login before adding your answer.

Traffic: 2622 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6