Two multifasta files with the same sequences, but different headers. How to change headers to match?
3
0
Entering edit mode
8.2 years ago
roblogan6 ▴ 50

I have a large multifasta file (about 125,000 sequences) and a smaller multifasta file (about 100 sequences). All sequences in the smaller multifasta file are found in the larger file, but the headers are different. I have many (thousands) of such smaller multifasta files. How can I search the larger file for the sequences found in the smaller and then exchange the header? I would ideally be able to print out a smaller multifasta file that would be identical to the one I started with, just with the headers found in the larger file. All sequences in both files have been linearized- that is, they are a single line. Thanks!

perl multifasta fasta • 2.9k views
ADD COMMENT
2
Entering edit mode
8.2 years ago
Eric Lim ★ 2.2k

The following might work for you.

grep -f other.fa reference.fa -B1 --no-group-separator

or

grep -f other.fa reference.fa -B1 | grep -v -- "^--$"

if --no-group-separator is not available in your version of grep.

Note that this will match substrings, which could be unwanted or undesired depending on your use case.

ADD COMMENT
0
Entering edit mode

Thanks! Is there any way to change the grep command to only print out the headers themselves, instead of also including the associated sequences? I am not very familiar with grep. Thanks again.

ADD REPLY
1
Entering edit mode

You could use grep "^>" , which would get you just the headers.

Looks like the original grep solution worked? I am curious since you had large files.

ADD REPLY
0
Entering edit mode

I couldn't do it locally because grep runs out of memory, but I can run it successfully over the server that I have access to. I need to learn more about grep, awk, sed, etc. They seem quick, powerful, and really simple. Maybe I am deceiving myself about the simple part though! Do you have any resources to suggest for learning more about these type of commands? Thanks again.

ADD REPLY
1
Entering edit mode
8.2 years ago

Solution of grep -B1 by @roblogan6 is very cool, however as he/she said, it matched substrings instead of whole sequences. Besides, sequences both in the big and small files must be in single-line format.

Here's is a robust preciser solution with SeqKit:

Big file:

$ cat big.f
>seq1
ACTACGACGTC
TAGCGTA
>seq2
CGACGATCTAC
GTAGCTAGAT
>seq3
ACGTCTGACGT
>seq4 containing seq3
ACGTACGTCTG
ACGTCC

Small file:

$ cat small.fa
>seq_abc
ACTACGACGTC
TAGCGTA
>seq_123
ACGTCTGACGT

Precisely matching by sequences:

$ seqkit grep -s -i -f <(seqkit seq -s -w 0 small.fa) -w 70 big.fa    
>seq1
ACTACGACGTCTAGCGTA
>seq3
ACGTCTGACGT

Here's the "long-option" version:

seqkit grep --by-seq --ignore-case --pattern-file <(seqkit seq --seq --line-width 0 small.fa) --line-width 70 big.fa
ADD COMMENT

Login before adding your answer.

Traffic: 1334 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6