Extracting specific IDs + sequence from multifasta
4
0
Entering edit mode
7.2 years ago

Hey,

I got a multifasta file like this

>stuff1;[gene1];stuff,morestuff    
ATGGAGATAATAGATAGC
>stuff1;[gene2];stuff,morestuff    
ATGGAGATAATAGATAGC
>stuff2;[gene1];stuff,morestuff   
GTACTACATCGCTAGCACTACT
>stuff2;[gene2];stuff,morestuff    
GTAGTCATCAGCTACGACTACT

So between each ID and sequence is a new line. I want to extract e.q. all IDs and their sequences with [gene1], basically search the ID for a term and then extract ID and seq into a new fasta file with the filename of the extracted term. It is important that the complete ID is extracted, but the "search term" is just short ( in this case, [gene1])

I tried awk and grep

awk'/[gene1]/' RS='>' input.fasta > output.fasta

grep "[gene1]" input.fasta > output.fasta

But this just gave me all lines after [gene1] in both cases.

When searching for [gene1], i need a new multifasta like this:

>stuff1;[gene1];stuff,morestuff    
ATGGAGATAATAGATAGC
>stuff2;[gene1];stuff,morestuff   
GTACTACATCGCTAGCACTACT

Best Regards

multifasta • 13k views
ADD COMMENT
2
Entering edit mode

What do you mean by

So between each ID and sequence is a new line.

Do you mean empty new line? If the sequences do not contain new line character (i.e. are not folded) then you can try:

grep  -A1 "[gene1]" input.fasta > output.fasta
ADD REPLY
0
Entering edit mode

I meant that the header and the following seq are on seperate lines. There is no empty line between. Would it be easier if they were on the same line?

ADD REPLY
0
Entering edit mode

Ok, my problem was: since I searched for my term in brackets "[gene1]" instead of "gene1" it searched for all possible matchings, like: Is there g, ge, gen,and so on.

grep -A1 "term" input.fasta > output.fasta

is working

ADD REPLY
0
Entering edit mode

You may be interested in using SEDA (http://www.sing-group.org/seda/), which has different functions for extracting and filtering sequence identifiers. Regards.

ADD REPLY
0
Entering edit mode
grep "[gene1]"

or

awk /[gene1]/

are POSIX Bracket Expressions https://www.regular-expressions.info/posixbrackets.html .

$ echo 'e' | grep '[gene1]'
e

you want something like grep -A 1 --no-group-separator -F '[gene1]' if there is only one sequence after the header.

ADD REPLY
2
Entering edit mode
7.2 years ago

If you have long format of fasta file (the sequence is in one line), then you can use grep with -A1 option. Otherwise one clean solution would be with biopython SEQIO module.

ADD COMMENT
1
Entering edit mode
7.2 years ago

Ok, my problem was: since I searched for my term in brackets "[gene1]" instead of "gene1" it searched for all possible matchings, like: Is there g, ge, gen,and so on.

grep -A1 "term" input.fasta > output.fasta

is working for me.

ADD COMMENT
0
Entering edit mode
7.2 years ago
Joe 22k

You can use my general purpose fasta-fetching script below for this. Just give it a key, or a file full of keys that you want to retrieve. It's 'greedy' so it'll look for the key anywhere in the header, so just make sure your headers are all unique enough that you don't get any overlap (or tweak the code slightly e.g. if header == x rather than if x in header)

# Extract fasta files by their descriptors stored in a separate file.
# Requires biopython
from Bio import SeqIO
import sys
import argparse
def getKeys(args):
"""Turns the input key file into a list. May be memory intensive."""
with open(args.keys, "r") as kfh:
keys = []
for line in kfh:
line = line.rstrip('\n')
line = line.lstrip('>')
keys.append(line)
return keys
def main():
"""Takes a string or list of strings in a text file (one per line) and retreives them and their sequences from a provided multifasta."""
# Parse arguments from the commandline:
try:
parser = argparse.ArgumentParser(description='Retrieve one or more fastas from a given multifasta.')
parser.add_argument(
'-f',
'--fasta',
action='store',
required=True,
help='The multifasta to search.')
parser.add_argument(
'-k',
'--keys',
action='store',
required=True,
help='A string provided directly, or a file of header strings to search the multifasta for. Must be exact. Must be one per line.')
parser.add_argument(
'-o',
'--outfile',
action='store',
default=None,
help='Output file to store the new fasta sequences in. Just prints to screen by default.')
parser.add_argument(
'-v',
'--verbose',
action='store_true',
help='Set whether to print the key list out before the fasta sequences. Useful for debugging.')
parser.add_argument(
'-i',
'--invert',
action='store_true',
help='Invert the search, and retrieve all sequences NOT specified in the keyfile.')
args = parser.parse_args()
except:
print('An exception occured with argument parsing. Check your provided options.')
sys.exit(1)
# Main code:
# Call getKeys() to create the list of keys from the provided file:
try:
keys = getKeys(args.keys)
# If -k/--keys was provided with a string, not a file path, the IO error is used as the indicator
# to switch to expecting a string only, rather than a file.
except IOError:
keys = args.keys
else:
print("Couldn't determine a key from your provided file or string. Double check your file, or ensure your string is quoted correctly.")
if args.verbose is not False:
if args.invert is False:
print('Fetching the following keys from: ' + inFile)
for key in keys:
print(key)
else:
print('Ignoring the following keys, and retreiving everything else from: ' + inFile)
for key in keys:
print(key)
# Parse in the multifasta and assign an iterable variable:
seqIter = SeqIO.parse(inFile, 'fasta')
# For each sequence in the multifasta, check if it's in the keys[] tuple. If so, print it out:
for seq in seqIter:
if args.invert is False:
if seq.id in keys:
print(seq.format("fasta"))
if args.outfile is not None:
SeqIO.write(seq, outFile, "fasta")
else:
# If the --invert/-i flag is given, print all fastas NOT listed in the keyfile
if seq.id not in keys:
print(seq.format("fasta"))
if args.outfile is not None:
SeqIO.write(seq, outFile, "fasta")
if __name__ == "__main__":
main()
view raw fastafetcher.py hosted with ❤ by GitHub

ADD COMMENT
0
Entering edit mode
7.2 years ago
Buffo ★ 2.4k

Save the code below as patt_to_fasta.py and run it:

patt_to_fasta.py file.fasta

#!/usr/bin/env python
#-*- coding: UTF-8 -*-

import sys
if len(sys.argv) != 2:
    print syntax
    sys.exit()
prefix = sys.argv[1].split('.')[0]
seq = ''
count = 0
#you can automate searching using: pattern = raw_input(Write pattern to search in fasta names: ):
with open (sys.argv[1], 'r') as infile, open(prefix + '_newfa.fasta', 'w') as outfile: #and also this line for :with open (sys.argv[1], 'r') as infile, open(prefix + '_' + str(pattern) + '.fasta', 'w') as outfile:
    for line in infile:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if seq:
                if '[gene1]' in name:                          #change this line if you want to do it for more than one pattern for: if str(pattern) in name:   
                    count += 1
                    outfile.write(name + '\n' + seq + '\n')
                seq  = ''
            name = line
        else:
            seq = seq + line
    if '[gene1]' in name:
        count += 1  
        outfile.write(name + '\n' + seq + '\n')
print '\n' + '\n' + 'File ' + prefix + '_newfa.fasta has been created..'
print 'This new fasta contains ' + str(count) + ' sequences...'

If you want to do it for more than one pattern change lines marked with '#'

ADD COMMENT

Login before adding your answer.

Traffic: 2168 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6