Extract fasta sequences from a file using a list in another file.

5

Entering edit mode

6.9 years ago

EBP91 ▴ 50

I have a question concerning the extraction of sequences from a fasta file (>7000 sequences) using a reference .txt file with sequence headers. I have been playing around and been looking all over the internet to find a solution for this problem, but surprisingly, nothing really matches what I want to do. So, I have two files:

1) a fasta file which looks like this:

>Zotu1

ACTGACAAAGCA

TGCACGTCATTTT

>Zotu2

ATGCATCAGCATA

TGACCCCCGTTTA

>Zotu10

CGTCGAAAAATTT

CGATACACCCTAT

>Zotu22

CGTACGTCCCCTT

CGATATAATATATA

2) a .txt file with a list of sequence names:

Zotu1

Zotu2

Now, I want to use the .txt file to select sequences from the .fasta file. I have two semi-solutions that do part of the job.

OPTION 1:

cat list.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - sequences.fas > newfile.fas

Problem: this function gives me the full sequences, but extracts too many sequences since everything that partially matches the strings in the .txt file will be selected. In this case, it means that also Zotu10 and Zotu22 are selected.

OPTION 2:

grep -A 1 -wFf list.txt sequences.fas > newfile2.fas

Problem: this function correctly selects only the sequences that completely match the strings in the .txt file, but does not return the full fasta sequences, but only the part of the sequence on the first line. An output thus looks like this:

>Zotu1

ACTGACAAAGCA

>Zotu2

ATGCATCAGCATA

I tried combining both solutions but that somehow did not end well. I would be much helped by an elegant solution for this problem, preferably using the codes I already obtained.

Many thanks!

fasta header extract awk grep • 79k views

ADD COMMENT • link updated 4.3 years ago by Rashedul Islam ▴ 480 • written 6.9 years ago by EBP91 ▴ 50

0

Entering edit mode

Please format your fasta sequences appropriately, using the formatting button. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

code_formatting

ADD REPLY • link updated 5.4 years ago by GenoMax 151k • written 6.9 years ago by WouterDeCoster 48k

0

Entering edit mode

I just did that for the OP. But yes from next time he or she should do it. Thanks WouterDeCoster

ADD REPLY • link 6.9 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Thanks! I adjusted it a bit. It should be fine now.

ADD REPLY • link 6.9 years ago by EBP91 ▴ 50

0

Entering edit mode

Did you try to google it/ tried any solution?

I copy pasted your title and the first link I got is this: Extract fasta sequences from a large file using a list of names

ADD REPLY • link 6.9 years ago by venu 7.1k

0

Entering edit mode

Yes, I have been entirely through that thread (and several others) before posting here.

ADD REPLY • link 6.9 years ago by EBP91 ▴ 50

25

Entering edit mode

6.9 years ago

cpad0112 21k

with seqtk:

 $ seqtk subseq test.fa test.txt 
>Zotu1
ACTGACAAAGCATGCACGTCATTTT
>Zotu2
ATGCATCAGCATATGACCCCCGTTTA

with grep:

$ grep -w -A 2 -f  test.txt test.fa --no-group-separator
>Zotu1
ACTGACAAAGCA
TGCACGTCATTTT
>Zotu2
ATGCATCAGCATA
TGACCCCCGTTTA

inputs:

$ cat test.fa 
>Zotu1
ACTGACAAAGCA
TGCACGTCATTTT
>Zotu2
ATGCATCAGCATA
TGACCCCCGTTTA
>Zotu10
CGTCGAAAAATTT
CGATACACCCTAT
>Zotu22
CGTACGTCCCCTT
CGATATAATATATA

$ cat test.txt
Zotu1
Zotu2

ADD COMMENT • link 6.9 years ago by cpad0112 21k

3

Entering edit mode

grep -w -A 2 -f test.txt test.fa --no-group-separator doesn't work if there are special characters in the header, which is common. Use grep -w -A 2 -Ff test.txt test.fa --no-group-separator instead. -F searchers for a fixed string.

For anyone that only has one sequence per header in their fasta file, use grep -w -A 1 -f test.txt test.fa --no-group-separator instead.

ADD REPLY • link 5.4 years ago by alowi33 ▴ 50

1

Entering edit mode

What if the number of sequence lines under each header is different? I don't think so this would work.

ADD REPLY • link 5.1 years ago by Arindam Ghosh ▴ 540

0

Entering edit mode

Did the job! Thanks!

ADD REPLY • link 6.9 years ago by EBP91 ▴ 50

7

Entering edit mode

6.9 years ago

GenoMax 151k

Step 1: Get faSomeRecords utility from Jim Kent at UCSC. http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faSomeRecords (Linux link, OS X or source available).

Step 2: Make the file executable

$ chmod u+x faSomeRecords

Step 3: Run faSomeRecords

$ ./faSomeRecords
faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa

in.fa = Your sequence file

listfile = file with fasta headers/names (one per line)

out.fa = file to store the result

ADD COMMENT • link 6.9 years ago by GenoMax 151k

0

Entering edit mode

How can you make it work on piped files from zcat? eg. "zcat sequences.fas.gz | faSomeRecords ..."

ADD REPLY • link 5.1 years ago by fibar ▴ 90

0

Entering edit mode

You can use /dev/stdin:

zcat sequences.fas.gz | faSomeRecords /dev/stdin listFile out.fa

ADD REPLY • link 4.5 years ago by malte.ruehlemann • 0

7

Entering edit mode

6.9 years ago

finswimmer 16k

Strange (or funny) the third time today I recommended seqkit.

$ seqkit grep -n -f list.txt sequences.fas > newfile2.fas

should do the job.

fin swimmer

ADD COMMENT • link 6.9 years ago by finswimmer 16k

0

Entering edit mode

Does it work on piped files from zcat? eg. "zcat sequences.fas.gz | seqkit ...". Also, what is the difference with seqkit subseq from above?

ADD REPLY • link 5.1 years ago by fibar ▴ 90

6

Entering edit mode

6.9 years ago

Joe 22k

This has been asked a lot, so an existing solution will almost certainly match what you need to do.

You will usually need to linearise your fasta:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < myseqs.fasta > linear.fasta

then:

while read IDS ; do grep "\b$IDS\b" linear.fasta ; done < listofids.txt

You can use \b in the grep command to mark word boundaries so that Seq1 doesn't match Seq11 for example.

Then you can rewrap your sequences (partially) by using tr '\t' '\n'.

The code I use is below (using biopython though, so it is a more robust method).


	# Extract fasta files by their descriptors stored in a separate file.
	# Requires biopython

	from Bio import SeqIO
	import sys
	import argparse

	def getKeys(args):
	"""Turns the input key file into a list. May be memory intensive."""

	with open(args.keys, "r") as kfh:
	keys = []
	for line in kfh:
	line = line.rstrip('\n')
	line = line.lstrip('>')
	keys.append(line)

	return keys

	def main():
	"""Takes a string or list of strings in a text file (one per line) and retreives them and their sequences from a provided multifasta."""
	# Parse arguments from the commandline:
	try:
	parser = argparse.ArgumentParser(description='Retrieve one or more fastas from a given multifasta.')
	parser.add_argument(
	'-f',
	'--fasta',
	action='store',
	required=True,
	help='The multifasta to search.')
	parser.add_argument(
	'-k',
	'--keys',
	action='store',
	required=True,
	help='A string provided directly, or a file of header strings to search the multifasta for. Must be exact. Must be one per line.')
	parser.add_argument(
	'-o',
	'--outfile',
	action='store',
	default=None,
	help='Output file to store the new fasta sequences in. Just prints to screen by default.')
	parser.add_argument(
	'-v',
	'--verbose',
	action='store_true',
	help='Set whether to print the key list out before the fasta sequences. Useful for debugging.')
	parser.add_argument(
	'-i',
	'--invert',
	action='store_true',
	help='Invert the search, and retrieve all sequences NOT specified in the keyfile.')
	args = parser.parse_args()

	except:
	print('An exception occured with argument parsing. Check your provided options.')
	sys.exit(1)
	# Main code:
	# Call getKeys() to create the list of keys from the provided file:
	try:
	keys = getKeys(args.keys)

	# If -k/--keys was provided with a string, not a file path, the IO error is used as the indicator
	# to switch to expecting a string only, rather than a file.
	except IOError:
	keys = args.keys
	else:
	print("Couldn't determine a key from your provided file or string. Double check your file, or ensure your string is quoted correctly.")

	if args.verbose is not False:
	if args.invert is False:
	print('Fetching the following keys from: ' + inFile)
	for key in keys:
	print(key)
	else:
	print('Ignoring the following keys, and retreiving everything else from: ' + inFile)
	for key in keys:
	print(key)

	# Parse in the multifasta and assign an iterable variable:
	seqIter = SeqIO.parse(inFile, 'fasta')

	# For each sequence in the multifasta, check if it's in the keys[] tuple. If so, print it out:
	for seq in seqIter:
	if args.invert is False:
	if seq.id in keys:
	print(seq.format("fasta"))
	if args.outfile is not None:
	SeqIO.write(seq, outFile, "fasta")
	else:
	# If the --invert/-i flag is given, print all fastas NOT listed in the keyfile
	if seq.id not in keys:
	print(seq.format("fasta"))
	if args.outfile is not None:
	SeqIO.write(seq, outFile, "fasta")

	if __name__ == "__main__":
	main()

view raw fastafetcher.py hosted with ❤ by GitHub

ADD COMMENT • link 6.9 years ago by Joe 22k

0

Entering edit mode

4.3 years ago

Rashedul Islam ▴ 480

You can use this command:

while read line; do less fastaFile.fasta | seqkit grep -p $line; done < listFile >>out.fasta

ADD COMMENT • link 4.2 years ago by Rashedul Islam ▴ 480

Login before adding your answer.