Printing specific sequence and ID from combined fasta file using bash commands

2

Entering edit mode

7.7 years ago

SaltedPork ▴ 170

I have a combined fasta file with all my sequences. I want to print the ID lines and their DNA sequence that end in .4

So far I have awk /'>*.4'/ {getline;print} combined.fasta Which prints the sequences that I want, how do I get the ID lines as well?

fasta bash • 4.0k views

ADD COMMENT • link updated 7.7 years ago by Pierre Lindenbaum 166k • written 7.7 years ago by SaltedPork ▴ 170

0

Entering edit mode

Can you show an example of your input data and ideal output?

ADD REPLY • link 7.7 years ago by Joe 22k

0

Entering edit mode

Thanks for responding, I have ID's that end in .1, .2 and so on up until .8. I just want the .4's and their sequence.

Input:

>16U035667-26_S26_L001.1
AGCTACGT
>16U035667-26_S26_L001.2
AGCTAACGTAC
>16U035667-26_S26_L001.4
ACGTACGTACTGAC

Output:

>16U035667-26_S26_L001.4
ACGTACGTACTGAC

ADD REPLY • link 7.7 years ago by SaltedPork ▴ 170

2

Entering edit mode

7.7 years ago

Joe 22k

Since you requested bash only:

#!/bin/bash

string="$2"

while read line ; do
    if [[ ${line:0:1} == ">" ]] ; then
        header="$line"
    else
        seq="$line"
        if [[ "$header" == *"$string" ]]; then
            echo -e "$header""\n""$seq"
        fi
    fi
done < $1

Put it in a script file and run it like:

$ bash parseheaders.sh seqs.fasta .4

Some of my own test data:

$ cat seqs.fasta
>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGC
>tpg|Pyricularia_pennisetigena|AB818016
GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC

$ bash parseheaders.sh seqs.fasta 947
>tpg|Inocybe_sororia|EU525947
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC

ADD COMMENT • link 7.7 years ago by Joe 22k

1

Entering edit mode

This works very well, many thanks! PS. also didn't need to put quotes around the .4.

ADD REPLY • link 7.7 years ago by SaltedPork ▴ 170

2

Entering edit mode

You can accept multiple answers (green check mark). So you should accept this one one as well.

ADD REPLY • link 7.7 years ago by GenoMax 152k

1

Entering edit mode

Just as a final comment though, I'd advise following some of the other suggestions here and use proper parsers like bioperl, biopython, bioawk and so on. My own personal go-to script for pulling out sequences is here, using Biopython (though it finds the key of interest anywhere in the header, not just at the end, so it wasn't directly applicable to this case.


	# Extract fasta files by their descriptors stored in a separate file.
	# Requires biopython

	from Bio import SeqIO
	import sys
	import argparse

	def getKeys(args):
	"""Turns the input key file into a list. May be memory intensive."""

	with open(args.keys, "r") as kfh:
	keys = []
	for line in kfh:
	line = line.rstrip('\n')
	line = line.lstrip('>')
	keys.append(line)

	return keys

	def main():
	"""Takes a string or list of strings in a text file (one per line) and retreives them and their sequences from a provided multifasta."""
	# Parse arguments from the commandline:
	try:
	parser = argparse.ArgumentParser(description='Retrieve one or more fastas from a given multifasta.')
	parser.add_argument(
	'-f',
	'--fasta',
	action='store',
	required=True,
	help='The multifasta to search.')
	parser.add_argument(
	'-k',
	'--keys',
	action='store',
	required=True,
	help='A string provided directly, or a file of header strings to search the multifasta for. Must be exact. Must be one per line.')
	parser.add_argument(
	'-o',
	'--outfile',
	action='store',
	default=None,
	help='Output file to store the new fasta sequences in. Just prints to screen by default.')
	parser.add_argument(
	'-v',
	'--verbose',
	action='store_true',
	help='Set whether to print the key list out before the fasta sequences. Useful for debugging.')
	parser.add_argument(
	'-i',
	'--invert',
	action='store_true',
	help='Invert the search, and retrieve all sequences NOT specified in the keyfile.')
	args = parser.parse_args()

	except:
	print('An exception occured with argument parsing. Check your provided options.')
	sys.exit(1)
	# Main code:
	# Call getKeys() to create the list of keys from the provided file:
	try:
	keys = getKeys(args.keys)

	# If -k/--keys was provided with a string, not a file path, the IO error is used as the indicator
	# to switch to expecting a string only, rather than a file.
	except IOError:
	keys = args.keys
	else:
	print("Couldn't determine a key from your provided file or string. Double check your file, or ensure your string is quoted correctly.")

	if args.verbose is not False:
	if args.invert is False:
	print('Fetching the following keys from: ' + inFile)
	for key in keys:
	print(key)
	else:
	print('Ignoring the following keys, and retreiving everything else from: ' + inFile)
	for key in keys:
	print(key)

	# Parse in the multifasta and assign an iterable variable:
	seqIter = SeqIO.parse(inFile, 'fasta')

	# For each sequence in the multifasta, check if it's in the keys[] tuple. If so, print it out:
	for seq in seqIter:
	if args.invert is False:
	if seq.id in keys:
	print(seq.format("fasta"))
	if args.outfile is not None:
	SeqIO.write(seq, outFile, "fasta")
	else:
	# If the --invert/-i flag is given, print all fastas NOT listed in the keyfile
	if seq.id not in keys:
	print(seq.format("fasta"))
	if args.outfile is not None:
	SeqIO.write(seq, outFile, "fasta")

	if __name__ == "__main__":
	main()

view raw fastafetcher.py hosted with ❤ by GitHub

ADD REPLY • link 7.7 years ago by Joe 22k

0

Entering edit mode

It shouldn't, but in case the . doesn't play nicely with your terminal, just enclose the string to look for in quotes: ".4"

ADD REPLY • link 7.7 years ago by Joe 22k

2

Entering edit mode

7.7 years ago

Pierre Lindenbaum 166k

grep '^>.*4$' -A 1 --no-group-separator in.fa

ADD COMMENT • link 7.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Fantastic one-liner, many thanks!

ADD REPLY • link 7.7 years ago by SaltedPork ▴ 170

1

Entering edit mode

I believe this will only work so long as your sequences are single lines FYI

ADD REPLY • link 7.7 years ago by Joe 22k

2

Entering edit mode

7.7 years ago

Sean Davis 27k

How about:

awk '/>*.4/ {print $0; getline; print}' combined.fasta

ADD COMMENT • link 7.7 years ago by Sean Davis 27k

0

Entering edit mode

hmmmm, I just get everything from combined.fasta printed to terminal.

ADD REPLY • link 7.7 years ago by SaltedPork ▴ 170

1

Entering edit mode

I'm on Mac OS. You may need to adjust the syntax a bit. Sorry....

ADD REPLY • link 7.7 years ago by Sean Davis 27k

0

Entering edit mode

Odd. Works fine on CentOS.

ADD REPLY • link 7.7 years ago by GenoMax 152k

1

Entering edit mode