Question

Editing fasta headers with bash

0

Entering edit mode

7.0 years ago

James ▴ 20

Hi everyone, I have been trying to edit the sequence titles within a fasta file and have come up short. Please can someone help if they can. Sorry if this is a dupication. I've tried searching around. There ar elots of similar things but i don't know how to change them to my specific situation.

I am doing illumina full genome sequencing of influenza viruses. The script that processes the data outputs a comnsensus sequence fasta file. Within that file there are 8 sequences. The file has a name based on the sample number submitted to the sequencingguuys, and the sequences inside the file have titles that are actually the reference used to map data against. I want to rename the file, and all 8 sequences, with the same name, but the 8 segments also need to say which gene they are.

I have started writing a script but am now stuck, please can anyone point me in the right direction.

This is the script i started Not sure how this will display

#!/bin/bash
set -e

#script to copy, rename, and edit fasta files
#Needs an input file
fastafile=$1

# Get path from file
DIR=$(dirname "$fastafile")
echo "$DIR"

#echo information if script not used properly
if [ $# -ne 1 ]; then
    echo "Use: FluSeqOutputRename.sh <a fasta file you want renamed>
"
    exit 1
fi

echo "
    This script will edit "$fastafile"

    Re-naming the file and editing the fasta headers
"

#check how many sequences are in the fasta file
NoSeqs=$(grep -o '>' "$fastafile" | wc -l)

echo "
There are "$NoSeqs" gene segments
"

if [ "$NoSeqs" -ne 8 ]; then
echo "
There are NOT 8 gene segments

Please Start Again
"
exit 1
fi

#i might be able to just do it with 1 input IF i can replace the / with - later
read -p "Please enter the virus name: " virname
NewFileName=$(echo "$virname" | tr '-' '_' | tr '/' '-')

echo "$virname"
echo "$NewFileName"

read -p "Your file will be named "$NewFileName" is this correct y or n?: " correct

    if [ "$correct" = n ]; then
        echo "Please Start again"
        exit 1
        fi


echo "I am continuing to copy and edit your file"

#Now need to copy the file and rename it
cp -i -v "$fastafile" "$DIR"/"$NewFileName".fasta

#fastacopy="$DIR"/"$flutype-"$species-"$country"-"$identifier"-"$year".fasta
fastacopy="$DIR"/"$NewFileName".fasta
echo "The new file is 

"$fastacopy""

#I can make a new text file which is the header lines i want but does this even help?
echo "$virname" PB2 > "$DIR"/SequenceNames.txt
echo "$virname" PB1 >> "$DIR"/SequenceNames.txt
echo "$virname" PA >> "$DIR"/SequenceNames.txt
echo "$virname" HA >> "$DIR"/SequenceNames.txt
echo "$virname" NP >> "$DIR"/SequenceNames.txt
echo "$virname" NA >> "$DIR"/SequenceNames.txt
echo "$virname" MP >> "$DIR"/SequenceNames.txt
echo "$virname" NS >> "$DIR"/SequenceNames.txt

but thats as far as I can get. Any help gratefully received! Thank you James

fasta headers fasta bash script • 3.6k views

ADD COMMENT • link updated 7.0 years ago by Joe 22k • written 7.0 years ago by James ▴ 20

0

Entering edit mode

Hello James

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY • link 7.0 years ago by GenoMax 153k

0

Entering edit mode

Many thanks genomax!

ADD REPLY • link 7.0 years ago by James ▴ 20

0

Entering edit mode

Hi James,

More useful than your code at this stage, would be examples of the input files, and the desired output format, since it's not super clear to me so far.

ADD REPLY • link 7.0 years ago by Joe 22k

0

Entering edit mode

Hi, My input file looks something like this Sample12-top_matches-iter4_consensus.fa

>KP288011-Influenza_seg1-iter4\n
AGAGATTTGATGTCGCAGTCTCGCACTCGCGAGATACTA\n
ATGGCCATAATCAAGAAATATACGTCAGGAAGACAGGAG\n
AAATGGATGATGGCAATGAAATATCCGATTACAGCAGAC\n
CCTGAAAGAAATGAGCAAGGTCAGACTCTTTGGAGCAAG\n
>KU143516-Influenza_seg2-iter4\n
TATGACTGGACACTGAACAGAAACCAACCAGC\n
GTGTTCAGATCGAATGGTCTGACAGCCAGTGA\n
GGATGTGATGGAATCAATGGATAAAGAAGAGAT\n
AAGAAGAGTGAGGGACAACATTACCAAGAAGAT\n
AAGCAGAGGCTGAACAAGAGGAGTTACTTAAT\n
>JX534572-Influenza_seg3-iter4\n
ACTTTGTGCGACAATGTTTCAATCCAATG\n
AATATGGGGAAGATCCGAAAATCGAAACA\n
GAAGTCTGTTTCATGTATTCAGACTTCCAC\n
TAGAATCTGGCGATCCGAACGCATTATTG\n
>KP732646-Influenza_seg4-iter4\n
TACATAGTGGAGAGGGATAAT\n
TGACTATGAAGAACTGAAACACCTG\n
TCATCCCTAAGAGTTCTTGGCCCAAT\n
TCCATACCAGGGAACGCCCTCCTTTTTC\n
>KX977633-Influenza_seg5-iter4\n
TCAGCATCATGGCGTCTCAAGGC\n
AGCGCCAGAATGCCACCGAGAT\n
AGGTTCTACATACAGATGTGCACT\n
>KX978671-Influenza_seg6-iter4\n
GAAGGTAATATGCATTTCAGCCACAGGA\n
AATTGCCAATTTGGGCCTAAACATCGGA\n
>EU735819-Influenza_seg7-iter4\n
CGAAACGTACGTTCTCTCTATTGTCCC\n
ACTTGAAGATGTCTTTGTAGGGAAGAA\n
GACAAGACCAATCCTGTCACCTCTGAC\n
GTGCCCAGTGAGCGAGGACTGCAGCG\n
TGGAGACCCAAACAATATGGACAGGGC\n
>CY028255-Influenza_seg8-iter4\n
GGTAGATTGCTTTCTTTGGT\n
CCCGTTCCTTGACCGGCTTC\n

And my desired output, because I end up in windows after this, would be

A-turkey-Scotland-123_456-2015.fasta

>A/turkey/Scotland/123-456/2015 PB2\r\n
AGAGATTTGATGTCGCAGTCTCGCACTCGCGAGATACTAATGGCCATAATCAAGAAATATACGTCAGGAAGACAGGAGAAATGGATGATGGCAATGAAATATCCGATTACAGCAGACCCTGAAAGAAATGAGCAAGGTCAGACTCTTTGGAGCAAG\r\n
>A/turkey/Scotland/123-456/2015 PB1\r\n
TATGACTGGACACTGAACAGAAACCAACCAGCGTGTTCAGATCGAATGGTCTGACAGCCAGTGAGGATGTGATGGAATCAATGGATAAAGAAGAGATAAGAAGAGTGAGGGACAACATTACCAAGAAGATAAGCAGAGGCTGAACAAGAGGAGTTACTTAAT\r\n
>A/turkey/Scotland/123-456/2015 PA\r\n
ACTTTGTGCGACAATGTTTCAATCCAATGAATATGGGGAAGATCCGAAAATCGAAACAGAAGTCTGTTTCATGTATTCAGACTTCCACTAGAATCTGGCGATCCGAACGCATTATTG\r\n
>A/turkey/Scotland/123-456/2015 HA\r\n
ACATAGTGGAGAGGGATAATTGACTATGAAGAACTGAAACACCTGTCATCCCTAAGAGTTCTTGGCCCAATTCCATACCAGGGAACGCCCTCCTTTTTC\r\n
>A/turkey/Scotland/123-456/2015 NP\r\n
TCAGCATCATGGCGTCTCAAGGCAGCGCCAGAATGCCACCGAGATCAGGTTCTACATACAGATGTGCACT\r\n\
>A/turkey/Scotland/123-456/2015 NA\r\n
GAAGGTAATATGCATTTCAGCCACAGGAAATTGCCAATTTGGGCCTAAACATCGGA\r\n
>A/turkey/Scotland/123-456/2015 MP\r\n
ACGTACGTTCTCTCTATTGTCCCACTTGAAGATGTCTTTGTAGGGAAGAAGACAAGACCAATCCTGTCACCTCTGAGTGCCCAGTGAGCGAGGACTGCAGCGTGGAGACCCAAACAATATGGACAGGGC\r\n
>A/turkey/Scotland/123-456/2015 NS\r\n
GGTAGATTGCTTTCTTTGGTCCCGTTCCTTGACCGGCTTC\r\n

ADD REPLY • link 7.0 years ago by James ▴ 20

0

Entering edit mode

I'd recommend quoting variables so:

#Now need to copy the file and rename it
cp -i -v "$fastafile" "${DIR}/${NewFileName}.fasta"

fastacopy="${DIR}/${NewFileName}.fasta"

ADD REPLY • link 7.0 years ago by Ram 45k

4

Entering edit mode

read -p "Please enter the virus name: " virname

"everytime you ask for an interaction or everytime you print your messages to stdout, god kills a kitten"

Minimum Standards For Bioinformatics Command Line Tools

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Maybe so, but maybe there are too many kittens in the world anyway. Can't think of any other way to tell the script what the name should be, it's not in another file. It needs to be made up by the operator. And I will remove most of the echo lines once I understand it's working. Was just using them to show me it was working how I expected :)

ADD REPLY • link 7.0 years ago by James ▴ 20

1

Entering edit mode

You do it as a flag to the command to start with :)

My answer shows a complicated/robust way to do it, but at its most basic it can be done using the 'special' variables" $1, $2, $3... etc

e.g:

$ bash myscript.sh arg1 arg2 arg3
           ^        ^    ^    ^
           $0       $1  $2   $3

So you could do:

bash script.sh virusname fasta.fa

and virusname == $1... etc.

ADD REPLY • link 7.0 years ago by Joe 22k

0

Entering edit mode

Thank Ram I'll make those edits

ADD REPLY • link 7.0 years ago by James ▴ 20

score 2 · Answer 1 · 2018-09-04

This is a little hacky because I'm missing something with sed about matching occurrence numbers rather than line numbers but I've given up!

NOTE: 2 caveats to this script functioning:

Your sequences must first be linearised if they aren't already. Here's a one liner which will let you do that: awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < myseqs.fasta
That your sequences always occur in the same order your tags are listed in.

Since you seem to want a 'front-end' to the script, I've gone ahead and made it nice and 'commandline-y'.

#!/bin/bash

usage(){
cat << EOF

usage: $0  -i <input_fasta> -n <name_string>

Redirect to an output file:

$0 -i <input_fasta> -n <name_string>  >  Output.fasta

Required:
   -i | --infile      Input fasta file
   -n | --name        Viral name for prepending.

Options:
   -h | --help        Show this message

EOF
}

for arg in "$@"; do
  shift
  case "$arg" in
        "--help")     set -- "$@" "-h"   ;;
        "--infile")   set -- "$@" "-i"   ;;
        "--outfile")  set -- "$@" "-o"   ;;
        "--name")     set -- "$@" "-n"   ;;
        *)            set -- "$@" "$arg" ;;
  esac
done
while getopts "hi:o:n:" OPTION ;do
  case $OPTION in
        i) infile=$OPTARG   ;;
        o) outfile=$OPTARG  ;;
        n) name=$OPTARG     ;;
        h) usage ; exit 0   ;;
  esac
done

numseqs=$(grep -o ">" "$infile" | wc -l)
if [ "$numseqs" != 8 ] ; then
 echo >&2 "There are $numseqs in the file: $infile"
 echo >&2 "There should only be 8 sequences. Check your starting file, and try again."
 exit 1
fi

if [[ -z "$infile" ]]; then
  usage ; echo "No input file provided. Exiting." ; exit 1
fi
if [[ -z "$name" ]]; then
  usage ; echo "No virus name string provided. Exiting." ; exit 1
fi

sed -e "1,/>/ s/^>.*/>${name}_PB2/" \
    -e "3,/>/ s/^>.*/>${name}_PB1/" \
    -e "5,/>/ s/^>.*/>${name}_PA/" \
    -e "7,/>/ s/^>.*/>${name}_HA/" \
    -e "9,/>/ s/^>.*/>${name}_NP/" \
    -e "11,/>/ s/^>.*/>${name}_NA/" \
    -e "13,/>/ s/^>.*/>${name}_MP/" \
    -e "15,/>/ s/^>.*/>${name}_NS/" "$infile"

As a final note, I believe this will only work with GNU sed - i.e. sed that ships with BSD (like MacOS) needs slightly different syntax.