Add filename to fasta headers of multiple fasta files inside loop
2
1
Entering edit mode
7.3 years ago
bioinfo8 ▴ 230

Hi,

I have 10 fasta files (each file with 20 gene sequences from each of the 10 samples). I would like to create 20 files, specific to each gene from 10 samples. I proceeded as follows to extract genes with the file_name in header:

pyfasta extract --header --fasta test.fasta gene_name1 | awk '/^>/ {$0=$0 "_sample1"}1' > gene_name1.fasta

Output:

>gene_name1_sample1
ATGC

I am successful in creating multiple gene fasta files for each gene from each sample (a part from loop):

 pyfasta extract --header --fasta $sample.fasta gene_name1 >> gene_name1.fasta 
 pyfasta extract --header --fasta $sample.fasta gene_name2 >> gene_name2.fasta

But, I am unable to add file_name to the header of files in loop (but can do for 1 file as mentioned in the beginning).

Kindly guide.

Thanks.

pyfasta header fasta bash gene • 7.3k views
ADD COMMENT
0
Entering edit mode

Can you post an example of some of the headers from one of the sample fasta? Is each sample fasta a multi-line or single line fasta? Is each header, per gene, identical across sample fastas?

ADD REPLY
0
Entering edit mode

Header from sample1.fasta file

 >gene_name1
 ATGC..............................max upto 120 characters per line
 TTTG..............................................................
 >gene_name2
 ATGG..............................................................

 ....

Like this upto 20 gene names and their sequences. Each sample fasta files are multi-line. Gene name is same across sample fastas.

ADD REPLY
2
Entering edit mode
7.3 years ago
st.ph.n ★ 2.7k

The question is pretty generalized with the use of terms. From what I understand, this should work:

If each sample file are not single line fasta, linearize, on commandline:

for file in *.fasta; do awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $file > "`basename $file .fasta`_single-line.fasta"; done

Then run this python script:

#!/usr/bin/env python

import glob
from collections import defaultdict

genes = defaultdict(list)

for file in glob.glob('*_single-line.fasta'):
    with open(file, 'r') as f:
        for line in f:
            if line.startswith(">"):
                # Gene ID : (sample name, sequence)
                genes[line.strip().split('>')[1]].append((file.split('.fasta')[0], next(f).strip()))

for g in genes:
    # open fasta per gene name
    with open(g + '.fasta', 'w') as out:
        # for each sample name and sequence
        for n in range(len(genes[g])):
            # print to file, >gene_sample, sequence
            out.write('>' + g + '_' + genes[g][n][0] + '\n' + genes[g][n][1])
ADD COMMENT
0
Entering edit mode

Thanks. But, I want to combine with pyfasta so that I can extract gene from each sample file + add sample name (file name) to header in one step. Can you suggest some easier way, such as awk, sed etc. As awk is working for one file and I did not do any linerization eventhough that file is multi-line fasta.

ADD REPLY
0
Entering edit mode

I want to combine with pyfasta so that I can extract gene from each sample file + add sample name (file name) to header in one step

OK, so this is clearly a two step process, however the code can be converted to not have to use the awk statement first, and simply input the multi-line fasta.

ADD REPLY
0
Entering edit mode

@st.ph.n I explained my query in more detail in above comment. I hope its more clear now.

ADD REPLY
1
Entering edit mode

Let's first check if the above does what you need.

Put this into a bash script, run.sh:

#!/usr/bin/bash

for file in *.fasta; do 
    awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $file > "`basename $file .fasta`_single-line.fasta"
done

python combine.py

And the above python code into, combine.py.

Run the pipeline as bash combine.py in the directory with your sample fastas. This will run in one step.

ADD REPLY
0
Entering edit mode

Thanks @st.ph.n. run.sh worked. :)

combine.py also worked but file name did not appear near to 1st sequence and no sequence for last. Also, all files are of same size, but I did not understand that all files are in order except file 8 and 9.

#gene1.fasta
>gene1_file10
ATGC...............................>gene1_file8
ATGC...............................>gene1_file9
NTGC...............................>gene1_file7
ATGC...............................>gene1_file6
NNGC...............................>gene1_file5
ATGC...............................>gene1_file4
ATGC...............................>gene1_file3
ATGC...............................>gene1_file2
ATGC...............................>gene1_file1
ATGC...............................

#gene2.fasta
Same for gene2 and other genes
ADD REPLY
0
Entering edit mode

What are you using to view the new fasta? I'm confused about what the issue is.

ADD REPLY
0
Entering edit mode

Notepad++ editor. I mean file names are shifted.

ADD REPLY
1
Entering edit mode

What do you need to do with the new files? I recommend vi, or nano on the CLI. Notepad sometimes has issues with newlines, in my experience. Word typically handles these okay.

ADD REPLY
0
Entering edit mode

Same issue with vi, vim also.

ADD REPLY
0
Entering edit mode

I don't see why the file is appearing like you posted. I ran a test on this script, and it appears fine. What python version are you using? python --version

try changing this line:

out.write('>' + g + '_' + genes[g][n][0] + '\n' + genes[g][n][1])

to this, to print to stdout and file:

print '>' + g + '_" + genes[g][n][0], '\n', genes[g][n][1]
print >> out, '>' + g + '_' + genes[g][n][0], '\n', genes[g][n][1]
ADD REPLY
1
Entering edit mode

You just need to add a '\n' at the end of the string using the write method.

ADD REPLY
0
Entering edit mode

I've never had an issue with how it's formatted in the past since within a loop, each print line should print to a new line for the header and another for the sequence, but adding the newline character to the end should fix it.

ADD REPLY
0
Entering edit mode

Thanks Matt Shirley

ADD REPLY
0
Entering edit mode

@st.ph.n

Python 2.7.12

I edited combine.py and ran again, but it showed following error:

  File "combine.py", line 25
    print  '>'  + g + '_" + genes[g][n][0], '\n', genes[g][n][1]
                                                              ^
SyntaxError: unexpected character after line continuation character
ADD REPLY
1
Entering edit mode

just add the '\n' as Matt suggested.

out.write('>' + g + '_' + genes[g][n][0] + '\n' + genes[g][n][1] + '\n')
ADD REPLY
0
Entering edit mode

Thank you very much st.ph.n, it worked :-)
Would you please provide more explanation for the script, if possible?

ADD REPLY
0
Entering edit mode

There are some comment lines. What needs further explanation?

ADD REPLY
0
Entering edit mode
genes[line.strip().split('>')[1]].append((file.split('.fasta')[0], next(f).strip()))

The meaning of [1] and [0]?

ADD REPLY
1
Entering edit mode

[int] specifies in index, in this case line.strip().split('>')[1] is splitting the header into '>' and 'gene name', and python starts at 0 so we're taking the gene name. Sae with splitting the file name into parts, 'file name' and '.fasta', and we're taking the file name.

ADD REPLY
1
Entering edit mode
7.3 years ago

You can use pyfaidx to do this in one step:

$ pip install pyfaidx
$ faidx -x -e "lambda x: x + '_sample1'" test.fasta

The faidx script is comparable to the pyfasta script, but has many more features. It also creates/uses the .fai index, just like samtools. The -x argument splits the input file into separate files, named using the header. The -e argument takes a python lambda function that returns a modified header name, allowing you to do complex transformations. If you don't pass any region positional arguments, faidx will operate over all sequences in the input FASTA file, so you don't need a bash loop.

ADD COMMENT
0
Entering edit mode

@Matt Shirley Thanks, it will take only one fasta file at a time. My aim is to extract the genes with similar gene name from all the fasta files and make gene specific fasta files with updated header including gene name and file name (so that I should know from which file that gene came) + append the gene sequences in the file with that gene name. Here are the sample input and output files:

Input files:
#file1.fasta

>gene1
ATGC
>gene2
ATGA
>gene3
ATGTTT

#file2.fasta

>gene1
ATGG
>gene2
ATGC
>gene3
ATGTT

Expected output files:

#gene1.fasta
>gene1_file1
ATGC
>gene1_file2
ATGG

#gene2.fasta
>gene2_file1
ATGA
>gene2_file2
ATGC
ADD REPLY

Login before adding your answer.

Traffic: 1984 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6