Question

How to create a FASTA file from a list of DNA sequences

0

3.3 years ago

Alex S ▴ 20

I have a file with the following structure:

Lcn.Chr1:75500000-95000000:1393900-1393947  gaaatgatttaattagattatttgaggtttgatgattaggattagag 1648480
Lcn.Chr1:75500000-95000000:1393980-1394025  AAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC   1648480
Lcn.Chr1:75500000-95000000:1394080-1394127  caccccaacttttataattgctatttaaattaattaattagtattgt 1648480

I've extracted the sequences using cut -f 2; now I need to convert them to FASTA format so I can use them as a database for a BLAST analysis. Any tips on how to add a FASTA header to each sequence? The IDs could be numbers: 001, 002, 003, and so on.

linux fasta • 1.5k views

2

3.3 years ago

Shred ★ 1.6k

In Python 3:

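import sys

# Read the original tab-separated file: ID, sequence, extra column
with open(sys.argv[1], 'r') as sequences:
    # Start counting at 1 so the headers are 001, 002, 003, ...
    for idx, line in enumerate(sequences, start=1):
        print(f">{idx:03d}")
        # The sequence is the second tab-separated field
        print(line.rstrip().split('\t')[1])

Launch it on the original tab-separated file (not the output of cut -f 2) as

python3 script.py your_input_file > output.fasta

It produces

>001
gaaatgatttaattagattatttgaggtttgatgattaggattagag
>002
AAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC
..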
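
If the columns might be separated by spaces rather than tabs (the pasted example does not make this clear), a slightly more defensive variant could look like the sketch below. The function name to_numbered_fasta and its width parameter are hypothetical, chosen only for illustration. It assumes the sequence is always the second whitespace-separated field and silently skips blank or short lines.

```python
def to_numbered_fasta(lines, width=3):
    """Return FASTA text with zero-padded numeric IDs (001, 002, ...).

    Hypothetical helper: assumes the sequence is the second
    whitespace-separated field of every line.
    """
    records = []
    for line in lines:
        fields = line.split()  # split on any run of tabs or spaces
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Number records by how many have been kept so far, starting at 1
        records.append(f">{len(records) + 1:0{width}d}\n{fields[1]}")
    return "\n".join(records) + "\n" if records else ""


example = [
    "Lcn.Chr1:75500000-95000000:1393900-1393947\tgaaatgatttaattagattatttgaggtttgatgattaggattagag\t1648480",
    "",
    "Lcn.Chr1:75500000-95000000:1393980-1394025  AAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC   1648480",
]
print(to_numbered_fasta(example), end="")
```

Because skipped lines do not consume a number, the IDs stay consecutive even if the input contains empty lines. With more than 999 sequences, a larger width keeps the IDs uniformly padded.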

score 2 · Accepted Answer · 2022-08-08

Using your example file (let's call it seqs.txt).

cat seqs.txt| while read line; do printf "%s%s\n%s\n" ">" $(echo $line | cut -d " " -f 1) $(echo $line | cut -d " " -f 2); done

produces:

>Lcn.Chr1:75500000-95000000:1393900-1393947
gaaatgatttaattagattatttgaggtttgatgattaggattagag
>Lcn.Chr1:75500000-95000000:1393980-1394025
AAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC
>Lcn.Chr1:75500000-95000000:1394080-1394127
caccccaacttttataattgctatttaaattaattaattagtattgt

That's a pretty ugly solution, but it should work.
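
For comparison, the same idea (keeping the original region IDs as FASTA headers) can be sketched in Python. The name to_fasta_keep_ids is hypothetical, and the sketch assumes the ID and the sequence are the first two whitespace-separated fields of each line.

```python
def to_fasta_keep_ids(lines):
    """Yield one FASTA record per input line, using the first field as header.

    Hypothetical helper: assumes fields are separated by tabs or spaces,
    with the ID first and the sequence second.
    """
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if len(fields) >= 2:   # ignore blank or incomplete lines
            yield f">{fields[0]}\n{fields[1]}"


rows = [
    "Lcn.Chr1:75500000-95000000:1393900-1393947  gaaatgatttaattagattatttgaggtttgatgattaggattagag 1648480",
    "Lcn.Chr1:75500000-95000000:1393980-1394025  AAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC   1648480",
]
for record in to_fasta_keep_ids(rows):
    print(record)
```

Keeping the coordinates in the header makes it possible to trace each BLAST hit back to its source region, which the numeric IDs alone would not allow.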