Question

extracting sequences from a fasta file

7.5 years ago

ashish ▴ 680

I am using the Biopython script from this post (the first answer, written by Eric). That post is very old, so I am opening a new one. The script takes IDs from a .txt file and extracts the corresponding sequences from a FASTA file. My problem is that the IDs I want to extract are in the header description, whereas the script only looks at the first word after the > sign. How do I change it so that it looks for my IDs in the header description? I tried changing it myself after reading the comments, but I think I am doing it wrong. My .txt ID file looks like this:

TRIAE_CS42_1AL_TGACv1_000062_AA0001

TRIAE_CS42_1AL_TGACv1_000089_AA0002

TRIAE_CS42_1AL_TGACv1_000099_AA0003

TRIAE_CS42_1AL_TGACv1_000110_AA0004

TRIAE_CS42_1AL_TGACv1_000140_AA0005

The header in the fasta file looks like this:

>TRIAE_CS42_U_TGACv1_641895_AA2106830.1 pep scaffold:TGACv1:TGACv1_scaffold_641895_U:99996:109837:1 gene:TRIAE_CS42_U_TGACv1_000110_AA0004 transcript:TRIAE_CS42_U_TGACv1_641895_AA2106830.1 gene_biotype:protein_coding transcript_biotype:protein_coding

biopython python • 5.6k views

ADD COMMENT • link updated 7.5 years ago by Buffo ★ 2.4k • written 7.5 years ago by ashish ▴ 680

So, are the entries in the list gene names?

ADD REPLY • link 7.5 years ago by shenwei356 8.7k

The list contains gene IDs, and the FASTA file has protein sequences with the gene ID written in the header description.

ADD REPLY • link 7.5 years ago by ashish ▴ 680

above solution should work.

ADD REPLY • link 7.5 years ago by shenwei356 8.7k

This worked very well. It was so easy. Can you explain what "gene:([^ ]+)" means? In the tool help I found this line:
--id-regexp string regular expression for parsing ID (default "^([^\s]+)\s?")
What do the symbols mean?

ADD REPLY • link 7.5 years ago by ashish ▴ 680

Test it using a regular expression tester page.

ADD REPLY • link 7.5 years ago by GenoMax 147k

Thanks, this was really important for me to see.

ADD REPLY • link 7.5 years ago by ashish ▴ 680

It's a regular expression for matching "gene:xxxxx". [^ ]+ matches a gene ID consisting of non-space characters, and seqkit uses () to capture the xxxxx as the FASTA ID.

ADD REPLY • link 7.5 years ago by shenwei356 8.7k

ashish : I moved @shenwei356's comment to an answer. Since it worked for you, please accept the answer (use green check mark) to provide closure for this question.

ADD REPLY • link 7.5 years ago by GenoMax 147k

7.5 years ago

Buffo ★ 2.4k

This is my Python script to do that. Save it as sequence_extractor.py and run it :)

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# Note: written for Python 2 (it uses the print statement).

import sys


##########################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python sequence_extractor.py list.txt fasta_file.fasta
">" has to be included in the names
------------------------------------------------------------------------------------
'''
##########################################################################################

if len(sys.argv) != 3:
    print syntax
    sys.exit()

##########################################################################################

# Read the wanted header lines (one per line, including the leading ">").
nombres = set()
lista_seq = open(sys.argv[1], 'r')
for line in lista_seq:
    line = line.rstrip('\n')
    if line:
        nombres.add(line)
lista_seq.close()

name = None
seq = ""

fasta_seqs = open(sys.argv[2], 'r')
for line in fasta_seqs:
    line = line.rstrip('\n')
    if line.startswith('>'):
        # A new record starts: print the previous one if it was requested.
        if name in nombres:
            print name + '\n' + seq
        seq = ""                            # always reset, even when the record is skipped
        name = line                         # full header line; use line.split()[0] to match on the first word only
    else:
        seq = seq + line
fasta_seqs.close()

# Print the last record in the file.
if name in nombres:
    print name + '\n' + seq

ADD COMMENT • link 7.5 years ago by Buffo ★ 2.4k

It gives an error: SyntaxError: Missing parentheses in call to 'print' at line 15

ADD REPLY • link 7.5 years ago by ashish ▴ 680

This script needs Python 2; you ran it with Python 3.

ADD REPLY • link 7.5 years ago by shenwei356 8.7k
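For readers on Python 3, here is a minimal sketch of the same idea that also matches on the gene: field in the header description, which is what the question asks for. It is not a port endorsed by the script's author; the function names `read_fasta` and `extract_by_gene` and the tiny example data are made up for illustration, and it uses only the standard library rather than Biopython.

```python
import io
import re

# Same pattern as the one used with seqkit elsewhere in this thread.
GENE_RE = re.compile(r"gene:([^ ]+)")


def read_fasta(handle):
    """Yield (header_without_'>', sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_by_gene(id_handle, fasta_handle):
    """Return the FASTA records whose 'gene:' field is listed in id_handle."""
    wanted = {line.strip() for line in id_handle if line.strip()}
    hits = []
    for header, seq in read_fasta(fasta_handle):
        match = GENE_RE.search(header)
        if match and match.group(1) in wanted:
            hits.append((header, seq))
    return hits


# Hypothetical example data; real use would pass open("ids.txt") and open("seqs.fa").
ids = io.StringIO("GENE_B\n")
fasta = io.StringIO(
    ">tx1.1 pep gene:GENE_A transcript:tx1.1\nMKV\nLLA\n"
    ">tx2.1 pep gene:GENE_B transcript:tx2.1\nMST\nQQ\n"
)
for header, seq in extract_by_gene(ids, fasta):
    print(">" + header)
    print(seq)
```

Keeping the wanted IDs in a set makes each lookup cheap, which matters when the ID list is long. Sequences are written on a single line, as in the original script.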

score 2 · Accepted Answer · 2017-05-18

7.5 years ago

shenwei356 8.7k

An easy way, provided by seqkit:

seqkit grep -f ids.txt --id-regexp  "gene:([^ ]+)"   seqs.fa

This is needed because the ID you want to search for is in the FASTA description, not in the regular FASTA ID. SeqKit provides a way to specify where the ID is by using a regular expression.

"gene:([^ ]+)" is a regular expression for matching "gene:xxxxx", which contains the gene ID to search for. [^ ]+ matches a gene ID consisting of non-space characters, and seqkit uses () to capture the xxxxx as the FASTA ID.

ADD COMMENT • link 7.5 years ago by shenwei356 8.7k
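To make the two patterns in this thread concrete, the sketch below applies them to the header from the question using Python's `re` module. SeqKit itself is written in Go, so this only illustrates what the patterns capture, on the assumption that such simple expressions behave the same in both regex engines.

```python
import re

# Header from the question, without the leading ">".
header = (
    "TRIAE_CS42_U_TGACv1_641895_AA2106830.1 pep "
    "scaffold:TGACv1:TGACv1_scaffold_641895_U:99996:109837:1 "
    "gene:TRIAE_CS42_U_TGACv1_000110_AA0004 "
    "transcript:TRIAE_CS42_U_TGACv1_641895_AA2106830.1 "
    "gene_biotype:protein_coding transcript_biotype:protein_coding"
)

# Default pattern quoted from the tool help:
#   ^        start of the header
#   ([^\s]+) capture one or more non-whitespace characters
#   \s?      followed by an optional whitespace character
default_id = re.search(r"^([^\s]+)\s?", header).group(1)

# Pattern from the answer: the literal text "gene:" followed by a
# captured run of one or more non-space characters.
gene_id = re.search(r"gene:([^ ]+)", header).group(1)

print(default_id)  # TRIAE_CS42_U_TGACv1_641895_AA2106830.1
print(gene_id)     # TRIAE_CS42_U_TGACv1_000110_AA0004
```

The parentheses form a capture group: only the text inside them is used as the ID, while "gene:" itself is matched but discarded.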

Hello, I have installed SeqKit. I am trying to extract some sequences based on the sequence IDs provided in a text document. My ids.txt file looks like this:

bin1 wbah10_accessory_1487_length_224941
bin2 wbah10_accessory_1485_length_153623
bin4 wbah10_accessory_1593_length_85091
bin5 wbah10_accessory_0973_length_66623
bin6 wbah10_accessory_0972_length_51198
bin7 wbah10_accessory_1486_length_50757
bin8 wbah10_accessory_0969_length_49768

and the header in the query file looks like this:

bin1 wbah10_accessory_1487_length_224941 tccgccttcgctaaagcttccgccttcgccaaggcttcggcgcgacaagtccgcttcggcccgatttctcaccagaatttgcgattttttacggcgccggactcgcggagggtccccctcacccggaatccgcgcgtcgcgcggattccggcctctccccggcggggagaggcgaagggaagcggcccttatttcggcaggaattcctgcgcaacgcccataccga

seqkit grep -f ids.txt --id-regexp "gene:([^ ]+)" seqs.fa

but I am getting an error mentioned below:

[ERRO] fastx: stdin not detected

I am new to the command line. Any help would be appreciated.

Thanks in advance

Sohini

ADD REPLY • link 6.4 years ago by sohiniguha1985 • 0

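For your case (-n matches against the full header line instead of just the ID):

seqkit grep -n -f ids.txt seqs.fa -o result.fa

The error "[ERRO] fastx: stdin not detected" means that no input file was provided and nothing was detected on stdin either.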

ADD REPLY • link 6.4 years ago by shenwei356 8.7k
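The difference between matching on the ID and matching on the full name can be illustrated in a few lines of Python. This is only a sketch of the idea, not SeqKit's implementation, and it assumes that each line of ids.txt holds a complete header such as "bin1 wbah10_accessory_1487_length_224941".

```python
import re

# Header from the comment above, assumed to be the text after ">" in the FASTA file.
header = "bin1 wbah10_accessory_1487_length_224941"
wanted = {"bin1 wbah10_accessory_1487_length_224941"}  # one line of ids.txt

# The default ID is the first whitespace-delimited word of the header.
seq_id = header.split()[0]

print(seq_id in wanted)   # False: "bin1" alone is not in the list
print(header in wanted)   # True: the full name equals the listed line

# The gene: pattern from the accepted answer finds nothing in this header.
print(re.search(r"gene:([^ ]+)", header) is None)  # True
```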