extracting sequences from a fasta file
2
0
Entering edit mode
7.5 years ago
ashish ▴ 680

I am using this biopython script from this post, first answer written by Eric. The post was very old so I am adding a new post for it. the script take ids from a .txt file and extracts their corresponding sequences from another fasta file. But I've a problem here, the ids i am extracting lies in the description whereas the script searches just the first word after > sign. how do I change it so that it can look for the ids i am providing in the header description. I tried changing it myself after reading the comments but I think I am doing it wrong. my .txt id file look like this:

TRIAE_CS42_1AL_TGACv1_000062_AA0001

TRIAE_CS42_1AL_TGACv1_000089_AA0002

TRIAE_CS42_1AL_TGACv1_000099_AA0003

TRIAE_CS42_1AL_TGACv1_000110_AA0004

TRIAE_CS42_1AL_TGACv1_000140_AA0005

The header in the fasta file looks like this:

>TRIAE_CS42_U_TGACv1_641895_AA2106830.1 pep scaffold:TGACv1:TGACv1_scaffold_641895_U:99996:109837:1 gene:TRIAE_CS42_U_TGACv1_000110_AA0004 transcript:TRIAE_CS42_U_TGACv1_641895_AA2106830.1 gene_biotype:protein_coding transcript_biotype:protein_coding
biopython python • 5.6k views
ADD COMMENT
0
Entering edit mode

so, are the list gene names?

ADD REPLY
0
Entering edit mode

The list are gene ids And the fasta file have protein sequences which have the gene id written in the header description

ADD REPLY
0
Entering edit mode

above solution should work.

ADD REPLY
0
Entering edit mode

This worked very well. It was so easy. Can you explain what does this "gene:([^ ]+)" mean. In the tool help I found this line:
--id-regexp string regular expression for parsing ID (default "^([^\s]+)\s?") what does the symbols mean?

ADD REPLY
1
Entering edit mode

Test using regular expression tester page.

ADD REPLY
0
Entering edit mode

Thanks, this was really important for me to see.

ADD REPLY
1
Entering edit mode

it's a regular expression for matching "gene:xxxxx", 、[^ ]+ is for gene id consisting of non-space characters, and seqkit has to use () to capture the xxx as FASTA ID.

ADD REPLY
0
Entering edit mode

ashish : I moved @shenwei356's comment to an answer. Since it worked for you, please accept the answer (use green check mark) to provide closure for this question.

ADD REPLY
2
Entering edit mode
7.5 years ago

An easy way provided by seqkit:

seqkit grep -f ids.txt --id-regexp  "gene:([^ ]+)"   seqs.fa

Cause the ID you want to search is in the FASTA description not the regular FASTA ID. SeqKit provides way to specify where the ID is by regular expression.

"gene:([^ ]+)" is a regular expression for matching "gene:xxxxx" which contains the gene-id for searching. [^ ]+ is for gene id consisting of non-space characters, and seqkit has to use () to capture the xxxxx as FASTA ID.

ADD COMMENT
0
Entering edit mode

Hello, I have installed Seqkit.I am trying to list out some sequences based on the Seq ids.provided in a text doc.I am using this code. my ids.txt file looks like this: bin1 wbah10_accessory_1487_length_224941 bin2 wbah10_accessory_1485_length_153623 bin4 wbah10_accessory_1593_length_85091 bin5 wbah10_accessory_0973_length_66623 bin6 wbah10_accessory_0972_length_51198 bin7 wbah10_accessory_1486_length_50757 bin8 wbah10_accessory_0969_length_49768

and the header in the query file looks like this:

bin1 wbah10_accessory_1487_length_224941 tccgccttcgctaaagcttccgccttcgccaaggcttcggcgcgacaagtccgcttcggcccgatttctcaccagaatttgcgattttttacggcgccggactcgcggagggtccccctcacccggaatccgcgcgtcgcgcggattccggcctctccccggcggggagaggcgaagggaagcggcccttatttcggcaggaattcctgcgcaacgcccataccga

seqkit grep -f ids.txt --id-regexp "gene:([^ ]+)" seqs.fa

but I am getting an error mentioned below:

[ERRO] fastx: stdin not detected

I am new to the command line approach.Any help would be appreciated.

Thanks in advance

Sohini

ADD REPLY
0
Entering edit mode

For you

seqkit grep -n -f ids.txt seqs.fa -o result.fa

[ERRO] fastx: stdin not detected means no input file provided and stdin not detected.

ADD REPLY
1
Entering edit mode
7.5 years ago
Buffo ★ 2.4k

This is my python script to do that, save it as sequence_extractor.py and run it :)

#!/usr/bin/env python
#-*- coding: UTF-8 -*-

from __future__ import division
import sys


##########################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python sequence_extractor.py *list.txt fasta_file.fasta
">" has to be included in the names 
------------------------------------------------------------------------------------
'''
##########################################################################################

if len(sys.argv) != 3:
    print syntax
    sys.exit()

##########################################################################################

nombres = []
seq = ""


lista_seq = open(sys.argv[1], 'r')

for line in lista_seq:
    line = line.rstrip('\n') 
    nombres.append(line)
lista_seq.close()

fasta_seqs = open(sys.argv[2], 'r')

for line in fasta_seqs:
    line = line.rstrip('\n')
    if line.startswith('>'):
        if seq:
            if name in nombres:           
                print name + '\n' + seq
                seq = ""
        name = line                         #to exclude '>'; line.split()[0]

    else:
        seq = seq + line 


if name in nombres:
    print name + '\n' + seq
ADD COMMENT
0
Entering edit mode

it gives an error SyntaxError: Missing parentheses in call to 'print' at line 15

ADD REPLY
1
Entering edit mode

this script needs python2, you ran using py3.

ADD REPLY

Login before adding your answer.

Traffic: 2077 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6