I am using this Biopython script from this post (the first answer, written by Eric). The post is very old, so I am opening a new one. The script takes IDs from a .txt file and extracts the corresponding sequences from a FASTA file. My problem is that the IDs I want to extract are in the header description, whereas the script only looks at the first word after the > sign. How do I change it so that it looks for my IDs in the header description? I tried changing it myself after reading the comments, but I think I am doing it wrong. My .txt ID file looks like this:
TRIAE_CS42_1AL_TGACv1_000062_AA0001
TRIAE_CS42_1AL_TGACv1_000089_AA0002
TRIAE_CS42_1AL_TGACv1_000099_AA0003
TRIAE_CS42_1AL_TGACv1_000110_AA0004
TRIAE_CS42_1AL_TGACv1_000140_AA0005
The header in the fasta file looks like this:
>TRIAE_CS42_U_TGACv1_641895_AA2106830.1 pep scaffold:TGACv1:TGACv1_scaffold_641895_U:99996:109837:1 gene:TRIAE_CS42_U_TGACv1_000110_AA0004 transcript:TRIAE_CS42_U_TGACv1_641895_AA2106830.1 gene_biotype:protein_coding transcript_biotype:protein_coding
So, is the list made up of gene names?
The list contains gene IDs, and the FASTA file has protein sequences with the gene ID written in the header description.
The above solution should work.
This worked very well. It was so easy. Can you explain what "gene:([^ ]+)" means? In the tool help I found this line:
--id-regexp string regular expression for parsing ID (default "^([^\s]+)\s?")
What do the symbols mean?
Test it using a regular expression tester page.
Thanks, this was really important for me to see.
It's a regular expression for matching "gene:xxxxx". [^ ]+ matches a gene ID consisting of one or more non-space characters, and seqkit needs the parentheses () to capture the xxxxx as the FASTA ID.
ashish : I moved @shenwei356's comment to an answer. Since it worked for you, please accept the answer (use the green check mark) to provide closure for this question.
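For anyone who wants to try the two patterns locally instead of in an online tester, here is a minimal Python sketch. It is not the script from the original post and not how seqkit works internally; it only applies the same two regular expressions with Python's re module, which should treat these particular patterns the same way. The header is the one from the question; the sequences, the second record, and the helper names read_fasta and select_by_gene are made up for illustration.

```python
import re

# Default --id-regexp quoted from the tool help: capture the first run of
# non-whitespace characters at the start of the header.
DEFAULT_ID = re.compile(r"^([^\s]+)\s?")
# Pattern from the answer: capture everything after "gene:" up to the next space.
GENE_ID = re.compile(r"gene:([^ ]+)")

# Header copied from the question (without the leading ">").
HEADER = (
    "TRIAE_CS42_U_TGACv1_641895_AA2106830.1 pep "
    "scaffold:TGACv1:TGACv1_scaffold_641895_U:99996:109837:1 "
    "gene:TRIAE_CS42_U_TGACv1_000110_AA0004 "
    "transcript:TRIAE_CS42_U_TGACv1_641895_AA2106830.1 "
    "gene_biotype:protein_coding transcript_biotype:protein_coding"
)

# Toy FASTA content; the sequences and the second record are placeholders.
FASTA_LINES = [
    ">" + HEADER,
    "MKTAYIAK",
    ">other.1 pep gene:OTHER_GENE",
    "MSS",
]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def select_by_gene(lines, wanted):
    """Yield records whose gene: field is in the set of wanted IDs."""
    for header, seq in read_fasta(lines):
        match = GENE_ID.search(header)
        if match and match.group(1) in wanted:
            yield header, seq


# What each pattern captures from the same header.
print("default ID:", DEFAULT_ID.match(HEADER).group(1))
print("gene ID:   ", GENE_ID.search(HEADER).group(1))

# Keep only the records whose gene ID is in the wanted set.
wanted_ids = {"TRIAE_CS42_U_TGACv1_000110_AA0004"}
for header, seq in select_by_gene(FASTA_LINES, wanted_ids):
    print(">" + header)
    print(seq)
```

With a real file, the wanted set would be read from the .txt file (one ID per line) and FASTA_LINES replaced by the open FASTA file handle. The IDs in the list have to match the text after gene: exactly for a record to be selected.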