Hey,
I have a multi-FASTA file like this:
>stuff1;[gene1];stuff,morestuff
ATGGAGATAATAGATAGC
>stuff1;[gene2];stuff,morestuff
ATGGAGATAATAGATAGC
>stuff2;[gene1];stuff,morestuff
GTACTACATCGCTAGCACTACT
>stuff2;[gene2];stuff,morestuff
GTAGTCATCAGCTACGACTACT
So each ID and its sequence are on separate lines. I want to extract, e.g., all IDs and their sequences with [gene1]: basically, search the IDs for a term and then write the matching IDs and sequences to a new FASTA file named after the search term. It is important that the complete ID is extracted, even though the search term is only a short part of it (in this case, [gene1]).
I tried awk and grep:
awk '/[gene1]/' RS='>' input.fasta > output.fasta
grep "[gene1]" input.fasta > output.fasta
But this just gave me all lines after [gene1] in both cases.
When searching for [gene1], I need a new multi-FASTA file like this:
>stuff1;[gene1];stuff,morestuff
ATGGAGATAATAGATAGC
>stuff2;[gene1];stuff,morestuff
GTACTACATCGCTAGCACTACT
Best Regards
What do you mean by "between each ID and sequence is a new line"?
Do you mean an empty line? If the sequences do not contain newline characters (i.e. are not folded), then you can try:
I meant that the header and the following sequence are on separate lines. There is no empty line between them. Would it be easier if they were on the same line?
OK, my problem was this: since I searched for my term with the brackets, "[gene1]", instead of "gene1", the pattern did not look for the literal text. It matched any line containing one of the characters g, e, n or 1.
Searching for "gene1" without the brackets is working.
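To make the difference visible, here is a small sketch that counts matching lines for the three ways of writing the pattern. The temporary directory and the variable names (tmp, regex_hits, fixed_hits, escaped_hits) are only for illustration, and the sample file is a shortened copy of the one in the question.

```shell
# Build a small sample file (two records from the question).
tmp=$(mktemp -d)
printf '%s\n' \
  '>stuff1;[gene1];stuff,morestuff' 'ATGGAGATAATAGATAGC' \
  '>stuff1;[gene2];stuff,morestuff' 'ATGGAGATAATAGATAGC' \
  > "$tmp/input.fasta"

# Unescaped: [gene1] is a bracket expression, i.e. "one of g, e, n, 1".
# Both headers contain such a character, so both headers are counted.
regex_hits=$(grep -c '[gene1]' "$tmp/input.fasta")

# -F treats the pattern as a fixed string, so only the literal "[gene1]" matches.
fixed_hits=$(grep -c -F '[gene1]' "$tmp/input.fasta")

# Escaping the brackets gives the same result as -F.
escaped_hits=$(grep -c '\[gene1\]' "$tmp/input.fasta")

echo "regex=$regex_hits fixed=$fixed_hits escaped=$escaped_hits"
```

On this sample the unescaped pattern hits both headers, while the fixed-string and escaped versions hit only the [gene1] header.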
You may be interested in using SEDA (http://www.sing-group.org/seda/), which has different functions for extracting and filtering sequence identifiers. Regards.
Patterns such as [gene1] are POSIX bracket expressions (https://www.regular-expressions.info/posixbrackets.html).
You want something like
grep -A 1 --no-group-separator -F '[gene1]' input.fasta > output.fasta
if there is only one sequence line after each header.
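If the sequences are folded over several lines, grep -A 1 would cut them off. A possible alternative is the awk sketch below. It is not from the answers above, just one way to do it: the flag name keep, the variables term, name and out, and the temporary directory are all made up for the example. index() does a literal substring search, so the brackets need no escaping, and the output file is named after the term with its surrounding brackets stripped.

```shell
# Sample input; the first record is folded over two sequence lines.
tmp=$(mktemp -d)
printf '%s\n' \
  '>stuff1;[gene1];stuff,morestuff' 'ATGGAGATAA' 'TAGATAGC' \
  '>stuff1;[gene2];stuff,morestuff' 'ATGGAGATAATAGATAGC' \
  '>stuff2;[gene1];stuff,morestuff' 'GTACTACATCGCTAGCACTACT' \
  > "$tmp/input.fasta"

term='[gene1]'

# Derive the output name from the term: drop a leading "[" and a trailing "]".
name=${term#\[}
name=${name%\]}
out="$tmp/$name.fasta"

# On each header line, set the flag if the header contains the literal term.
# The bare pattern "keep" then prints every line (header or sequence)
# for as long as the flag is set.
awk -v term="$term" '
  /^>/ { keep = (index($0, term) > 0) }
  keep
' "$tmp/input.fasta" > "$out"

cat "$out"
```

Note that awk -v interprets backslash escapes in the assigned value, so a term containing backslashes would need extra care.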