I am looking for a command or script with which I can remove multiple sequences from a FASTA file using their gene IDs.
There is a very nice, fast and convenient program called faSomeRecords that can do what you are asking. In fact, given a list file, it can do two things: either extract the FASTA records named in the list, or discard them. It runs in a Linux environment, though.
Just run faSomeRecords without arguments to see its options:
linux@ARFLinux:~$ faSomeRecords
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
-exclude - output sequences not in the list file.
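To make the usage message above concrete, here is a minimal sketch of a removal run. The file names (in.fa, remove_ids.txt, out.fa) and the gene IDs are made-up placeholders, and it assumes the list file holds one record name per line, without the leading ">". The guard only skips the call when the tool is not installed.

```shell
# Hypothetical sample data; file names and IDs are placeholders.
printf '>gene1 first\nACGT\n>gene2 second\nGGCC\n>gene3 third\nTTAA\n' > in.fa
printf 'gene1\ngene3\n' > remove_ids.txt

# With -exclude, records NOT named in the list file are written to out.fa.
if command -v faSomeRecords >/dev/null 2>&1; then
    faSomeRecords -exclude in.fa remove_ids.txt out.fa
    grep '^>' out.fa    # show the headers that were kept
else
    echo "faSomeRecords not found on PATH" >&2
fi
```

Dropping -exclude from the same command should give the opposite result, that is, only the listed records.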
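If you store the gene names in a variable, you can loop over them and pull out the matching records with awk (note that this extracts the listed genes rather than removing them):
id='gene1 gene2'
for gene in $id
do
    # on each header line, set the flag if the header matches the gene; print lines while the flag is set
    awk -v gene="$gene" '/^>/{flag = ($0 ~ gene)} flag' file.fasta >> outfile.fasta
done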
Edit: for a list of genes in a file, use:
id=$(cat genelist)
instead of the manual id assignment above.
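The loop above pulls the listed records out. For the removal the question asks about, the inverse can be sketched in a single awk pass. This is only a sketch under assumptions: the sample records and the output name filtered.fasta are invented for illustration, the gene ID is taken to be the first word after ">" in each header, and genelist must not be empty.

```shell
# Hypothetical sample data; replace with your own files.
printf '>gene1 first\nACGT\nTTGA\n>gene2 second\nGGCC\n>gene10 third\nTTAA\n' > file.fasta
printf 'gene1\n' > genelist

awk '
    NR == FNR { drop[$1]; next }                       # first file: remember the IDs to remove
    /^>/      { id = substr($1, 2); keep = !(id in drop) }  # header: compare the exact ID
    keep                                               # print lines of records that are kept
' genelist file.fasta > filtered.fasta

grep '^>' filtered.fasta    # headers that remain
```

Because the ID is compared exactly instead of as a pattern, listing gene1 does not also remove gene10.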
Please add more information about what your data looks like. I can only guess that the gene ID is part of the FASTA description or identifier.
In addition, if you show what you tried (and what didn't work), people will be more eager to help you. Show some effort from your side too.