Hello, I downloaded more than 10k CDS sequences from NCBI, and the headers look like this:
lcl|DS180867.1_cds_EDM56551.1_2 [protein=hemolysin] [protein_id=EDM56551.1] [location=complement(<1..1012)]
lcl|DS180866.1_cds_EDM56342.1_3 [protein=phosphogluconate dehydratase] [protein_id=EDM56342.1] [location=complement(279..>893)]
lcl|DS180865.1_cds_EDM56120.1_4 [protein=3-hydroxyisobutyrate dehydrogenase] [frame=3] [protein_id=EDM56120.1] [location=complement(162..>559)]
lcl|DS180863.1_cds_EDM56465.1_5 [protein=hemolysin] [protein_id=EDM56465.1] [location=218..>977]
lcl|DS180862.1_cds_EDM56350.1_6 [protein=phosphogluconate dehydratase] [protein_id=EDM56350.1] [location=complement(<1..857)]
I want to extract all sequences that share the same protein name in the header (for example, hemolysin) into a new FASTA file, one file per protein name.
Please suggest a tool or program for this.
Have you tried anything?
Yes, I tried with AWK, but it's not working properly.
Can you post your awk code, so that people here can fix the problem in it? I feel that this would be a good idea instead of directly getting a working solution.
awk -F "|" '/^>/ {F = $3".fasta"} {print > F}' input.fasta > out.fasta
How do you plan to define similarity based on the header? By the [protein=...] tag?
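If grouping by the [protein=...] tag is what is wanted, note that the awk attempt splits each header on "|", which yields only two fields for these headers, so $3 is empty and every record ends up in a single file named ".fasta". Below is a minimal Python sketch of one way to do the grouping instead. It assumes that header lines in the real file start with ">", that every record of interest carries a [protein=...] tag, and that the input fits in memory. The function names (group_by_protein, safe_filename, write_groups) and the "unknown" fallback group are made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches the [protein=...] tag in an NCBI CDS FASTA header.
PROTEIN_TAG = re.compile(r"\[protein=([^\]]+)\]")


def group_by_protein(lines):
    """Group FASTA lines (header + sequence) by their [protein=...] value."""
    groups = defaultdict(list)
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            match = PROTEIN_TAG.search(line)
            # Assumption: records without a protein tag go to an "unknown" group.
            current = match.group(1) if match else "unknown"
        # Skip blank lines and anything that appears before the first header.
        if current is not None and line:
            groups[current].append(line)
    return groups


def safe_filename(protein):
    """Turn a protein name into a file name; spaces, slashes etc. become '_'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", protein).strip("_") + ".fasta"


def write_groups(groups, out_dir):
    """Write one FASTA file per protein name into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for protein, records in groups.items():
        # Caveat: two names that sanitize to the same file name would
        # overwrite each other here.
        path = out_dir / safe_filename(protein)
        path.write_text("\n".join(records) + "\n")
```

It could be called with something like write_groups(group_by_protein(open("input.fasta")), "by_protein"), which would produce files such as by_protein/hemolysin.fasta and by_protein/phosphogluconate_dehydratase.fasta. Matching here is on the exact protein name, so differently spelled names for the same protein would land in separate files.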