I have many bacterial RefSeq FASTA files and want to parse their headers to see whether the word 'chromosome' appears in any of them. As a side note, the files can contain several sequences, each starting with '>', so I want to parse all the lines starting with '>'. I know that some of my files do not contain the word 'chromosome' at all. I would like to separate the files that have 'chromosome' in a header from the rest of the files. Is there a way to do so?
Any help would be appreciated.
If you want to get faster/better/more accurate answers it would really help if you show some examples of your data, and how these have to be "parsed".
I’m not sure what exactly the aim of filtering the genomes by the word 'chromosome' is.
To my knowledge, the word 'chromosome' in the header doesn’t tell you anything about that assembly specifically.
Assuming that the FASTA is linearized (i.e. each sequence is on a single line, after its header):
should give you all the fasta sequences with no chromosome in header.
should give you all the fasta sequences with chromosome in header.
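If a script is easier to adapt than a one-liner, here is a rough Python sketch of both ideas from this thread: splitting the records inside one file by header, and sorting whole files into two folders as the question asks. It is only a sketch under some assumptions: the match is case-insensitive, the files end in `.fasta`, and the function and folder names (`split_records`, `sort_files`, `hit_dir`, `miss_dir`) are made up for illustration. Unlike the linearized approach above, it should also cope with sequences wrapped over several lines.

```python
import shutil
from pathlib import Path

KEYWORD = "chromosome"  # assumption: matching is case-insensitive


def header_matches(line, keyword=KEYWORD):
    """True if line is a FASTA header (starts with '>') containing keyword."""
    return line.startswith(">") and keyword.lower() in line.lower()


def file_has_keyword(path, keyword=KEYWORD):
    """True if any header line in the FASTA file at path contains keyword."""
    with open(path) as handle:
        return any(header_matches(line, keyword) for line in handle)


def split_records(lines, keyword=KEYWORD):
    """Split FASTA lines into (with_keyword, without_keyword) lists.

    Sequence lines are kept with the most recent header, so records
    wrapped over several lines stay intact.
    """
    with_kw, without_kw = [], []
    target = without_kw  # lines before the first header land here
    for line in lines:
        if line.startswith(">"):
            target = with_kw if header_matches(line, keyword) else without_kw
        target.append(line)
    return with_kw, without_kw


def sort_files(src_dir, hit_dir, miss_dir, pattern="*.fasta"):
    """Copy each file matching pattern in src_dir into hit_dir or miss_dir."""
    for directory in (hit_dir, miss_dir):
        Path(directory).mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob(pattern)):
        dest = hit_dir if file_has_keyword(path) else miss_dir
        shutil.copy(path, Path(dest) / path.name)
```

Something like `sort_files("genomes", "with_chromosome", "without_chromosome")` would then copy each file into one of the two folders (the folder names are placeholders). The files are copied rather than moved, so the originals stay where they are until you have checked the result.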