I'm looking for help on how to extract uppercase/lowercase information from a multi-FASTA file. I have been using the 'SNPable regions' method to mask a non-reference genome (http://lh3lh3.users.sourceforge.net/snpable.shtml), so at present I have my information in the following format:
I want to compare this to another masking approach and need to reformat it. For a simple comparison, I'd like it formatted as contig / position / + for uppercase or - for lowercase, i.e.:
A quick and simple solution would be to use awk on the FASTA file: split the sequence string into an array of characters and check with a regex whether each character is lowercase or not:
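The same per-character check can also be sketched in Python. This is only an illustration of the idea, not the awk one-liner from this answer; the function name `case_mask`, the 1-based positions and the tab-separated output are assumptions, and it expects the whole sequence of a contig as a single string.

```python
def case_mask(contig, sequence):
    """Yield (contig, position, sign) for every base in the sequence.

    Positions are 1-based (an assumption). Lowercase bases get "-",
    everything else, including uppercase N, gets "+".
    """
    for position, base in enumerate(sequence, start=1):
        sign = "-" if base.islower() else "+"
        yield contig, position, sign


# Hypothetical contig name and sequence, just to show the output layout.
for row in case_mask("Contig1", "ACgtN"):
    print(*row, sep="\t")
```

A generator keeps memory use flat even for long contigs, since one row is produced at a time.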
Ahh, you are correct with regard to the structure of the FASTA file. I oversimplified it a bit too much. In truth it is a multi-FASTA file with a line break every 60 nt, so your quick fix only works for the first sequence line following >Contig. It is a very standard file like so:
Search for "linearize fasta" and pipe the output of that into michael.ante's solution.
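For reference, "linearizing" just means joining the wrapped sequence lines of each record into one line per contig. A minimal Python sketch of that step follows; the function name `linearize_fasta` and the choice to keep only the first word of the header as the contig name are assumptions.

```python
import io


def linearize_fasta(handle):
    """Yield (name, sequence) pairs with wrapped sequence lines joined.

    The name is the first whitespace-separated word after ">" (an
    assumption). Letter case is preserved so masking information is kept.
    """
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line.strip())
    # Emit the last record once the input is exhausted.
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-record input with wrapped lines.
example = io.StringIO(">Contig1 some description\nACgt\nnnAC\n>Contig2\nggg\n")
for name, sequence in linearize_fasta(example):
    print(name, sequence, sep="\t")
```

Each linearized record can then be passed to a per-base case check, so positions keep counting across what used to be separate 60 nt lines.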
Correct, I'd go for the FASTA Formatter (fasta_formatter) from the FASTX-Toolkit.
Thank you for the help