I have a large fasta header formatted file (input.txt) like:
>NC_23689
#
# XYZ
# Copyright (c) BLASC
#
# Predicted binding regions
# No. Start End Length
# 1 1 25 25
# 2 39 47 9
#
>68469409
#
# XYZ
# Copyright (c) BLASC
#
# Predicted binding regions
# None.
#
# Prediction profile output:
# Columns:
# 1 - Amino acid number
# 2 - One letter code
# 3 - probability value
# 4 - output
#
1 M 0.1325 0
2 S 0.1341 0
3 S 0.1384 0
>68464675
#
# XYZ
# Copyright (c) BLASC
#
# Predicted binding regions
# No. Start End Length
# 1 13 24 12
# 2 31 53 23
# 3 81 95 15
# 4 115 164 50
#
...
...
I want to extract each fasta header and its corresponding start-end value(s) (listed after the "Predicted binding regions" line) into an output file (output.txt). For the above input.txt, the output would be:
NC_23689: 1-25, 39-47
68464675: 13-24, 31-53, 81-95, 115-164
I have tried some regular Perl and Python scripts, but it does not look that straightforward to me. What is the best way to get these lines out of the input file for every header?
Thanks
Interesting way of describing the file (fasta header formatted). A fasta formatted file is expected to have sequence data in it (protein or DNA). This has neither.