I'm looking for a way to create a text file containing some information about sequence reads, extracted from a .fasta file, using grep, sed or awk.
I have several FASTA sequences that I have trimmed. Here is an example of a header from a trimmed FASTA file, which carries both the original and the trimmed length:
>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060
So the information I want from this header is:
read name: ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e
original read length: 3826
trimmed length: 0-1060
So far I've done this part:
grep -o -E "^>\w+|.read=\w+|.trim=\w+" test.fasta
Which yields the output
>ca51a0fa
read=3826
trim=0
What I'm looking for would be either this:
>ca51a0fa
read=3826
trim=0-1060
Or this
>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e
read=3826
trim=0-1060
And I can't really get it to work. Would any of you have a suggestion? Thanks.
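On the grep attempt: \w matches only letters, digits and underscore, so \w+ stops at the first hyphen in both the read name and the trim range. A minimal sketch of one possible fix is to match any run of non-space characters with [^ ]+ instead. The scratch file and the variable names tmp and grep_out are only there for illustration, and the sketch assumes the header fields are separated by single spaces as in the example above.

```shell
# Write the example header from the question (plus a dummy sequence line)
# to a scratch file; "tmp" is a hypothetical name used only for this sketch.
tmp=$(mktemp)
printf '%s\n' \
  '>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060' \
  'ACGT' > "$tmp"

# [^ ]+ matches every character up to the next space, so hyphens are kept.
# -o prints each match on its own line, in the order it appears in the header.
grep_out=$(grep -o -E '^>[^ ]+|read=[^ ]+|trim=[^ ]+' "$tmp")
printf '%s\n' "$grep_out"

rm -f "$tmp"
```

For the example header this should print the full read name, read=3826 and trim=0-1060 on three separate lines.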
Why not use awk, delimit on space and then print the fields you need?
Because I didn't think of that. All of the examples I could find for handling FASTA headers used grep, so I thought I might as well stay with grep. Well, that worked perfectly, thanks.
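For anyone finding this later, here is a rough sketch of the awk idea, not necessarily the exact command that was used. awk splits on whitespace by default, so the read name is field 1. Rather than relying on fixed positions such as $3 and $7, this version scans the remaining fields for the read= and trim= keys, which assumes only that those keys appear somewhere in the header. The scratch file and the names tmp and awk_out are hypothetical.

```shell
# Write the example header from the question (plus a dummy sequence line)
# to a scratch file; "tmp" is a hypothetical name used only for this sketch.
tmp=$(mktemp)
printf '%s\n' \
  '>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060' \
  'ACGT' > "$tmp"

# Only header lines (starting with ">") are processed.
# $1 is the read name; the loop prints any field that starts with read= or trim=.
awk_out=$(awk '/^>/ {
    print $1
    for (i = 2; i <= NF; i++)
        if ($i ~ /^(read|trim)=/) print $i
}' "$tmp")
printf '%s\n' "$awk_out"

rm -f "$tmp"
```

Redirecting the awk output to a file (for example with > followed by a file name of your choice) gives the text file asked for in the question.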
Thanks for your suggestions for both options.
SEDA (https://www.sing-group.org/seda/) has an operation for processing FASTA headers that can do this kind of thing. It is called 'Rename header' (https://www.sing-group.org/seda/manual/operations.html#rename-header) and may be useful to you. You do not even need to install SEDA: you can use the Docker image with the latest version, available at Docker Hub (https://hub.docker.com/r/pegi3s/seda/). Regards!
It looks really useful. Thanks!