I have a .fa file that is the result of extracting gene sequences from the genome I am working with, using a RepeatMasker coordinate file. However, each entry looks like this:
>rnd-5_family-5445_Unspecified(-)
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG
I would like to change the title lines of all the entries to look something like:
>rnd-5_family-5445_KB824701.1_417_478
This includes the original name followed by the location from the RepeatMasker BED file. I was told that a data.frame in R would be useful for making this adjustment, but I am not really sure how to go about it. Any help?
Do you work in Linux? It seems that your problem is more easily solved with sed or another command-line tool than with an R data.frame. Sure, you can import those files into R as data.frames, but I can't see how that would make things easier. Could you provide more information about your data structure? For example, do you have two files with identifiers you'd like to combine, or something like that?
I am being asked to write a script for this, so to be honest I am not really sure. I just need a script that acts as the middle step between the bedtools output at the top and the new modified version.
Okay, can you provide a brief example of your data (i.e. 3-4 entries you'd like to combine) so I can help you build the script? I'm sure there's a way to do it in R, but as I said before, it's probably easier using Unix tools, if you have a computer with Linux/Mac at hand.
This is the bedtools output:
These are the RepeatMasker coordinates that the above were derived from:
KB824701.1 417 478 rnd-5_family-5445_Unspecified . -
KB824701.1 587 1072 rnd-5_family-2614_Unspecified . -
KB824701.1 914 1129 rnd-5_family-2614_Unspecified . -
KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -
Ideally, I would like it to be in this format, for example using the first one:
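Since the same repeat name can occur more than once in the BED file (rnd-5_family-2614_Unspecified appears twice above), matching headers to coordinates by name alone would be ambiguous. Here is a minimal sketch in Python that pairs the i-th FASTA header with the i-th BED line instead. It rests on several assumptions that are not confirmed in this thread: the FASTA records are in the same order as the BED lines, the header is the BED name plus a strand tag such as (-), and the last underscore-separated token of the name (here "Unspecified") is what should be dropped from the new title. The function names are made up for illustration.

```python
import re
import sys


def new_header(bed_line):
    """Build '>name_chrom_start_end' from one BED line (chrom start end name ...)."""
    chrom, start, end, name = bed_line.split()[:4]
    # Assumption: the last underscore-separated token (e.g. 'Unspecified')
    # is a class label that should not appear in the new title.
    base = name.rsplit("_", 1)[0]
    return f">{base}_{chrom}_{start}_{end}"


def rename_headers(fasta_lines, bed_lines):
    """Yield FASTA lines, rewriting the i-th header from the i-th BED record."""
    bed = [line for line in bed_lines if line.strip()]
    i = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            yield line  # sequence lines pass through unchanged
            continue
        if i >= len(bed):
            raise ValueError("more FASTA records than BED lines")
        bed_name = bed[i].split()[3]
        # Strip a trailing strand tag such as '(-)' before comparing names.
        old_name = re.sub(r"\([+-]\)$", "", line[1:].strip())
        if old_name != bed_name:
            raise ValueError(f"record {i + 1}: {old_name!r} != {bed_name!r}")
        yield new_header(bed[i])
        i += 1


# Only runs when called as: python script.py input.fa input.bed
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as bed:
        for out_line in rename_headers(fa, bed):
            print(out_line)
```

Saved under a hypothetical name such as rename_headers.py, it could be run as `python rename_headers.py repeats.fa repeats.bed > renamed.fa`, where both input file names are placeholders. The name check stops the script with an error if the two files fall out of step, rather than silently attaching the wrong coordinates to a sequence.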