* Thanks for your answers, my dear colleagues, it seems that I can't click that reply button *
I have two sets of FASTA sequences: two genes from each of 9 species. I put the sequences of the 9 species for one gene into one folder, and those for the other gene into another folder. Now I want to concatenate the two genes for each species, but the first line of each FASTA file looks like:
>HM357896.1 Persicaria lapathifolia voucher CPU:X. H. Meng 0945 ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL)
or
>JF953049.1 Acorus calamus voucher WH1 maturase K (matK) gene, partial cds; chloroplast
I think a regular expression would be useful here, but how? Thank you.
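As a rough illustration of how a regular expression could pull the useful parts out of headers like the two above, here is a minimal Python sketch. It assumes GenBank-style headers in which the accession comes first, the next two words are the genus and species epithet, and the gene symbol appears in parentheses. The names parse_header, HEADER_RE and GENE_RE are made up for this example, and headers that deviate from this layout (subspecies, hybrids, no parenthesised gene symbol) would need a different pattern.

```python
import re

# Assumption: ">ACCESSION Genus epithet ..." at the start of the header.
HEADER_RE = re.compile(
    r"^>(?P<accession>\S+)\s+(?P<genus>[A-Z][a-z]+)\s+(?P<epithet>[a-z-]+)"
)
# Assumption: the gene symbol is the first alphanumeric token in parentheses.
GENE_RE = re.compile(r"\((?P<gene>[A-Za-z0-9]+)\)")


def parse_header(line):
    """Return accession, species and gene symbol from one FASTA header line."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError(f"unexpected header layout: {line!r}")
    gene = GENE_RE.search(line)
    return {
        "accession": match.group("accession"),
        "species": f"{match.group('genus')} {match.group('epithet')}",
        "gene": gene.group("gene") if gene else None,
    }


rbcl_header = (
    ">HM357896.1 Persicaria lapathifolia voucher CPU:X. H. Meng 0945 "
    "ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL)"
)
matk_header = (
    ">JF953049.1 Acorus calamus voucher WH1 maturase K (matK) gene, "
    "partial cds; chloroplast"
)

for header in (rbcl_header, matk_header):
    print(parse_header(header))
```

The species string extracted this way can then serve as the key for matching records between the two genes.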
UPDATE:: Sorry about my misleading description. To be specific: say I have five species (A, B, C, D, E) and two genes, rbcL and matK. For each species I have two sequences, one rbcL and one matK, so I have 10 sequences in total (5 x 2). I then combine all rbcL sequences (of the five species) into one FASTA file, say all_rbcL.fasta, and do the same for the matK sequences to make all_matK.fasta. However, the header lines of these sequences are messy: they do contain the species name and gene name, but along with a lot of other information.
How can I concatenate the two genes so that the species names match each other?
UPDATE2:: (How could I enter code blocks?)
all_rbcL.fasta:
>sp1 rbcL
sequence
>sp2 rbcL
sequence
>sp3 rbcL
sequence
>sp4 rbcL
sequence
>sp5 rbcL
sequence
all_matK.fasta:
>sp1 matK
sequence
>sp2 matK
sequence
>sp3 matK
sequence
>sp4 matK
sequence
>sp5 matK
sequence
I mean something like this, and what I expected is:
concatenated.fasta:
>sp1 matK rbcL
sequence sequence
>sp2 matK rbcL
sequence sequence
>sp3 matK rbcL
sequence sequence
>sp4 matK rbcL
sequence sequence
>sp5 matK rbcL
sequence sequence
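A minimal Python sketch of the matching step, using the simplified headers from the example above. It assumes the first word after ">" is the species label and that it is spelled identically in both files. read_fasta and concatenate are hypothetical helper names, not part of any library. Unlike the mock-up above, the two sequences are joined without a space, and species missing from either file are skipped.

```python
def read_fasta(text):
    """Parse FASTA text into {species: sequence}.

    Assumption: the first word of each header is the species label.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            # Multi-line sequences are joined into one string.
            records[name] += line
    return records


def concatenate(first, second, first_gene, second_gene):
    """Join the sequences of species present in both dicts, in the order of `first`."""
    lines = []
    for species, sequence in first.items():
        if species not in second:
            continue  # no partner sequence, so skip this species
        lines.append(f">{species} {first_gene} {second_gene}")
        lines.append(sequence + second[species])
    return "\n".join(lines) + "\n"


# Toy data; the record order differs between the files on purpose.
matk_text = ">sp1 matK\nATG\n>sp2 matK\nCCC\n"
rbcl_text = ">sp2 rbcL\nGGG\n>sp1 rbcL\nTTT\n"

result = concatenate(read_fasta(matk_text), read_fasta(rbcl_text), "matK", "rbcL")
print(result, end="")
```

Because the records are looked up by species label rather than by position, the two input files do not need to list the species in the same order.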
These two genes are from the chloroplast. I am doing this to build a phylogenetic tree of those 9 species; is that impossible or improper? I consulted a professor and he told me it is OK, but I would like to hear your opinions. Thank you.
Please be specific about your requirement.
Concatenation (joining to form a single file) can be done with a simple cat:
cat file1.fa file2.fa .. file9.fa > final_gene1.fa
If you want to take the actual sequences into account and make a non-redundant representation, that would be more complicated.
You've used an example with two headers here - are they the same gene? Or are they the same species? How are they relevant to your question?
That is still not very clear. You want something like this
"Concatenation" usually refers to adding two lines underneath each other. From your example it seems as if you want both concatenation and pasting (= adding two columns next to each other).
I assume that your files are labelled somewhat systematically, so pasting the sequences of the same gene for different species next to each other should be trivial, e.g.:
You could do this for every gene, e.g. by using a for-loop over the gene names, which are hopefully part of the FASTA file names.
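The original command for this step did not survive here, so the following is only a Python sketch of what "pasting" means: line i of one input is joined to line i of the other with a tab, much as the Unix paste utility does. It assumes one line per sequence and the same record order in both inputs. paste_lines is a made-up helper name.

```python
from itertools import zip_longest


def paste_lines(*columns, sep="\t"):
    """Join the i-th line of every input side by side, separated by `sep`.

    Shorter inputs are padded with empty strings.
    """
    rows = zip_longest(*columns, fillvalue="")
    return [sep.join(row) for row in rows]


# Toy inputs: one header line and one sequence line per record.
matk_lines = [">sp1 matK", "ATG", ">sp2 matK", "CCC"]
rbcl_lines = [">sp1 rbcL", "TTT", ">sp2 rbcL", "GGG"]

pasted = paste_lines(matk_lines, rbcl_lines)
for line in pasted:
    print(line)
```

If the records are not in the same order in both files, a purely positional paste silently pairs the wrong species, so the headers should be checked afterwards.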
Now, I assume that the resulting header is what you wanted the regex help for?
For my butchered example above, this could, for example, be solved this way:
This particular (largely untested) regex expects: a first pattern in ( ), followed by a tab [\t], followed by the second pattern in ( ) [=first gene name], followed by a tab [\t], followed by the third pattern [the second gene name].
You've explained your question well now, but I have to ask you - why do you want to do this? What is the ultimate aim? This seems like a convoluted procedure that has the end result of loss of meaningful information.
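Going back to the tab-separated header layout described in the regex comment above: the regex itself is not preserved in this thread, so here is a hedged Python sketch of one pattern that fits that description. It assumes a header line of exactly three tab-separated fields, and that the first field is the species label (a guess, since only the two gene-name fields are described). merge_header and PASTED_HEADER_RE are illustrative names.

```python
import re

# Three capture groups separated by tabs:
#   group 1: ">" plus the species label (assumed)
#   group 2: first gene name
#   group 3: second gene name
PASTED_HEADER_RE = re.compile(r"^(>[^\t]+)\t([^\t]+)\t([^\t]+)$")


def merge_header(line):
    """Rewrite 'species<TAB>gene1<TAB>gene2' as 'species gene1 gene2'.

    Lines that do not match the expected layout are returned unchanged.
    """
    return PASTED_HEADER_RE.sub(r"\1 \2 \3", line)


print(merge_header(">sp1\tmatK\trbcL"))
print(merge_header("ATGTTT"))
```

Sequence lines contain no tabs in this layout, so they pass through untouched and the function can be applied to every line of the file.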