Suppose I have a PHYLIP alignment like the one below (I should mention that it is interleaved):
KM894618.1_Abutilon_oxycarpum_voucher_1076420545_maturase_K_(matK)_gene_partial_cds_chloroplast --------------------------------TCTTTGCATTTATTACGGTTCTCTCTCT
KU508975.1_Acalypha_australis_maturase_K_(matK)_gene_partial_cds_chloroplast AAATTCTTCGATATTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTC
KC747175.1_Achyranthes_bidentata_bio-material_USDA AAACTCTCCGATACTGGTTGAAAGATGCTTCTTCTTTGCATTTATTACGATTCTTTCTTT
KF632783.1_Acorus_calamus_voucher_C998_maturase_K_(matK)_gene_partial_cds_chloroplast AAGTTCTGCAAGGCTGGATACAAGATGTTCCGTCTTTACATTTATTGCGGTTCTTTCTCC
JQ587494.1_Aeschynomene_americana_voucher_BioBot11660_maturase_K_(matK)_gene_partial_cds_chloroplast ------------------------------------------------------------
KR735146.1_Ageratum_conyzoides_maturase_K_(matK)_gene_partial_cds_chloroplast ------------------------------------------------------------
GU135030.1_Alternanthera_philoxeroides_voucher_J.R._Abbott_24898_(FLAS)_maturase_K_(matK)_gene_partial_cds_chloroplast AAACTCTCCGATACTGGTTGAAAGATGCTTCTTCTTTGCATTTATTACGATTCTTTCTTT
JF953164.1_Amaranthus_tricolor_voucher_Z31_maturase_K_(matK)_gene_partial_cds_chloroplast ------------------------------------------------GATACTTTCTTT
HM989726.1_Artemisia_argyi_voucher_PS0590MT04_maturase_K_(matK)_gene_partial_cds_chloroplast AGGCTCTTCGCTATTGGATAAAAGATGCTTCCTCTTTGCATTTATTAAGATTCTTTCTCC
KF163819.1_Arthraxon_hispidus_voucher_HCCN-PJ008548-PB-280_maturase_K_(matK)_gene_partial_cds_chloroplast -----------------------GATGTTCCGTCTTTGCMTTTATTGCGATTCWTTCTCC
MG225316.1_Aster_alpinus_voucher_BAB-2621_maturase_K_(matK)_gene_partial_cds_chloroplast -----------------------------TCCTCTTTGCATTTATTAAGATTCTTTCTCC
MF063987.1_Bassia_scoparia_voucher_20160248_maturase_K_(matK)_gene_partial_cds_chloroplast -----------------------------------------------CGATTCTTTCTTT
JQ412229.1_Cynodon_dactylon_voucher_BS0132_maturase_K_(matK)_gene_partial_cds_chloroplast ---------------------------------------------------TCTTTCTCA
JN895697.1_Bidens_tripartita_isolate_NMW088_maturase_K_(matK)_gene_partial_cds_chloroplast ---------------------------CTTCCTCTTTGCATTTATTAAGATTCTTTCTCC
MF350103.1_Boehmeria_nivea_isolate_AD5JT02_maturase_K_(matK)_gene_partial_cds_chloroplast ----------------GGTAAAAGACGCCTCCTCTTTGTATTTATTAAGACTTTTTCTTT
How can I trim these headers so that only the species names remain?
Fancier answers will be forthcoming but this should work. Use your real filename. Save code in a file.
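A minimal sketch of such a script, assuming every header follows the ACCESSION_Genus_species_... layout seen in the question and that the name is separated from the sequence by whitespace. The file name alignment.phy and the function names are placeholders. Lines that do not look like "name sequence" (blank lines, name-less continuation blocks of an interleaved file) are passed through unchanged, so they should not trigger an index error.

```python
import os
import sys

ALIGNMENT_FILE = "alignment.phy"  # placeholder: use your real filename


def species_name(header):
    """Return 'Genus_species' from 'ACCESSION_Genus_species_...' (assumed layout)."""
    parts = header.split("_")
    if len(parts) < 3:
        return header  # not the expected layout, leave it untouched
    return "_".join(parts[1:3])


def shorten_line(line):
    """Shorten the name on a 'name sequence' line; pass other lines through."""
    fields = line.split()
    if len(fields) != 2:
        # Blank line or sequence-only line: nothing to rename.
        return line.rstrip("\r\n")
    name, seq = fields
    return f"{species_name(name)} {seq}"


def main(path):
    with open(path) as handle:
        for line in handle:
            print(shorten_line(line))


if __name__ == "__main__":
    if os.path.exists(ALIGNMENT_FILE):
        main(ALIGNMENT_FILE)
    else:
        print(f"{ALIGNMENT_FILE} not found", file=sys.stderr)
```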
python3 script.py > newfile
Should produce
Perhaps it cannot handle an interleaved alignment; this script gave me a "list index out of range" error.
You had extra carriage returns between all lines (I am not sure if your original file has them or they were introduced when you copy/pasted the data). I took those out in my copy (and removed them from the post above as well). Try the example above again.
Thanks, I tried the script in Windows PowerShell. Now I'm going to install PyCharm to dive deeper into Python.
Strict PHYLIP format limits names to 10 characters (padded with spaces to exactly 10 columns), and the inconsistent spacing that shortening the headers produces will throw errors; you'd need to pad the deleted space with whitespace to restore the alignment.
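The padding step could look roughly like this. It is only a sketch: pad_names is a made-up helper, it expects just the name-bearing rows of the first block (not the count line), and it pads to a common width rather than to strict PHYLIP's 10 columns, since names such as Abutilon_oxycarpum are longer than 10 characters and truncating them risks duplicate names.

```python
def pad_names(lines, gap=2):
    """Left-justify names to a common width so all sequences start in one column."""
    rows = [line.split(None, 1) for line in lines]
    # Width of the longest name plus a small gap before the sequence.
    width = max((len(r[0]) for r in rows if len(r) == 2), default=0) + gap
    padded = []
    for row, line in zip(rows, lines):
        if len(row) == 2:
            name, seq = row
            padded.append(name.ljust(width) + seq.strip())
        else:
            # Blank or sequence-only line: keep as it is.
            padded.append(line.rstrip("\r\n"))
    return padded


if __name__ == "__main__":
    for row in pad_names(["Acorus_calamus AAGT", "Aster_alpinus ----"]):
        print(row)
```

Whether the result is accepted depends on the downstream program: many tools read this relaxed layout, while strict parsers still expect 10-column names.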
@OP, if you have the original sequences, it will be much easier to edit the headers in the sequence file and then re-align.
Use sed with a file of patterns (option -f).
I'm not familiar with this command, could you please explain more?
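For what it's worth, sed -f FILE reads its editing commands from FILE instead of from the command line, so the file can hold one substitution per header. A rough sketch of generating such a file with Python follows; the helper names and the file names rename.sed and alignment.phy are invented for illustration, and the old/new name pairs are assumed to be known already.

```python
import re


def bre_escape(text):
    """Escape characters that are special in a basic regex or clash with the / delimiter."""
    return re.sub(r"([\\.*\[^$/])", r"\\\1", text)


def sed_command(old, new):
    """Build one 's/^OLD /NEW /' command that renames a header at the start of a line."""
    new_escaped = re.sub(r"([\\&/])", r"\\\1", new)
    return f"s/^{bre_escape(old)} /{new_escaped} /"


def build_sed_script(pairs):
    """Join one substitution per (old, new) pair into the text of a sed script."""
    return "\n".join(sed_command(old, new) for old, new in pairs) + "\n"


if __name__ == "__main__":
    pairs = [
        ("KC747175.1_Achyranthes_bidentata_bio-material_USDA", "Achyranthes_bidentata"),
    ]
    # Write the commands, then run: sed -f rename.sed alignment.phy > newfile
    with open("rename.sed", "w") as handle:
        handle.write(build_sed_script(pairs))
```

The trailing space in each pattern keeps a short header from matching the beginning of a longer one.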