5 months ago
resug
Hi,
I am trying to linearize the nucleotide lines, so that each sequence sits on a single line instead of being wrapped every 60 nucleotides, while keeping the blank lines between entries. The axt file should go from this:
Lm_g1002.t1-Lp_g998.t1
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACT
ATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTAT
CTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACT
ATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTAT
CTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT
Lm_g1010.t1-Lp_g1011.t2
ATGTCGAAGGAATTCAACGTCCCTCCAGTAGTTTTCCCCTCTGGCGGAAACCCAGGCCCA
AACCCTAACCTCCACGACGACGCCGATCTTTCCGGCCCCGTCCTCTGCCTCATGATGTTC
GGCCTCTTCCAGCTCCTCGCCGGAAAGATCCACTTCGGTATCATCCTCGGTTGGGTAACC
GTTTCTGCGCTTTTTCTCTACGTTGTTTTCAATATGCTTGCTGGTCGTAACGGGCGGTGT
TGTCATTTCCGG------CGTCGCCGCGGTTTTTGTGATTTG------------------
------------------------------------------------------------
---GTC
ATGTCGAAGGAATTCAACGTCCCTCCAGTAGTTTTCCCCTCTGGCGGAAACCCAGGCCCA
AACCCTAACCTCCACGACGACGCCGATCTTTCCGGCCCCGTCCTCTGCCTCATGATGTTC
GGCCTCTTCCAGCTCCTCGCCGGAAAGATCCACTTCGGTATCATCCTCGGTTGGGTAACC
GTTTCTGCGCTTTTTCTCTACGTTGTTTTCAATATGCTTGCTGGTCGTAACGGTAATTTG
GATCTTTACCGGTGTGTTAGTTTGATCGGGTATTGTATGCTACCTATGGTGATTTTGTCC
CATCGTGGATTAGTTGCGTATGGTTGCTTTCTTATTTACACTTTGTTTTCGCTTCTTGTC
GTGTTT
Lm_g1018.t1-Lp_g1004.t1
ATGGCTTTGAATAGCAGTGTCAGATCTACTGCCAAGTTAATCGCTTCTTCTCATTCATCC
CATCAGGGATTGAAAATGTCCCTCGCGGTGTTCAGTGCTTTCAGCATTGGTGTTGCAGTT
CCTATCTATGCTGTCATTTTCCAGCAAAAGAAGACAGCTTCTGGT
ATGGCTTTGAATAGCAGTGTCAGATCTACTGCCAAGTTAATCGCTTCTTCTCATTCATCC
CATCAGGGATTGAAAATGTCCCTCGCGGTGTTCACTGCTTTCAGCATTGGTGTTGCAGTT
CCTATCTATGCTGTCATTTTCCAGCAAAAGAAGACAGCTTCTGTT
Lm_g100.t1-Lp_g97.t1
------ATGGATTTTCACACTCTTTCCAGAAAACAGCTCCAGAGTCTCTGCAAGAAGAAC
GATATTGAGAAAATGGATGACTTCTCTGATGTTACTGGCACAGCGTTAGCTTCTCTTGAG
GTTTCCACTGAATCTTCTGCAGACAAATGCAGCATGGATGCAGAAAATGACTTT------
------------------------------------------------------------
------------------------CTTGAGTCTTCACTGGTGGATGTAGTTGTACATGAA
AGCAAGACTACAAATGTTGTAAAGGAAGTGGAGAAGAAAAGAACTGCACTGCAGACACTG
CCAGAGAACATGATGGAGAAA
ATGACCATGGATTTTCACACTCTTTCCAGAAAACAGCTCCAGAGTCTCTGCAAGAAGAAC
GATATTGAGAAAATGGATGACTTCTCTGATGTTACTGGCACAGCGTTAGCTTCTCTTGAG
GTTTCCACTGAATCTTCTGCAGACAAATGCAGCATGGATGCAGAAAATGACTTTCTTGAG
ACATCTTTTGAGAAGGCAATGACAATTATGGACCTCTGTTCTGAATCCTTAGCAGCAGAC
AAAATGAATGCTGAAAATGCCACTCTCGAGTCTTCATTGGTCGATGTAGTTGTACATGAA
AGCAAGACTACAAATGTTGTAAAGGAAGTGGAGAAGAAAAGAACTGCACTGCAGACACTG
CCAGAGAACATGATGGAGAAA
To this:
Lm_g1002.t1-Lp_g998.t1
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACTATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTATCTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACTATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTATCTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT
Lm_g1010.t1-Lp_g1011.t2
ATGTCGAAGGAATTCAACGTCCCTCCAGTAGTTTTCCCCTCTGGCGGAAACCCAGGCCCAAACCCTAACCTCCACGACGACGCCGATCTTTCCGGCCCCGTCCTCTGCCTCATGATGTTCGGCCTCTTCCAGCTCCTCGCCGGAAAGATCCACTTCGGTATCATCCTCGGTTGGGTAACCGTTTCTGCGCTTTTTCTCTACGTTGTTTTCAATATGCTTGCTGGTCGTAACGGGCGGTGTTGTCATTTCCGG------CGTCGCCGCGGTTTTTGTGATTTG---------------------------------------------------------------------------------GTC
ATGTCGAAGGAATTCAACGTCCCTCCAGTAGTTTTCCCCTCTGGCGGAAACCCAGGCCCAAACCCTAACCTCCACGACGACGCCGATCTTTCCGGCCCCGTCCTCTGCCTCATGATGTTCGGCCTCTTCCAGCTCCTCGCCGGAAAGATCCACTTCGGTATCATCCTCGGTTGGGTAACCGTTTCTGCGCTTTTTCTCTACGTTGTTTTCAATATGCTTGCTGGTCGTAACGGTAATTTGGATCTTTACCGGTGTGTTAGTTTGATCGGGTATTGTATGCTACCTATGGTGATTTTGTCCCATCGTGGATTAGTTGCGTATGGTTGCTTTCTTATTTACACTTTGTTTTCGCTTCTTGTCGTGTTT
Lm_g1018.t1-Lp_g1004.t1
ATGGCTTTGAATAGCAGTGTCAGATCTACTGCCAAGTTAATCGCTTCTTCTCATTCATCCCATCAGGGATTGAAAATGTCCCTCGCGGTGTTCAGTGCTTTCAGCATTGGTGTTGCAGTTCCTATCTATGCTGTCATTTTCCAGCAAAAGAAGACAGCTTCTGGT
ATGGCTTTGAATAGCAGTGTCAGATCTACTGCCAAGTTAATCGCTTCTTCTCATTCATCCCATCAGGGATTGAAAATGTCCCTCGCGGTGTTCACTGCTTTCAGCATTGGTGTTGCAGTTCCTATCTATGCTGTCATTTTCCAGCAAAAGAAGACAGCTTCTGTT
Lm_g100.t1-Lp_g97.t1
------ATGGATTTTCACACTCTTTCCAGAAAACAGCTCCAGAGTCTCTGCAAGAAGAACGATATTGAGAAAATGGATGACTTCTCTGATGTTACTGGCACAGCGTTAGCTTCTCTTGAGGTTTCCACTGAATCTTCTGCAGACAAATGCAGCATGGATGCAGAAAATGACTTT------------------------------------------------------------------------------------------CTTGAGTCTTCACTGGTGGATGTAGTTGTACATGAAAGCAAGACTACAAATGTTGTAAAGGAAGTGGAGAAGAAAAGAACTGCACTGCAGACACTGCCAGAGAACATGATGGAGAAA
ATGACCATGGATTTTCACACTCTTTCCAGAAAACAGCTCCAGAGTCTCTGCAAGAAGAACGATATTGAGAAAATGGATGACTTCTCTGATGTTACTGGCACAGCGTTAGCTTCTCTTGAGGTTTCCACTGAATCTTCTGCAGACAAATGCAGCATGGATGCAGAAAATGACTTTCTTGAGACATCTTTTGAGAAGGCAATGACAATTATGGACCTCTGTTCTGAATCCTTAGCAGCAGACAAAATGAATGCTGAAAATGCCACTCTCGAGTCTTCATTGGTCGATGTAGTTGTACATGAAAGCAAGACTACAAATGTTGTAAAGGAAGTGGAGAAGAAAAGAACTGCACTGCAGACACTGCCAGAGAACATGATGGAGAAA
I would appreciate your help. Thank you.
Rom
Using Pierre Lindenbaum's linearize FASTA code
There are two sequences per entry, and the provided awk script has merged them into a single linear sequence. The objective is to get two linear sequences per entry. Would that be possible, please? Thank you!
The above provided script produces one linear sequence:
The goal is to make two linear sequences:
I think that's a tough one. Will the two sequences under each header always be of equal length? If so, I'm thinking of a custom script that splits the lines under each header in half: the first half of the lines is joined into the first sequence and the second half into the second.
Here's my attempt (disclaimer: I am a Python novice):
This is designed to be saved as a script, e.g. "linearize.py", and run with an input file and the desired output filename as arguments (the script is included as a Google Drive link below).
linearize.py
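The halving idea described above can be sketched roughly as follows. This is not the linked script, only a minimal illustration under stated assumptions: entries are separated by blank lines, the first non-blank line of each entry is the header, and both sequences wrap onto the same number of lines (true when they are aligned to equal length). The function names `linearize_entry` and `linearize_axt` are made up for this example.

```python
#!/usr/bin/env python3
"""Sketch: linearize wrapped entries that hold two sequences each."""
import sys


def linearize_entry(lines):
    """Turn one entry (header + wrapped sequence lines) into three lines."""
    header, seq_lines = lines[0], lines[1:]
    # Assumption: both sequences occupy the same number of wrapped lines,
    # so an odd or zero line count means the entry cannot be split in half.
    if not seq_lines or len(seq_lines) % 2 != 0:
        raise ValueError(f"cannot halve sequence lines under {header!r}")
    half = len(seq_lines) // 2
    first = "".join(seq_lines[:half])
    second = "".join(seq_lines[half:])
    return [header, first, second]


def linearize_axt(text):
    """Linearize every blank-line-separated entry in `text`."""
    entries = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the entry collected so far.
            entries.append(linearize_entry(current))
            current = []
    if current:
        entries.append(linearize_entry(current))
    # Keep one blank line between entries in the output.
    return "\n\n".join("\n".join(entry) for entry in entries) + "\n"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python3 linearize.py input.axt output.axt
    with open(sys.argv[1]) as handle:
        result = linearize_axt(handle.read())
    with open(sys.argv[2], "w") as handle:
        handle.write(result)
```

If the two sequences under a header could ever wrap onto different numbers of lines, halving by line count would silently mix them, so comparing the lengths of the two joined sequences would be a sensible extra check in that case.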
(+1) I edited usr/bin/python3 to /usr/bin/python3. Why not post it as an answer?