How to linearize nucleotide lines in an axt file
5 months ago
resug ▴ 40

Hi,

I am trying to put each nucleotide sequence on a single line, instead of having it split every 60 nucleotides, while keeping the blank line between entries. So the axt file should go from this:

Lm_g1002.t1-Lp_g998.t1
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACT
ATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTAT
CTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACT
ATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTAT
CTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT

Lm_g1010.t1-Lp_g1011.t2
ATGTCGAAGGAATTCAACGTCCCTCCAGTAGTTTTCCCCTCTGGCGGAAACCCAGGCCCA
AACCCTAACCTCCACGACGACGCCGATCTTTCCGGCCCCGTCCTCTGCCTCATGATGTTC
GGCCTCTTCCAGCTCCTCGCCGGAAAGATCCACTTCGGTATCATCCTCGGTTGGGTAACC
GTTTCTGCGCTTTTTCTCTACGTTGTTTTCAATATGCTTGCTGGTCGTAACGGGCGGTGT
TGTCATTTCCGG------CGTCGCCGCGGTTTTTGTGATTTG------------------
------------------------------------------------------------
---GTC
ATGTCGAAGGAATTCAACGTCCCTCCAGTAGTTTTCCCCTCTGGCGGAAACCCAGGCCCA
AACCCTAACCTCCACGACGACGCCGATCTTTCCGGCCCCGTCCTCTGCCTCATGATGTTC
GGCCTCTTCCAGCTCCTCGCCGGAAAGATCCACTTCGGTATCATCCTCGGTTGGGTAACC
GTTTCTGCGCTTTTTCTCTACGTTGTTTTCAATATGCTTGCTGGTCGTAACGGTAATTTG
GATCTTTACCGGTGTGTTAGTTTGATCGGGTATTGTATGCTACCTATGGTGATTTTGTCC
CATCGTGGATTAGTTGCGTATGGTTGCTTTCTTATTTACACTTTGTTTTCGCTTCTTGTC
GTGTTT

Lm_g1018.t1-Lp_g1004.t1
ATGGCTTTGAATAGCAGTGTCAGATCTACTGCCAAGTTAATCGCTTCTTCTCATTCATCC
CATCAGGGATTGAAAATGTCCCTCGCGGTGTTCAGTGCTTTCAGCATTGGTGTTGCAGTT
CCTATCTATGCTGTCATTTTCCAGCAAAAGAAGACAGCTTCTGGT
ATGGCTTTGAATAGCAGTGTCAGATCTACTGCCAAGTTAATCGCTTCTTCTCATTCATCC
CATCAGGGATTGAAAATGTCCCTCGCGGTGTTCACTGCTTTCAGCATTGGTGTTGCAGTT
CCTATCTATGCTGTCATTTTCCAGCAAAAGAAGACAGCTTCTGTT

Lm_g100.t1-Lp_g97.t1
------ATGGATTTTCACACTCTTTCCAGAAAACAGCTCCAGAGTCTCTGCAAGAAGAAC
GATATTGAGAAAATGGATGACTTCTCTGATGTTACTGGCACAGCGTTAGCTTCTCTTGAG
GTTTCCACTGAATCTTCTGCAGACAAATGCAGCATGGATGCAGAAAATGACTTT------
------------------------------------------------------------
------------------------CTTGAGTCTTCACTGGTGGATGTAGTTGTACATGAA
AGCAAGACTACAAATGTTGTAAAGGAAGTGGAGAAGAAAAGAACTGCACTGCAGACACTG
CCAGAGAACATGATGGAGAAA
ATGACCATGGATTTTCACACTCTTTCCAGAAAACAGCTCCAGAGTCTCTGCAAGAAGAAC
GATATTGAGAAAATGGATGACTTCTCTGATGTTACTGGCACAGCGTTAGCTTCTCTTGAG
GTTTCCACTGAATCTTCTGCAGACAAATGCAGCATGGATGCAGAAAATGACTTTCTTGAG
ACATCTTTTGAGAAGGCAATGACAATTATGGACCTCTGTTCTGAATCCTTAGCAGCAGAC
AAAATGAATGCTGAAAATGCCACTCTCGAGTCTTCATTGGTCGATGTAGTTGTACATGAA
AGCAAGACTACAAATGTTGTAAAGGAAGTGGAGAAGAAAAGAACTGCACTGCAGACACTG
CCAGAGAACATGATGGAGAAA

To this:

Lm_g1002.t1-Lp_g998.t1
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACTATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTATCTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACTATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTATCTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT

Lm_g1010.t1-Lp_g1011.t2
ATGTCGAAGGAATTCAACGTCCCTCCAGTAGTTTTCCCCTCTGGCGGAAACCCAGGCCCAAACCCTAACCTCCACGACGACGCCGATCTTTCCGGCCCCGTCCTCTGCCTCATGATGTTCGGCCTCTTCCAGCTCCTCGCCGGAAAGATCCACTTCGGTATCATCCTCGGTTGGGTAACCGTTTCTGCGCTTTTTCTCTACGTTGTTTTCAATATGCTTGCTGGTCGTAACGGGCGGTGTTGTCATTTCCGG------CGTCGCCGCGGTTTTTGTGATTTG---------------------------------------------------------------------------------GTC
ATGTCGAAGGAATTCAACGTCCCTCCAGTAGTTTTCCCCTCTGGCGGAAACCCAGGCCCAAACCCTAACCTCCACGACGACGCCGATCTTTCCGGCCCCGTCCTCTGCCTCATGATGTTCGGCCTCTTCCAGCTCCTCGCCGGAAAGATCCACTTCGGTATCATCCTCGGTTGGGTAACCGTTTCTGCGCTTTTTCTCTACGTTGTTTTCAATATGCTTGCTGGTCGTAACGGTAATTTGGATCTTTACCGGTGTGTTAGTTTGATCGGGTATTGTATGCTACCTATGGTGATTTTGTCCCATCGTGGATTAGTTGCGTATGGTTGCTTTCTTATTTACACTTTGTTTTCGCTTCTTGTCGTGTTT

Lm_g1018.t1-Lp_g1004.t1
ATGGCTTTGAATAGCAGTGTCAGATCTACTGCCAAGTTAATCGCTTCTTCTCATTCATCCCATCAGGGATTGAAAATGTCCCTCGCGGTGTTCAGTGCTTTCAGCATTGGTGTTGCAGTTCCTATCTATGCTGTCATTTTCCAGCAAAAGAAGACAGCTTCTGGT
ATGGCTTTGAATAGCAGTGTCAGATCTACTGCCAAGTTAATCGCTTCTTCTCATTCATCCCATCAGGGATTGAAAATGTCCCTCGCGGTGTTCACTGCTTTCAGCATTGGTGTTGCAGTTCCTATCTATGCTGTCATTTTCCAGCAAAAGAAGACAGCTTCTGTT

Lm_g100.t1-Lp_g97.t1
------ATGGATTTTCACACTCTTTCCAGAAAACAGCTCCAGAGTCTCTGCAAGAAGAACGATATTGAGAAAATGGATGACTTCTCTGATGTTACTGGCACAGCGTTAGCTTCTCTTGAGGTTTCCACTGAATCTTCTGCAGACAAATGCAGCATGGATGCAGAAAATGACTTT------------------------------------------------------------------------------------------CTTGAGTCTTCACTGGTGGATGTAGTTGTACATGAAAGCAAGACTACAAATGTTGTAAAGGAAGTGGAGAAGAAAAGAACTGCACTGCAGACACTGCCAGAGAACATGATGGAGAAA
ATGACCATGGATTTTCACACTCTTTCCAGAAAACAGCTCCAGAGTCTCTGCAAGAAGAACGATATTGAGAAAATGGATGACTTCTCTGATGTTACTGGCACAGCGTTAGCTTCTCTTGAGGTTTCCACTGAATCTTCTGCAGACAAATGCAGCATGGATGCAGAAAATGACTTTCTTGAGACATCTTTTGAGAAGGCAATGACAATTATGGACCTCTGTTCTGAATCCTTAGCAGCAGACAAAATGAATGCTGAAAATGCCACTCTCGAGTCTTCATTGGTCGATGTAGTTGTACATGAAAGCAAGACTACAAATGTTGTAAAGGAAGTGGAGAAGAAAAGAACTGCACTGCAGACACTGCCAGAGAACATGATGGAGAAA

I would appreciate your help. Thank you.

Rom

axt • 557 views

Using Pierre Lindenbaum's linearize-FASTA code:

awk '/^Lm/ {printf("%s%s\t",(N>0?"\n\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < original.fa | tr "\t" "\n" > reformat.fa

There are two sequences per entry, but the provided awk script merges them into a single linear sequence. The objective is to get two linear sequences per entry. Would this be possible? Thank you!

The script provided above produces one linear sequence:

Lm_g1002.t1-Lp_g998.t1
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACTATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTATCTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTTATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACTATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTATCTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT

The goal is to make two linear sequences:

Lm_g1002.t1-Lp_g998.t1
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACTATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTATCTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT
ATGAGGTGCTTAAAGATTTTTCAGTTCATAACATCTTTGACTAAATCCTCTGGCTCAACTATAGCTGCTCGGCTAACTGTCTTAGCCCAGCAGTGTTCAACTTGTCATGACTTAGCTTATCTTGGTTCAGCTGGTCTTGGTTTAGCTGATATTGGTTCAAATGTT

I think that's a tough one. Will the two sequences under each header always be the same length? If so, I'm thinking of a custom script that splits the lines under each header in half: the first half of the lines belongs to the first sequence and the second half to the second.
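
To make the halving idea concrete, here is a minimal sketch for a single entry. The function name `split_entry` and the toy data are made up for illustration, and it assumes both sequences have the same length and the same wrap width, so that each one occupies exactly half of the lines.

```python
def split_entry(seq_lines):
    """Join the wrapped lines of one entry into its two sequences.

    Assumes both sequences have the same length and the same wrap
    width, so each one occupies exactly half of the lines.
    """
    if len(seq_lines) % 2 != 0:
        raise ValueError("odd number of sequence lines; cannot halve")
    half = len(seq_lines) // 2
    first = "".join(seq_lines[:half])
    second = "".join(seq_lines[half:])
    return first, second


# Toy entry (made-up data): two 7-nt sequences wrapped at 4 nt per line.
wrapped = ["ATGA", "GGT", "ATGC", "GG-"]
print(split_entry(wrapped))
```

An odd number of lines means the equal-length assumption does not hold for that entry, so the sketch raises an error rather than guessing where to cut.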

Here's my attempt (disclaimer: I am a Python novice):

#!/usr/bin/python3

import sys

# Read the input entry by entry and write each entry as a header followed by
# two linear sequences.
# - The first non-empty line of an entry is the header.
# - Subsequent lines are sequence lines and are collected in a list.
# - An empty line ends the entry. Because the two aligned sequences have the
#   same length, each one occupies exactly half of the collected lines.
# - If the input ends with an empty line, the last entry has already been
#   written (including the empty line). If it ends right after the last
#   sequence line, the last entry is written without a trailing empty line.


def format_entry(header, lines):
    # Join the first half of the lines into sequence 1, the rest into sequence 2.
    half = len(lines) // 2
    return header + "\n" + "".join(lines[:half]) + "\n" + "".join(lines[half:]) + "\n"


with open(sys.argv[1], "r") as infile, open(sys.argv[2], "w") as outfile:
    header = None
    lines = []
    for line in infile:
        line = line.strip()
        if not line:
            # Empty line: the current entry (if any) is complete.
            if header is not None:
                outfile.write(format_entry(header, lines) + "\n")
            header = None
            lines = []
        elif header is None:
            header = line
        else:
            lines.append(line)

    # The input ended without a trailing empty line.
    if header is not None:
        outfile.write(format_entry(header, lines))

This is meant to be saved as a script, e.g. "linearize.py", and run with the input file and the desired output filename as arguments (the script is also included as a Google Drive link below):

python3 linearize.py inputFile.txt outputFileName.txt

linearize.py
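
Since the approach relies on both sequences in an entry having the same length, it may be worth checking the result afterwards. The following is only a sketch of such a check: the function name `find_bad_entries` is hypothetical, and it assumes the output layout shown in the question (a header, two sequence lines, then a blank line).

```python
import re


def find_bad_entries(text):
    """Return headers of entries that are not a header plus two equal-length lines."""
    bad = []
    # Entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        rows = [row.strip() for row in block.split("\n") if row.strip()]
        if not rows:
            continue
        header, seqs = rows[0], rows[1:]
        if len(seqs) != 2 or len(seqs[0]) != len(seqs[1]):
            bad.append(header)
    return bad


# Made-up example: the second entry has sequences of different lengths.
example = "a-b\nATG\nATG\n\nc-d\nATGA\nATG\n"
print(find_bad_entries(example))
```

To check a real file, read outputFileName.txt into a string and pass it to the function. An empty list means every entry has exactly two sequences of equal length.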


(+1) I edited usr/bin/python3 to /usr/bin/python3. Why not post it as an answer?
