Getting rid of spaces in txt file with nucleotide sequences
4.6 years ago
stambukf ▴ 10

Hi everyone. I'm new to this forum, but I have a lot of data from RNA-seq experiments. For one of them I have a file of unmatched sequences in .txt format. My problem is that the records don't have the initial ">" and the sequences are in this format (examples):

NI_1_(paired)_trimmed_(paired)_contig_2465 (here is space and then the immediatly the sequence) ATAGACATAAATGATAATTTACATGGTAACGTAAACAAAGCCAATGAAAAATTAACATGTAAACATGCTTTGCCCAACATAATTTCTGAACGGATTATATTTACACAGTAAAAATATGAAAATGAAATTGCCATTTTTGAAATAATACCTAACAAAAATACCGGTAATTACAATATTTACAACACATTAAACAATTGAATACAACTTGGAATGCATATTTGTCTTGTTTTTCATGCAATGACTGTACATGACCTCATTTCGATAAATTGTTGCAACACATAAACTACAGCACATGTACATGTACTGCTGAAAGGCATATACTGTATATGGTTATGATGGGAAATGATGTTGAACTTATAATATGCATTGAACAATACAAATTGCATTTACACCACTTGTATGCACTAATATGTTTTATAGTCATGATTTCACAAACTTTTTACTAGTACAAGGATTGGGATTGGCAATTAAGAATAGCATAATTTTTGCATAATATGTGCATCACTGCCAATATATTCACATTTAACTATTTGTCTTACAAAATACAAAAGCACATTTATAGATCATTATTTACTTCTACTTTATTTCTATTTATTTATTCAAGATCAAAAAATGTTTTATTGACATATATACTTTTCTAAAAATCAAAATGTCACGATTTTTGAAATTCATACTGTTATGAAAAAAATACAGACACTTAAAAAAAAAAGAATGAAAACTTGTTTGTTACCTGGCGAGTAACAATAAAAGAAAAAAAAAATATCAATACTTATAAAGTATTCAATTACAGTATGTTGCGCTATTTCACTATTTCACAAGTGACTCCAACACAACGGTACAAAATATTCATTCATAAAGACATAAACCAGAGACTGATTGCACAAAGCTGCCCTAGAGTTTAGTTATCGACAAAATACATTTTTTTCTATATACAGTAAAACATTCCACCTGCAAATTGTATCGCTAGAAAACCATGGTCTATTCTTAGAATTGTTTTCCAACAAAAGCAAACTTGGATCAATTCCACTCATTAAGAATACTGGTGCTAAAAATGAAACACAGGTACATGTACTTTAGTAATTATAATAGGGAACATTTAACTTCATCAATCATTATCCAACATTCATAAGATAGCATTGAAGCAGACTAGTGGTGTACAGCAGAAACCAGCCAGAGCAAGGTCTAGATCTACTGTATTGCCCTATATATCTATTCAATTACCGTATTTTCCGGACAATAAGTCACACCTGTGTATAAGTCCCAAAGGCCTTTTTTACGATTTTTTCAGTTTTCATACACACATAAATCGCACTCCCGATTTTCAGGNNNNNNNNNNNNNAAAAAGTCGCGACTTATACTCCAGAAAATACGGTAACCTTTTACACATGTATGTTGAACACTCAGAACATTGTAAGGTGTTACAAAATTAGTCCAGTGGAAGGGTCAAGACAATATGACAAAGCAAATATGGCACAAAACAACTCACAACTGAACTGGAACAGGTGGTTGTCATGGACGAAAGCTGAGGTACCGTGGAGAAATGATGTATCACATTAAAATACCATTGTAGGCTTTCAACCAACAGATACATAATATGCATGTCAAATATACATGTGACAATCACAGGCCTTTCTTAGTGACAATCATAATACATGTGCGCCTACCAGGAAACATCTGTAAATTTCAATGTTGCATGAATTTCCTTAAAATCTTTCATACTTTCTTATAGAAACAAAAAACAAAGTAATCCATACCAATCCAAATCCAAAACACTTACAACTGTTACACATCTAAGTCACAACTATTCATAACACATCTATACTATTACATTTATAAGTCACTTACAACAGTTACACATCTTAGACTAAGTCACTTCGAACAGCAATACCCAACTGTTACTGTTA
CACATCTAAGTCACTTACAACTATTACACGTCTAAGTCACTTAATTCGAACTGTTACACATCTAAGTCACTCTTAAGTTACTTAAATCGAACTGTTTCACATCTAAGTCACTTACAACTGTTGCACATCTAAGTCACTTACAACTGTTGCACATCTAAGTCACTTTAAACTGTTACACGTCTAAGTCAATCACAACTGTTGCACATCTAAGTCACTTACAACTGCATATTGCATCCTTTTATACACATCACAGATCTACTGTAACATAAAAAACAATTTGAAATAAATCACAGTCAAATATTCACTCAACATCATATACACTACCGTAACAATTAATCTCCTTTCAAATGGAAACTTCCCCTGATGATCTACATCAGCAGTTTAATATAGTGATATCCAGAATATTACTATACTGCATTTGCTGAAACTATAGCAACCCTTATTTCTGTAAAAGCGCCCATCTCCCTTTCAGGGAAGAAAATCTTACATATACAGTATAACTGTAATTAGTTTTTTTTGTTAGTAAAAAGTGCAAAAGCTGAACTTGGGTTCTGTACAAGTATCAATGAATCTAGAAATGTCGGTGCCAAAATATGGATGAGGCGCAACACGACACAACACAATGGACATGGACTAATGTATGTGACCTCGATTTCACAGTGGGAATATCAAAGGGAAGATGAAATATTTAGTATCTGTAGATTATTCTGAGATTGAGCTTAACATTCCAATTTTTTTTTTTTTTTTTGCATTTTGTAAAACAGAAACTGCACTAAATATTGATATCCATAGAAATCAGTATCAAATCTATGGTGCACTCAATGATTTAAATTTTCTACAAATGCATAAAACACAGAAACAATATTCCTGTACCTGTATCTTTAAAAATTAACTTGTTTTCAAAGTGATTCCATTTCACTGCCTATTCAAAACCCAAGTTCTGTAGTGATTTCTAATAGCTTAGAATAAGTCTAGTTGACTTGAAAGTTTTGATTCACATCATCAGATTCTAACATATAGCTTATAAACAATTCTGAACGTACTGGGTCTACTTGTTTAAAACTACTTTAGAAGTTAACTTTCCTTAACTCTTAGATATTGAATTGCTAGCAATTTTCCTCATTTTGAGATATATTCTATTCCGAATTTGCATATTACAGAGTTATCTGCACCTGCGGG NI_1_(paired)_trimmed_(paired)_contig_2468 (same here) 
GCTCAACTCAATGTCTGTGATCCTCTCCATTGTTCTCTCCTAGAATAAAAACAAAAGTACAAAATCAGCGCTCATTCATATCAATCATAGGAAAGTACCTATATCTTTATAGAGAGAACACAAGCACCCACATGTTATATTGCAATTAGCTGGATGCATACCCCTACCATGATATGTTAACTTAAAACCCAAGGTTTATATATATTGACTAGCACTTATAATGCTAAATGATATAGCATTATTAAAATGGGAAATTAAAAGTAATACAATTGAACGTAAGTGCTATACAATGAACATAGCTTATGCAACATTTCCATGCATTGCTTTTATTTGCATTAAAATAAACTTTTAAATCATGCAATGATAATAAGAAAATCAGGAAAAGCCATCAACCAGCAAAAAATTCAAAACAAGAATTAAAATTTTAAATCATAAATCATTACTGTGCATTTAAGGAAGGTTATCATCAACAGTATAATAGGCAGTGATTAATTTTGAAAGCAATAATATAAGACAACTGGGATCAATGTTTTGCAGCTAAAAGTGCTATTAATGCACTAGTACCTGTAAAAGCAGATTGAAAAAGAACATTGCAGCCAGTCGATAAAAGCCATGTTTATCACAACCTTTTCAGCAAGACCAATACATTTTTATATGAGCCTGACGCCTTCATTTTACATTGAATTTGAATTTACAACTGGATGCCATTTAGCAATGTCCTGATAACAGGTCAAAAGGATGAAGGAATTTGCAGGAAAAGCGAAGAGGATCTATATCTATAAGCTAGCTATATATATAGAGGCATTGCAGGTAGCGAAGGGCAAAACCTGACACAAGGTATACGTACGCATGCTAGAAAATGAGATAATTGACATTTCATTGTCAAAAATAATGTTCAGGTTTTAATATCTATTTGAAGAATAATCTTATCAGCTTCAATAATGATTTTCTTTAAAAACTAGATAAATGCACCCAAAAAAAATGTTTTTGAAGTATGTTTATATATTT

I have already added the ">" symbol with an awk script, but I can't get rid of the spaces that separate the names from the sequences.

I need something like this:

NI_1_(paired)_trimmed_(paired)_contig_2465 ATAGACATAAATGATAATTTACATGGTAACGTAAACAAAGCCAATGAAAAATTAACATGTAAACATGCTTTGCCCAACATAATTTCTGAACGGATTATATTTACACAGTAAAAATATGAAAATGAAATTGCCATTTTTGAAATAATACCTAACAAAAATACCGGTAATTACAATATTTACAACACATTAAACAATTGAATACAACTTGGAATGCATATTTGTCTTGTTTTTCATGCAATGACTGTACATGACCTCATTTCGATAAATTGTTGCAACACATAAACTACAGCACATGTACATGTACTGCTGAAAGGCATATACTGTATATGGTTATGATGGGAAATGATGTTGAACTTATAATATGCATTGAACAATACAAATTGCATTTACACCACTTGTATGCACTAATATGTTTTATAGTCATGATTTCACAAACTTTTTACTAGTACAAGGATTGGGATTGGCAATTAAGAATAGCATAATTTTTGCATAATATGTGCATCACTGCCAATATATTCACATTTAACTATTTGTCTTACAAAATACAAAAGCACATTTATAGATCATTATTTACTTCTACTTTATTTCTATTTATTTATTCAAGATCAAAAAATGTTTTATTGACATATATACTTTTCTAAAAATCAAAATGTCACGATTTTTGAAATTCATACTGTTATGAAAAAAATACAGACACTTAAAAAAAAAAGAATGAAAACTTGTTTGTTACCTGGCGAGTAACAATAAAAGAAAAAAAAAATATCAATACTTATAAAGTATTCAATTACAGTATGTTGCGCTATTTCACTATTTCACAAGTGACTCCAACACAACGGTACAAAATATTCATTCATAAAGACATAAACCAGAGACTGATTGCACAAAGCTGCCCTAGAGTTTAGTTATCGACAAAATACATTTTTTTCTATATACAGTAAAACATTCCACCTGCAAATTGTATCGCTAGAAAACCATGGTCTATTCTTAGAATTGTTTTCCAACAAAAGCAAACTTGGATCAATTCCACTCATTAAGAATACTGGTGCTAAAAATGAAACACAGGTACATGTACTTTAGTAATTATAATAGGGAACATTTAACTTCATCAATCATTATCCAACATTCATAAGATAGCATTGAAGCAGACTAGTGGTGTACAGCAGAAACCAGCCAGAGCAAGGTCTAGATCTACTGTATTGCCCTATATATCTATTCAATTACCGTATTTTCCGGACAATAAGTCACACCTGTGTATAAGTCCCAAAGGCCTTTTTTACGATTTTTTCAGTTTTCATACACACATAAATCGCACTCCCGATTTTCAGGNNNNNNNNNNNNNAAAAAGTCGCGACTTATACTCCAGAAAATACGGTAACCTTTTACACATGTATGTTGAACACTCAGAACATTGTAAGGTGTTACAAAATTAGTCCAGTGGAAGGGTCAAGACAATATGACAAAGCAAATATGGCACAAAACAACTCACAACTGAACTGGAACAGGTGGTTGTCATGGACGAAAGCTGAGGTACCGTGGAGAAATGATGTATCACATTAAAATACCATTGTAGGCTTTCAACCAACAGATACATAATATGCATGTCAAATATACATGTGACAATCACAGGCCTTTCTTAGTGACAATCATAATACATGTGCGCCTACCAGGAAACATCTGTAAATTTCAATGTTGCATGAATTTCCTTAAAATCTTTCATACTTTCTTATAGAAACAAAAAACAAAGTAATCCATACCAATCCAAATCCAAAACACTTACAACTGTTACACATCTAAGTCACAACTATTCATAACACATCTATACTATTACATTTATAAGTCACTTACAACAGTTACACATCTTAGACTAAGTCACTTCGAACAGCAATACCCAACTGTTACTGTTACACATCTAAGTCACTTACAACTATTACACGTCTAAGTCACTTAATTCGAACTG
TTACACATCTAAGTCACTCTTAAGTTACTTAAATCGAACTGTTTCACATCTAAGTCACTTACAACTGTTGCACATCTAAGTCACTTACAACTGTTGCACATCTAAGTCACTTTAAACTGTTACACGTCTAAGTCAATCACAACTGTTGCACATCTAAGTCACTTACAACTGCATATTGCATCCTTTTATACACATCACAGATCTACTGTAACATAAAAAACAATTTGAAATAAATCACAGTCAAATATTCACTCAACATCATATACACTACCGTAACAATTAATCTCCTTTCAAATGGAAACTTCCCCTGATGATCTACATCAGCAGTTTAATATAGTGATATCCAGAATATTACTATACTGCATTTGCTGAAACTATAGCAACCCTTATTTCTGTAAAAGCGCCCATCTCCCTTTCAGGGAAGAAAATCTTACATATACAGTATAACTGTAATTAGTTTTTTTTGTTAGTAAAAAGTGCAAAAGCTGAACTTGGGTTCTGTACAAGTATCAATGAATCTAGAAATGTCGGTGCCAAAATATGGATGAGGCGCAACACGACACAACACAATGGACATGGACTAATGTATGTGACCTCGATTTCACAGTGGGAATATCAAAGGGAAGATGAAATATTTAGTATCTGTAGATTATTCTGAGATTGAGCTTAACATTCCAATTTTTTTTTTTTTTTTTGCATTTTGTAAAACAGAAACTGCACTAAATATTGATATCCATAGAAATCAGTATCAAATCTATGGTGCACTCAATGATTTAAATTTTCTACAAATGCATAAAACACAGAAACAATATTCCTGTACCTGTATCTTTAAAAATTAACTTGTTTTCAAAGTGATTCCATTTCACTGCCTATTCAAAACCCAAGTTCTGTAGTGATTTCTAATAGCTTAGAATAAGTCTAGTTGACTTGAAAGTTTTGATTCACATCATCAGATTCTAACATATAGCTTATAAACAATTCTGAACGTACTGGGTCTACTTGTTTAAAACTACTTTAGAAGTTAACTTTCCTTAACTCTTAGATATTGAATTGCTAGCAATTTTCCTCATTTTGAGATATATTCTATTCCGAATTTGCATATTACAGAGTTATCTGCACCTGCGGG NI_1_(paired)_trimmed_(paired)_contig_2468 
GCTCAACTCAATGTCTGTGATCCTCTCCATTGTTCTCTCCTAGAATAAAAACAAAAGTACAAAATCAGCGCTCATTCATATCAATCATAGGAAAGTACCTATATCTTTATAGAGAGAACACAAGCACCCACATGTTATATTGCAATTAGCTGGATGCATACCCCTACCATGATATGTTAACTTAAAACCCAAGGTTTATATATATTGACTAGCACTTATAATGCTAAATGATATAGCATTATTAAAATGGGAAATTAAAAGTAATACAATTGAACGTAAGTGCTATACAATGAACATAGCTTATGCAACATTTCCATGCATTGCTTTTATTTGCATTAAAATAAACTTTTAAATCATGCAATGATAATAAGAAAATCAGGAAAAGCCATCAACCAGCAAAAAATTCAAAACAAGAATTAAAATTTTAAATCATAAATCATTACTGTGCATTTAAGGAAGGTTATCATCAACAGTATAATAGGCAGTGATTAATTTTGAAAGCAATAATATAAGACAACTGGGATCAATGTTTTGCAGCTAAAAGTGCTATTAATGCACTAGTACCTGTAAAAGCAGATTGAAAAAGAACATTGCAGCCAGTCGATAAAAGCCATGTTTATCACAACCTTTTCAGCAAGACCAATACATTTTTATATGAGCCTGACGCCTTCATTTTACATTGAATTTGAATTTACAACTGGATGCCATTTAGCAATGTCCTGATAACAGGTCAAAAGGATGAAGGAATTTGCAGGAAAAGCGAAGAGGATCTATATCTATAAGCTAGCTATATATATAGAGGCATTGCAGGTAGCGAAGGGCAAAACCTGACACAAGGTATACGTACGCATGCTAGAAAATGAGATAATTGACATTTCATTGTCAAAAATAATGTTCAGGTTTTAATATCTATTTGAAGAATAATCTTATCAGCTTCAATAATGATTTTCTTTAAAAACTAGATAAATGCACCCAAAAAAAATGTTTTTGAAGTATGTTTATATATTT

I hope the community can help me.

Greetings.

RNA-Seq next-gen gene • 1.1k views

I finally got it working. Thanks for your help, @Kevin Blighe.


with awk:

input:

$ cat test.txt
>NI_1_(paired)_trimmed_(paired)_contig_2465 ATAGACATAAATGATAATTTACATGGTAACGTAAACAAAGCCAATGAAAAATTAACATGT...
>NI_1_(paired)_trimmed_(paired)_contig_2468 GCTCAACTCAATGTCTGTGATCCTCTCCATTGT...

output:

$ awk '{print $1"\n"$2}' test.txt
>NI_1_(paired)_trimmed_(paired)_contig_2465
ATAGACATAAATGATAATTTACATGGTAACGTAAACAAAGCCAATGAAAAATTAACATGT...
>NI_1_(paired)_trimmed_(paired)_contig_2468
GCTCAACTCAATGTCTGTGATCCTCTCCATTGT...
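
If the file still looks like the one in the original post (no ">" yet, several name/sequence pairs on one line, or a sequence wrapped onto the next line), printing only $1 and $2 would drop everything after the second field. Here is a minimal sketch of one way to handle that with plain awk. It assumes that names never contain whitespace and that any token made up solely of A, C, G, T, U or N is sequence. The function name tokens_to_fasta and the demo input are made up for illustration:

```shell
# Hypothetical helper: rebuild FASTA from whitespace-separated name/sequence tokens.
tokens_to_fasta() {
  awk '
    { sub(/\r$/, "") }                      # drop a Windows line ending, if any
    {
      for (i = 1; i <= NF; i++) {
        tok = $i
        if (tok ~ /^[ACGTUNacgtun]+$/) {    # assumption: only these letters => sequence
          seq = seq tok                     # rejoin a sequence split over several tokens
        } else {
          if (seq != "") print seq          # flush the previous record
          seq = ""
          sub(/^>/, "", tok)                # avoid a doubled ">" if one is already there
          print ">" tok
        }
      }
    }
    END { if (seq != "") print seq }        # flush the last record
  ' "$1"
}

# Tiny made-up input: the first sequence is wrapped across two lines.
demo=$(mktemp)
printf 'contig_1 ACGT\nTTGA contig_2 GGNN\n' > "$demo"
tokens_to_fasta "$demo"
```

For the demo input this prints >contig_1, ACGTTTGA, >contig_2 and GGNN on four separate lines. A contig name that happened to consist only of those letters would be mistaken for sequence, so it is worth comparing the number of ">" lines with the number of contigs you expect.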
4.6 years ago

Greetings,

You can achieve this via sed. Here is an example:

Original file:

cat test

NI_1_(paired)_trimmed_(paired)_contig_2465 (here is space and then the immediatly the sequence)


ATAGACATAAATGATAATTTACATGGTAACGTAAA...


NI_1_(paired)_trimmed_(paired)_contig_2468 (same here)


GCTCAACTCAATGTCTGTGATC

>header1


GCGATACACGACGCAGTCAGAGATGATGCTG
>header2


ATATTATTGGCCTGTAAT

Solution:

sed '/^$/d' test

NI_1_(paired)_trimmed_(paired)_contig_2465 (here is space and then the immediatly the sequence) 
ATAGACATAAATGATAATTTACATGGTAACGTAAA...
NI_1_(paired)_trimmed_(paired)_contig_2468 (same here)
GCTCAACTCAATGTCTGTGATC
>header1
GCGATACACGACGCAGTCAGAGATGATGCTG
>header2
ATATTATTGGCCTGTAAT

Kevin


Thanks for the answer, but it doesn't solve my problem.

Let me be more specific. I have these sequences:

NI_1_(paired)_trimmed_(paired)_contig_2465 ATAGACATAAA NI_1_(paired)_trimmed_(paired)_contig_2468 GCTCAACTCAA

(These are only 2 examples. I have 10,000 more in the same .txt file.)

And I want it like this:

NI_1_(paired)_trimmed_(paired)_contig_2465 ATAGACATAAA NI_1_(paired)_trimmed_(paired)_contig_2468
GCTCAACTCAA etc etc.

I hope someone can help me.


Hi! I see... the problem is a formatting issue on this website.

As I understand, this is what you want:

cat test 
>NI_1_(paired)_trimmed_(paired)_contig_2465                 ATAGACATAAATGATAATTTACATGGTAACGTAAACAAAGCCAATGAAAAATTAACATGT...
>NI_1_(paired)_trimmed_(paired)_contig_2468     GCTCAACTCAATGTCTGTGATCCTCTCCATTGT...

sed 's/ \+/\n/g' test 
>NI_1_(paired)_trimmed_(paired)_contig_2465
ATAGACATAAATGATAATTTACATGGTAACGTAAACAAAGCCAATGAAAAATTAACATGT...
>NI_1_(paired)_trimmed_(paired)_contig_2468
GCTCAACTCAATGTCTGTGATCCTCTCCATTGT...

Let me know. This assumes that the separators are exclusively space characters, i.e., not tabs.
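
If tabs might be mixed in with the spaces, one possible alternative is tr, which can translate both characters to newlines and squeeze runs of them in a single pass. A small sketch follows. The helper name split_on_blanks and the demo file are hypothetical:

```shell
# Hypothetical helper: put every space- or tab-separated token on its own line.
split_on_blanks() {
  # Map space and tab to newline; -s squeezes repeated newlines into one.
  tr -s ' \t' '\n\n' < "$1"
}

# Made-up input mixing spaces and tabs between name and sequence.
demo=$(mktemp)
printf '>contig_1 \t  ACGT\n>contig_2\tGGCC\n' > "$demo"
split_on_blanks "$demo"
```

Note that this also removes any empty lines already present in the file, because runs of newlines are squeezed as well.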


Exactly! That is what I want. The thing is that these sequences are in a .txt file. Do I have to run the script like this? sed 's/ +/\n/g' infile.txt

Tell me if that is okay. Thanks for the clarification.


Yes, that should work, except that you need a backslash before the '+' sign, as in my code.

You can save it to a new file to check it, like this:

sed 's/ \+/\n/g' infile.txt > outfile ;

If you are feeling super confident, sed can edit a file in place via the -i option:

sed -i 's/ \+/\n/g' infile.txt ;
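
Because in-place editing overwrites the input, it can be worth keeping a backup copy. Here is a sketch that assumes GNU sed: the \+ in the pattern and the \n in the replacement are GNU extensions, so other sed implementations may behave differently. The demo file is made up:

```shell
# Made-up demo file in a throwaway directory.
demo_dir=$(mktemp -d)
printf '>contig_1   ACGT\n>contig_2 GGCC\n' > "$demo_dir/infile.txt"

# -i.bak edits in place and keeps the untouched original as infile.txt.bak.
sed -i.bak 's/ \+/\n/g' "$demo_dir/infile.txt"

cat "$demo_dir/infile.txt"       # reformatted file
cat "$demo_dir/infile.txt.bak"   # original content, unchanged
```

If the result looks wrong, the .bak file can simply be copied back over the edited one.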