How to add last 20 nt to the beginning of reads?
5
0
Entering edit mode
7.5 years ago
BehMah ▴ 50

I have a FASTA file with lots of read sequences and I want to copy the last 20 nucleotide of each read and add it to the beginning of the same read.

Can anyone please tell me how to do that? Thanks a lot

Infasta:

>ID1:
ACGTCATCGATCGATGCTAGCTAGCTAGCTAGCTAgtcgcatcgacggcatgcta

Outfasta:

>ID1:
gtcgcatcgacggcatgctaACGTCATCGATCGATGCTAGCTAGCTAGCTAGCTAgtcgcatcgacggcatgcta
sequence RNA-Seq • 2.5k views
ADD COMMENT
2
Entering edit mode

why would you want to do this?

ADD REPLY
0
Entering edit mode

to change linear sequence to circle transcripts

ADD REPLY
0
Entering edit mode

Thanks so much cschu 181 and venu, both codes worked awesome :))

ADD REPLY
0
Entering edit mode

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

Thank you all for your nice solutions

ADD REPLY
3
Entering edit mode
7.5 years ago
venu 7.1k

Here is something with process substitution (make sure you have linearized fasta)

cat test.fa
>chr1
AGCTAGAGCAGAiiiiij
>chr2
ATCGATAGVCGjjjjji

code

paste <(cat test.fa | paste - - | rev | cut -c 1-5 | rev) <(cat test.fa | paste - -) | awk -F'\t' '{print $2"\n"$1""$3}'

output

>chr1
iiiijAGCTAGAGCAGAiiiiij
>chr2
jjjjiATCGATAGVCGjjjjji

P.S: replace 1-5 with 1-20 for your case.

ADD COMMENT
1
Entering edit mode
7.5 years ago
cschu181 ★ 2.8k
awk '/^>/ { print $0; next; } { tail=substr($0, length($0) - 19); printf("%s%s\n", tail, $0); }' test.fa

But I would also be interested in what you're trying to achieve with this, if you don't mind...

ADD COMMENT
0
Entering edit mode

Thanks cschu181,

I want to make circular sequence of the linears so I have back-splice junctions

ADD REPLY
0
Entering edit mode

Dear cscxhu181, Thnx again. But it adds the tail not to the beginning but further upstream of the tail :(

ADD REPLY
0
Entering edit mode

sorry, I think the problem is my input file not your nice command :) I will let you know if it doesn't work. Thnx

ADD REPLY
1
Entering edit mode
7.3 years ago
Charles Plessy ★ 2.9k

Prompted by the Python and Perl code, and just for the fun of it, here is a Haskell version (that probably shows that I am very much a beginner).

#!/usr/bin/runhaskell
main = interact $ unlines . map add20 . lines
add20 ('>':x) = ">" ++ x
add20 x       = reverse (take 20 (reverse x)) ++ x

But for such tasks one definitely should use shell commands as in the accepted answers.

ADD COMMENT
0
Entering edit mode
7.5 years ago
st.ph.n ★ 2.7k

Python:

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
    for line in f:
        if line.startswith('>'):
            print (line.strip())
        else:
            print(line.strip()[-20:] + line.strip())

Save as append_last20.py, run as python append_last20.py input.fasta

ADD COMMENT
0
Entering edit mode
7.5 years ago
Harry ▴ 10

Perl (Using my own input file similar to yours):

Input: seq1

>ID1:
ATCGATGCGCTATATCGCGATCGATGCTATGCgggggCTACGCTAGCTTCAG
>ID2:
AGCTAGCTGGATATGCTATCGATGCTAGCTACgggggCTAGCATGCTGATCG
>ID3:
GCGTAGCAGCATGCATGTATAGCTGCATCGAAgggggCATGCATGCTAGCTA

Code:

use strict;
use warnings;
my $regex = '.{20}$';
my $last20 = "";

my $filename = 'seq1';
open(my $fh, '<encoding(UTF-8)', $filename)
    or die "Could not open fie '$filename' $!";

while (my $row = <$fh>) {
    chomp $row;
    if($row =~ /^[a-zA-Z]/){
        if($row =~ /($regex)/){
            $last20 = $1;
            $last20 .= $row;
            print "$last20\n";
        }
    }
    else{
        print "$row\n";
    }
}

Output:

>ID1:
gggggCTACGCTAGCTTCAGATCGATGCGCTATATCGCGATCGATGCTATGCgggggCTACGCTAGCTTCAG
>ID2:
gggggCTAGCATGCTGATCGAGCTAGCTGGATATGCTATCGATGCTAGCTACgggggCTAGCATGCTGATCG
>ID3:
gggggCATGCATGCTAGCTAGCGTAGCAGCATGCATGTATAGCTGCATCGAAgggggCATGCATGCTAGCTA
ADD COMMENT
0
Entering edit mode

I think the format shown in original thread was not exact format OP wanted. Actually it is fasta but biostars assumes a line starting with > sign as block quote. Hence this different format.

@OP: Better change the format as described here

ADD REPLY
0
Entering edit mode

Good catch, I've modified the post to better reflect the data OP has.

ADD REPLY
0
Entering edit mode

Well done :) I guess you missed this thread.

ADD REPLY
0
Entering edit mode

Gha, I feel like a bot sometimes :-o

ADD REPLY
0
Entering edit mode

Thanks for the heads up, I have modified it to fit the correct format. This was my first reply as a user so I am still learning! :)

ADD REPLY

Login before adding your answer.

Traffic: 1872 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6