Question

Convert Non-Redundant Back To Redundant Fasta File

1

11.5 years ago

redspider19800915 ▴ 40

I have a non-redundant FASTA file in the following format:

 TCACCCATCGTACCCACTTG    1
 TTTTTGATCCTTCGATGTCGGC    64
 TCTTGAAGTAGAAAAGTTGTGGTT    2
 CGTAAGAATGTCCACAGCCAAGC    1
......

The 2nd column is the abundance of the corresponding read. I would like to have a new FASTA file containing all redundant sequences, i.e. the 1st read appears once, the 2nd appears 64 times, the 3rd appears twice, and so on.

Could anyone help?

perl • 4.0k views

Updated 16 months ago by Ram 44k • written 11.5 years ago by redspider19800915 ▴ 40

4

http://whathaveyoutried.com

Reply by Pierre Lindenbaum 164k, 11.5 years ago

4

Here's a hint using Pierre's comment as input:

$ perl -e 'print "http://whathaveyoutried.com\n\n" x 2'

http://whathaveyoutried.com

Reply by SES 8.6k, 11.5 years ago

0

I love the people here.

Reply by Ashutosh Pandey 12k, 11.5 years ago

0

So do I. In case it's not obvious, you would replace the URL in my code above with the DNA string (the first column of the example input) and the '2' with the appropriate count (the second column). With the -a switch, Perl autosplits each input line into fields (giving you the values in those columns), so you would only need to make a slight modification to add a header.
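
For anyone more at home in Python, the same repetition trick can be sketched with the `*` operator on strings, which plays roughly the role of Perl's `x`. The helper name `expand_line` and the `read_` header prefix are made up for illustration, and the input is assumed to be one whitespace-separated "sequence count" pair per line:

```python
def expand_line(line, index):
    """Return FASTA text for one 'sequence count' line (hypothetical helper)."""
    seq, count = line.split()                  # split on any run of whitespace
    record = ">read_%d\n%s\n" % (index, seq)   # one header line plus one sequence line
    return record * int(count)                 # string repetition, like Perl's x


# The third line of the example input, repeated twice
print(expand_line("TCTTGAAGTAGAAAAGTTGTGGTT    2", 3), end="")
```

Note that every copy gets the same header here, which some downstream tools may reject; the answers below number each copy instead.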

Reply by SES 8.6k, 11.5 years ago

score 8 · Answer 1 · 2013-05-23

11.5 years ago

Frédéric Mahé ★ 3.2k

Here is a solution using awk:

awk '{for (i=1 ; i <= $2 ; i++) {print ">seq_" NR "_" i "\n" $1}}' dereplicated_file > redundant.fasta

The output looks like this:

>seq_1_1
TCACCCATCGTACCCACTTG
>seq_2_1
TTTTTGATCCTTCGATGTCGGC
>seq_2_2
TTTTTGATCCTTCGATGTCGGC
>seq_2_3
TTTTTGATCCTTCGATGTCGGC
...

score 0 · Answer 2 · 2013-05-23

Hi,

Do you know Python? From a technical perspective, this file does not look like a FASTA file. You can simply parse it line by line, splitting on "\t", and then write each sequence to the new file the given number of times.

Something like this:

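# Expand each "sequence<TAB>count" line into `count` copies of the sequence.
with open("file.txt") as ifile, open("file2.txt", "w") as ofile:
    for line in ifile:
        fields = line.strip().split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seq = fields[0]
        number = int(fields[1])
        for _ in range(number):
            ofile.write("%s\n" % seq)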

Cheers!
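
Since the question asks for FASTA output, here is a hedged variant of the same idea that also writes a header line per copy. It is only a sketch: the function name `write_redundant_fasta` is hypothetical, the `seq_<line>_<copy>` header scheme is borrowed from the awk answer, and it splits on any whitespace rather than strictly on tabs, on the assumption that the real file may be space-separated.

```python
import io


def write_redundant_fasta(infile, outfile):
    """Write each sequence `count` times with a unique header.

    Returns the number of FASTA records written.
    """
    written = 0
    for line_no, line in enumerate(infile, start=1):
        fields = line.split()              # any whitespace, tabs or spaces
        if len(fields) < 2:
            continue                       # skip blank or malformed lines
        seq, count = fields[0], int(fields[1])
        for copy in range(1, count + 1):
            outfile.write(">seq_%d_%d\n%s\n" % (line_no, copy, seq))
            written += 1
    return written


# Demo on in-memory data; real use would pass open file handles instead.
demo_in = io.StringIO("TCACCCATCGTACCCACTTG    1\nTCTTGAAGTAGAAAAGTTGTGGTT    2\n")
demo_out = io.StringIO()
write_redundant_fasta(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Line numbers count every input line, skipped ones included, so the headers point back to the original row in the abundance table.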

score 0 · Answer 3 · 2013-05-23

11.5 years ago

Martin A Hansen 3.0k

With Biopieces (www.biopieces.org) do:

read_tab -i test.tab -k SEQ,COUNT | duplicate_record -k COUNT | add_ident -k SEQ_NAME | merge_vals -k SEQ_NAME,COUNT | write_fasta -xo out.fasta

score 0 · Answer 4 · 2013-05-26

Here's a Perl option:

perl -ane 'print ">Seq_$._$_\n$F[0]\n" for 1..$F[1]' inFile > outFile

Partial output on your dataset:

>Seq_1_1
TCACCCATCGTACCCACTTG
>Seq_2_1
TTTTTGATCCTTCGATGTCGGC
>Seq_2_2
TTTTTGATCCTTCGATGTCGGC
...
>Seq_2_63
TTTTTGATCCTTCGATGTCGGC
>Seq_2_64
TTTTTGATCCTTCGATGTCGGC
>Seq_3_1
TCTTGAAGTAGAAAAGTTGTGGTT
>Seq_3_2
TCTTGAAGTAGAAAAGTTGTGGTT
>Seq_4_1
CGTAAGAATGTCCACAGCCAAGC
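
Whichever one-liner you use, a quick way to sanity-check the result is to collapse the redundant file again and compare the counts with the second column of the original table. A minimal Python sketch, assuming each record has its sequence on a single line (as in all the outputs in this thread); `collapse_fasta` is a made-up name:

```python
from collections import Counter


def collapse_fasta(lines):
    """Count how often each sequence occurs in single-line FASTA records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):   # ignore headers and blank lines
            counts[line] += 1
    return counts


# Tail of the partial output above, as a list of lines
fasta = [
    ">Seq_3_1", "TCTTGAAGTAGAAAAGTTGTGGTT",
    ">Seq_3_2", "TCTTGAAGTAGAAAAGTTGTGGTT",
    ">Seq_4_1", "CGTAAGAATGTCCACAGCCAAGC",
]
for seq, n in collapse_fasta(fasta).items():
    print(seq, n)
```

For a real file, pass the open file handle instead of the list. If two rows of the input table held the same sequence, their counts are summed here.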