Hi people,
I have two FASTA files with redundant reads. I want a single file with all the reads but without the redundancy, using awk. Can someone help me?
If the IDs of the sequences (the bit after the >) are the same for identical sequences, you could do something like this:
cat file1.fa file2.fa | awk '{if($1 ~ /^>/){name=$1}else{print name"\t"$1}}' | sort | uniq | awk '{print $1"\n"$2}'
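That assumes each sequence sits on a single line. If the sequences are wrapped over several lines, a variant of the same idea that first joins each record onto one line might look like this (a sketch; it assumes headers start with > and that neither headers nor sequences contain tabs):
cat file1.fa file2.fa | awk '/^>/{if(seq)print name"\t"seq; name=$0; seq=""; next}{seq=seq $0} END{if(seq)print name"\t"seq}' | sort -u | tr "\t" "\n"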
If the IDs are not the same, and you're only interested in the sequences themselves, you could get those with sed:
cat file1.fa file2.fa | sed -n '2~2p' | sort | uniq
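Note that 2~2p is a GNU sed extension and assumes every record is exactly two lines (header, then one sequence line). If the sequences may be wrapped, an awk sketch that joins each record first (again assuming headers start with >):
awk '!/^>/{seq=seq $0; next} {if(seq)print seq; seq=""} END{if(seq)print seq}' file1.fa file2.fa | sort -u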
Hi Matt, your first command line could be simpler:
awk '{printf (/^>/) ? $0"\t" : $0"\n"}' file1.fa file2.fa | sort -u | tr "\t" "\n"
You can use a conditional expression to shorten the awk command; sort has a -u option to remove duplicates. In your second example, you can use awk to select only the sequences:
awk '! /^>/' file1.fa file2.fa | sort -u
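For example, with two toy input files (made-up names and sequences), the shortened pipeline behaves like this:
$ cat file1.fa
>seq1
ACGT
>seq2
GGCC
$ cat file2.fa
>seq1
ACGT
>seq3
TTAA
$ awk '{printf (/^>/) ? $0"\t" : $0"\n"}' file1.fa file2.fa | sort -u | tr "\t" "\n"
>seq1
ACGT
>seq2
GGCC
>seq3
TTAA
The duplicated >seq1 record appears only once in the output.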
Try CD-HIT or USEARCH. See the similar question: Generating a non-redundant gene set
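For nucleotide reads, a CD-HIT run might look like this (a sketch with made-up file names; -c 1.00 clusters at 100% identity, so exact duplicates and contained subsequences are collapsed, and -n 10 is the word size recommended for high identity thresholds):
cat file1.fa file2.fa > combined.fa
cd-hit-est -i combined.fa -o nonredundant.fa -c 1.00 -n 10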
Not sure if you are looking for tools to remove duplicate lines; if so, this can be done with vim: in command mode, run :sort to bring duplicates together, then run :g/^\(.*\)$\n\1$/d to delete them (note the escaped parentheses, which vim's default regex syntax requires).
This is not awk, but can be used for the same purpose. See FASTQ/A Collapser.
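A possible invocation (a sketch with made-up file names, using the FASTX-Toolkit -i/-o input and output options):
cat file1.fa file2.fa > combined.fa
fastx_collapser -i combined.fa -o collapsed.fa
Be aware that the collapser replaces the original headers with new IDs that encode the copy number of each unique sequence, so the original read names are lost.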
Can I encourage users not to answer questions which fail the "what have you tried" test?
Seems the answer is no, I cannot :)