Hi people,
I have two FASTA files with redundant reads. I want a single file with all the reads but without the redundancy, using awk. Can someone help me?
If the IDs of the sequences (the bit after the >) are the same for identical sequences, you could do something like this:
cat file1.fa file2.fa | awk '{if($1 ~ /^>/){name=$1}else{print name"\t"$1}}' | sort | uniq | awk '{print $1"\n"$2}'
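That assumes each sequence sits on a single line. If the sequences are wrapped over several lines, a variant of the same idea that first joins each record onto one line might look like this (a sketch; it assumes headers start with > and that neither headers nor sequences contain tabs):
cat file1.fa file2.fa | awk '/^>/{if(seq)print name"\t"seq; name=$0; seq=""; next}{seq=seq $0} END{if(seq)print name"\t"seq}' | sort -u | tr "\t" "\n"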
If the IDs are not the same, and you're only interested in the sequences themselves, you could get those with sed:
cat file1.fa file2.fa | sed -n '2~2p' | sort | uniq
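Note that 2~2p is a GNU sed extension and assumes every record is exactly two lines (header, then one sequence line). If the sequences may be wrapped, an awk sketch that joins each record first (again assuming headers start with >):
awk '!/^>/{seq=seq $0; next} {if(seq)print seq; seq=""} END{if(seq)print seq}' file1.fa file2.fa | sort -u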
Hi Matt, your first command line could be simpler:
awk '{printf (/^>/) ? $0"\t" : $0"\n"}' file1.fa file2.fa | sort -u | tr "\t" "\n"
You can use a conditional expression to shorten the awk command; sort has a -u option to remove duplicates. In your second example, you can use awk to select only the sequences:
awk '! /^>/' file1.fa file2.fa | sort -u
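For example, with two toy input files (made-up names and sequences), the shortened pipeline behaves like this:
$ cat file1.fa
>seq1
ACGT
>seq2
GGCC
$ cat file2.fa
>seq1
ACGT
>seq3
TTAA
$ awk '{printf (/^>/) ? $0"\t" : $0"\n"}' file1.fa file2.fa | sort -u | tr "\t" "\n"
>seq1
ACGT
>seq2
GGCC
>seq3
TTAA
The duplicated >seq1 record appears only once in the output.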
Try CD-HIT or USEARCH. See the similar question: Generating a non-redundant gene set
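For nucleotide reads, a CD-HIT run might look like this (a sketch with made-up file names; -c 1.00 clusters at 100% identity, so exact duplicates and contained subsequences are collapsed, and -n 10 is the word size recommended for high identity thresholds):
cat file1.fa file2.fa > combined.fa
cd-hit-est -i combined.fa -o nonredundant.fa -c 1.00 -n 10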
Not sure if you are looking for tools to remove duplicate lines; if so, this can be done with vim: in command mode, run :sort to bring duplicates together, then run :g/^\(.*\)$\n\1$/d to delete them (note the escaped parentheses, which vim's default regex syntax requires).
This is not awk, but can be used for the same purpose. See FASTQ/A Collapser.
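A possible invocation (a sketch with made-up file names, using the FASTX-Toolkit -i/-o input and output options):
cat file1.fa file2.fa > combined.fa
fastx_collapser -i combined.fa -o collapsed.fa
Be aware that the collapser replaces the original headers with new IDs that encode the copy number of each unique sequence, so the original read names are lost.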
Can I encourage users not to answer questions which fail the "what have you tried" test?
Seems the answer is no, I cannot :)