awk -f linearizefasta.awk < input.fa
or
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa
To convert back to FASTA (assuming the linearized output was saved as linearized.tsv):
tr "\t" "\n" < linearized.tsv
If you know your FASTA headers are all shorter than 60 characters, you can also re-wrap the sequence lines to 60 columns:
tr "\t" "\n" < linearized.tsv | fold -w 60
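As a rough illustration of the round trip, here is a minimal sketch on a made-up two-record file. The records and the output name rewrapped.fa are assumptions for the example; the awk and tr/fold commands are the ones given above.

```shell
# Hypothetical sample: one 70-base record stored as seven 10-base lines,
# plus a short second record.
cat > input.fa <<'EOF'
>seq1
ACGTACGTAC
ACGTACGTAC
ACGTACGTAC
ACGTACGTAC
ACGTACGTAC
ACGTACGTAC
ACGTACGTAC
>seq2
TTTT
EOF

# Linearize: one record per line, header and sequence separated by a tab.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa > linearized.tsv

# Convert back to FASTA and re-wrap to 60 columns.
# Headers here are shorter than 60 characters, so fold leaves them intact.
tr "\t" "\n" < linearized.tsv | fold -w 60 > rewrapped.fa
```

The linearized file has one line per record, and the re-wrapped file stores seq1 as a 60-base line followed by a 10-base line.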
Hi, many thanks for this, it worked like a charm! If I could ask for your help just one more time: the data is a little bit noisy, and the script appears to be retaining reads that do not contain this sequence. Would you happen to know how to modify the above script so that it discards any reads that do not contain the specified sequence? Many thanks again, zack
When you linearize, you could prefix the input to the sed command with a grep for atgacccg. That way only sequences matching atgacccg are retained and processed by sed.
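A minimal sketch of that filtering step, under a few assumptions: the sample reads and the output name filtered.fa are made up, and the -i flag (case-insensitive match) is my addition in case the reads are upper case. The sed command from the earlier answer is not reproduced here, so the sketch stops at the grep; sed would go into the pipeline right after it.

```shell
# Hypothetical sample: read1 contains atgacccg split across two lines,
# read2 does not contain it at all.
printf '>read1\nccatgacc\ncgtt\n>read2\nggggcccc\ntttt\n' > input.fa

# 1. Linearize, so a motif spanning a line break becomes contiguous.
# 2. Keep only the records that contain the motif.
# 3. Convert the surviving records back to FASTA.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa |
  grep -i 'atgacccg' |
  tr "\t" "\n" > filtered.fa
```

Note that grep looks at the whole linearized line, so a header that happens to contain the motif would also match. If that is a concern, an awk test on the second tab-separated field only would be one alternative.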