How to extract FASTA records whose header contains no "_" into a new file using AWK
9.0 years ago
usamaabid3 • 0
cat myseq.fasta | grep '>' | head -n 5

>PZ7180000000004_TX nReads=26 cov=9.436
>PZ7180000031590 nReads=3 cov=2.59465
>PZ7180000027934 nReads=5 cov=2.32231
>PZ456916 nReads=1 cov=1
>PZ7180000037718 nReads=9 cov=6.26448

The IDs are unique identifiers; those that have a _ suffix have been grouped by sequence similarity based on BLAST results. For example, PZ7180000000447_A and PZ15399_A are in the same group, while PZ7180000000079_AF and PZ5729_AF are together in a different group. Those without any _ suffix had no homology to other sequences, and so weren't put in a group.

I want to put the non-grouped sequences (IDs without a _) in a file named no_suffix.fasta.

I want to do all of this using awk, grep and sed. Please help.

sequence awk fasta • 3.5k views

Why only awk, sed and grep? Is this an assignment?

No, I have studied those, and I also understand bioawk. I tried Python, but it didn't work for me; it was tough :(

It might be a good idea to use bioawk then. You should not have to design yet another parser for an ad hoc task.

9.0 years ago

Linearize the FASTA (see here), filter out the sequences containing a _ using grep -v '_', convert back to FASTA, and redirect to no_suffix.fasta.
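
As a rough sketch of those three steps (not necessarily the exact commands behind the "see here" link): the first awk joins each record onto a single line as header, tab, sequence; grep -v '_' drops the grouped records; tr turns the tab back into a newline. The file demo.fasta and its contents are made up for illustration, so substitute your own input. This assumes that neither headers nor sequences contain tab characters.

```shell
# Toy input (hypothetical data) with wrapped sequence lines.
cat > demo.fasta <<'EOF'
>PZ7180000000004_TX nReads=26 cov=9.436
ACGTACGT
ACGT
>PZ7180000031590 nReads=3 cov=2.59465
GGGTTT
CCC
>PZ456916 nReads=1 cov=1
TTTT
EOF

# 1. Linearize: one record per line, "header<TAB>sequence".
# 2. Drop every record whose line contains an underscore.
# 3. Turn the tab back into a newline to restore the FASTA layout.
awk '/^>/ { if (n++) printf "\n"; printf "%s\t", $0; next }
     { printf "%s", $0 }
     END { if (n) printf "\n" }' demo.fasta |
  grep -v '_' |
  tr '\t' '\n' > no_suffix.fasta
```

Each sequence in no_suffix.fasta ends up on a single line, since the original line wrapping is not restored.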

Why linearize, Pierre?

9.0 years ago
$ pip install pyfaidx
$ faidx --regex '^[^_]+$' input.fasta > output.fasta
9.0 years ago
Ram 44k

This is simple text processing, not bioinformatics, but I'll let the other mods decide if this question should remain open. Essentially, you can use bioawk to parse the FASTA. If you do not wish to use any intelligent parser (which is a bad decision), use grep -E with '^>[^_]+$' (anchored at both ends, so a header with a _ anywhere is rejected) to get the relevant headers, and use them to get your sequences.

Or, use an existing parser so you can compare the header and copy over the sequence in a single step without overly complicating your command.
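
For completeness, if you do stay with plain awk, the header check and the copy can still be done in one pass. This is only a sketch: it assumes the ID is the first whitespace-delimited word of the header line, and demo.fasta is a made-up toy file, so substitute your own input.

```shell
# Toy input (hypothetical data) with wrapped sequence lines.
cat > demo.fasta <<'EOF'
>PZ7180000000004_TX nReads=26 cov=9.436
ACGTACGT
ACGT
>PZ7180000031590 nReads=3 cov=2.59465
GGGTTT
CCC
>PZ456916 nReads=1 cov=1
TTTT
EOF

# On each header line, set keep to 1 if the ID ($1) has no underscore, else 0.
# The bare pattern "keep" then prints the header and every sequence line that
# follows it, until the next header resets the flag.
awk '/^>/ { keep = ($1 !~ /_/) } keep' demo.fasta > no_suffix.fasta
```

Because the flag stays set until the next header, this also works on multi-line FASTA without linearizing first, and the original line wrapping is preserved.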

9.0 years ago
Daniel ★ 4.0k

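I think this is much simpler than the above, using grep's -A flag ("number of lines after match"). If your FASTA is already in two-line format (no newlines within the sequence), match the header lines and print one line after each. Note that grep -v '_' cannot be used for the second file, because the sequence lines also lack a _ and would all be selected. --no-group-separator is a GNU grep option that suppresses the "--" lines grep prints between non-adjacent matches:

grep -A 1 --no-group-separator '^>.*_' myfile.fasta > groups_only.fasta
grep -A 1 --no-group-separator '^>[^_]*$' myfile.fasta > no_groups.fasta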

If you need to convert a multi-line fasta into a single-line fasta, I use this simple script:

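#!/usr/bin/perl
use strict;
use warnings;

# Join wrapped sequence lines: one header line, then one sequence line per record.
my $seen_header = 0;
while (my $line = <>) {
        chomp $line;
        next if $line =~ /^\s*$/;                # skip blank lines
        if ($line =~ /^>/) {
                print "\n" if $seen_header;      # terminate the previous sequence
                print "$line\n";
                $seen_header = 1;
        } else {
                print $line;                     # append to the current sequence
        }
}
print "\n" if $seen_header;                      # terminate the last sequence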