Tool:Bioawk - Fasta, Fastq, Sam, Bed, Gff Aware Awk Programming

12.4 years ago

Istvan Albert 101k

Bioawk is an extension to Brian Kernighan's awk, created by Heng Li, that adds support for several common biological data formats, including optionally gzipped BED, GFF, SAM, VCF, and FASTA/Q, as well as generic TAB-delimited formats with column names.

Code

The source code is available on the bioawk GitHub page. Users need to download it and run make to compile it. The examples below assume that this version of awk is the one being invoked as awk.

Documentation

There is a short manual page in the main distribution and a longer HTML-formatted help page.

Examples

Extract unmapped reads without header:

awk -c sam 'and($flag,4)' aln.sam.gz

Extract mapped reads with header:

awk -c sam -H '!and($flag,4)' aln.sam.gz
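
The and($flag,4) test checks whether bit 4 (read unmapped) is set in the SAM FLAG field. As a rough illustration of what that test does, here is a minimal sketch in plain POSIX awk, which has no and() function, using integer division instead. The names sam_with_bit and has_bit are hypothetical, and the input is assumed to be uncompressed SAM text on stdin with FLAG in column 2.

```shell
# Hypothetical helper: print SAM records whose FLAG (column 2) has a given bit set.
# Usage: sam_with_bit 4 < aln.sam    (4 = read unmapped, 16 = reverse strand)
sam_with_bit() {
  awk -F'\t' -v mask="$1" '
    # has_bit: true when the power-of-two value "bit" is set in "flag"
    function has_bit(flag, bit) { return int(flag / bit) % 2 == 1 }
    # Skip header lines (starting with @) and print matching records
    !/^@/ && has_bit($2, mask)
  '
}
```

This only works for a single power-of-two mask; bioawk's and() is the more general tool when it is available.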

Reverse complement FASTA:

awk -c fastx '{ print ">"$name;print revcomp($seq) }' seq.fa.gz

Create FASTA from SAM (uses revcomp if FLAG & 16):

samtools view aln.bam | \
  awk -c sam '{ s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}'

Get the GC content (reported as a fraction between 0 and 1) from FASTA:

awk -c fastx '{ print ">"$name; print gc($seq) }' seq.fa.gz

Get the mean Phred quality score from FASTQ:

awk -c fastx '{ print ">"$name; print meanqual($qual) }' seq.fq.gz

Take column names from the first line (here "age" appears as a column name in the first line of input.txt):

awk -c header '{ print $age }' input.txt

ADD COMMENT • link updated 17 months ago by Ram 44k • written 12.4 years ago by Istvan Albert 101k

It should be noted that gc($seq) doesn't exclude Ns from the calculation, so ACGTNNNNACGT results in 0.333333.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by Biomonika (Noolean) 3.2k

I wonder what the acceptable answer is for this case. One could ignore the Ns, count them as 1/4, or, as bioawk does here, count them all.
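
For the first option (ignoring Ns), a minimal sketch in plain awk follows. It is not part of bioawk: the helper name gc_no_n is hypothetical, a DNA alphabet is assumed, and the input is assumed to be "name<TAB>sequence" lines on stdin.

```shell
# Hypothetical helper: GC fraction over A/C/G/T only, ignoring N and any other symbol.
# Reads "name<TAB>sequence" lines on stdin, e.g. produced by:
#   awk -c fastx '{ print $name "\t" $seq }' seq.fa.gz
gc_no_n() {
  awk -F'\t' '{
    s = toupper($2)
    gc = gsub(/[GC]/, "", s)      # gsub returns the number of replacements
    at = gsub(/[AT]/, "", s)
    # Report NA when the sequence has no unambiguous bases at all
    if (gc + at > 0) printf "%s\t%.6f\n", $1, gc / (gc + at)
    else             printf "%s\tNA\n", $1
  }'
}
```

With this definition ACGTNNNNACGT gives 0.500000 instead of 0.333333, because the four Ns no longer count toward the denominator.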

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by Istvan Albert 101k

I was just going to post this. :)

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

Indeed, the previous discussion made us all realize what a good fit it is for this section.

ADD REPLY • link 12.4 years ago by Istvan Albert 101k

I used bioawk to calculate the mean quality score of a FASTQ file, and it gave me one mean per read. How can I calculate the overall mean quality from the output of bioawk?
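
One possible approach, sketched here under assumptions: print each read's mean quality together with its length, then aggregate with a second plain awk step. The helper name overall_meanqual is hypothetical, and its input is assumed to be "meanqual<TAB>length" lines on stdin.

```shell
# Hypothetical helper: combine per-read mean qualities into overall means.
# Reads "meanqual<TAB>length" lines on stdin, e.g. produced by:
#   awk -c fastx '{ print meanqual($qual) "\t" length($seq) }' seq.fq.gz
overall_meanqual() {
  awk -F'\t' '
    { sum += $1; n++; wsum += $1 * $2; bases += $2 }
    END {
      # per_read_mean: every read counts equally
      # per_base_mean: every base counts equally (reads weighted by length)
      if (n > 0 && bases > 0)
        printf "per_read_mean\t%.4f\nper_base_mean\t%.4f\n", sum / n, wsum / bases
    }'
}
```

The two numbers differ whenever read lengths vary, so which one counts as the "overall" mean depends on whether reads or bases are the unit of interest.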

ADD REPLY • link 7.3 years ago by reza ▴ 300

Questions need to be asked separately as a new entry and not as a comment or answer to a post.

ADD REPLY • link 7.3 years ago by Istvan Albert 101k