Bioawk is an extension to Brian Kernighan's awk, created by Heng Li, that adds support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, and FASTA/Q, as well as generic TAB-delimited formats with column names.
Code
The source code is available on the bioawk GitHub page. Users need to download it and run make to compile it. The examples below assume that this version of awk is being used.
Documentation
There is a short manual page in the main distribution and a longer HTML-formatted help page.
Examples
Extract unmapped reads without header:
awk -c sam 'and($flag,4)' aln.sam.gz
Extract mapped reads with header:
awk -c sam -H '!and($flag,4)' aln.sam.gz
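For readers who want to see what the flag test does, here is a minimal sketch in plain POSIX awk (no bioawk required). It assumes uncompressed SAM text on stdin, with FLAG in column 2. The helper name `unmapped_reads` is hypothetical and is not part of bioawk. Because POSIX awk has no `and()` function, the 0x4 bit is tested with integer division.

```shell
# unmapped_reads: hypothetical helper name; reads uncompressed SAM text on stdin.
unmapped_reads() {
  awk -F'\t' '
    /^@/ { next }            # skip header lines
    int($2 / 4) % 2 == 1     # FLAG is column 2; true when the 0x4 (unmapped) bit is set
  '
}

# Usage sketch (assumes samtools is installed and aln.bam exists):
# samtools view -h aln.bam | unmapped_reads
```

The bioawk versions above are shorter and read gzip'ed input directly; this sketch only illustrates the bit test.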
Reverse complement FASTA:
awk -c fastx '{ print ">"$name;print revcomp($seq) }' seq.fa.gz
Create FASTA from SAM (uses revcomp if FLAG & 16):
samtools view aln.bam | \
awk -c sam '{ s=$seq; if(and($flag, 16)) {s=revcomp($seq) } print ">"$qname"\n"s}'
Get the GC content (as a fraction, not a percentage) from FASTA:
awk -c fastx '{ print ">"$name; print gc($seq) }' seq.fa.gz
Get the mean Phred quality score from FASTQ:
awk -c fastx '{ print ">"$name; print meanqual($qual) }' seq.fq.gz
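As a rough illustration of what a mean Phred score per read involves, here is a sketch in plain POSIX awk. It assumes Phred+33 encoded quality strings, one per line on stdin. That encoding is an assumption about the input, and the code is not a copy of bioawk's implementation. The helper name `mean_phred` is hypothetical.

```shell
# mean_phred: hypothetical helper; expects one Phred+33 quality string per line.
mean_phred() {
  awk '
    BEGIN {
      # build a character -> ASCII code lookup for the printable range
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    {
      total = 0
      n = length($0)
      for (i = 1; i <= n; i++) total += ord[substr($0, i, 1)] - 33   # assumed +33 offset
      if (n > 0) printf "%.2f\n", total / n                          # one mean per line
    }
  '
}

printf 'IIII\n5555\n' | mean_phred
```

The sketch prints one value per quality string. Summing `total` and `n` across records and dividing in an `END` block would instead give a single per-base mean for the whole file.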
Take column name from the first line (where "age" appears in the first line of input.txt):
awk -c header '{ print $age }' input.txt
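The column-name lookup can be approximated in plain POSIX awk, which may help on machines without bioawk. This is a sketch under stated assumptions: the input is TAB-delimited, the first line holds the column names, and the header line itself is not printed (bioawk's handling of that line may differ). The helper name `print_column` is hypothetical.

```shell
# print_column NAME: hypothetical helper; prints the TAB-delimited column whose header equals NAME.
print_column() {
  awk -F'\t' -v col="$1" '
    NR == 1 {                          # first line holds the column names
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      next                             # this sketch does not print the header itself
    }
    idx { print $idx }                 # prints nothing if the name was not found
  '
}

# Usage sketch: print_column age < input.txt
```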
It should be noted that gc($seq) doesn't exclude Ns from the calculation, so ACGTNNNNACGT results in 0.333333. I wonder what the acceptable answer is for this case. One could ignore the Ns or count them as 1/4, or, as bioawk does here, count them all.
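To make the difference concrete, here is a small sketch in plain POSIX awk (not bioawk) that prints the GC fraction under two of these conventions: Ns kept in the denominator, and Ns ignored. The helper name `gc_variants` is hypothetical, and the input is assumed to be one sequence per line.

```shell
# gc_variants: hypothetical helper; reads one sequence per line and prints
# two GC fractions separated by a TAB: Ns counted in the length, then Ns ignored.
gc_variants() {
  awk '
    {
      seq = toupper($0)
      len = length(seq)
      tmp = seq; gc = gsub(/[GC]/, "", tmp)          # gsub returns the number of G/C bases
      tmp = seq; n  = gsub(/N/, "", tmp)             # number of N bases
      with_n    = (len > 0)     ? gc / len       : 0 # Ns kept in the denominator
      without_n = (len - n > 0) ? gc / (len - n) : 0 # Ns ignored
      printf "%.6f\t%.6f\n", with_n, without_n
    }
  '
}

printf 'ACGTNNNNACGT\n' | gc_variants
```

For ACGTNNNNACGT this gives 4/12 with the Ns counted and 4/8 with the Ns ignored, so the choice of convention changes the result noticeably for N-rich sequences.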
I was just going to post this. :)
Indeed, the previous discussion made us all realize what a good fit it is for this section.
I used bioawk to calculate the mean quality score of a FASTQ file and it gave me one mean per read. Now, how can I calculate the overall mean quality using the output of bioawk?
Questions need to be asked separately as a new entry and not as a comment or answer to a post.