Although countless efficient tools exist, for simple tasks it is sometimes still quicker to run a single Linux command. I'll start by sharing three that I use quite often.
## 1 Get the sequence length distribution from a fastq file using awk
zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'
## 2 Reverse complement a sequence (I use that a lot when I need to design primers)
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
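As a quick check of what the command above produces, here is a minimal sketch that wraps it in a shell function. The name `revcomp` is hypothetical, and the extended `tr` sets (lowercase bases) are an addition that is not part of the original one-liner.

```shell
# Hypothetical helper: reverse complement a sequence given as the first argument.
# Characters outside ACGT/acgt (such as N) are passed through unchanged by tr.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp 'ATTGCTATGCTNNNT'   # prints ANNNAGCATAGCAAT
revcomp 'attgc'             # prints gcaat
```

Ambiguity codes other than N (R, Y, K, M and so on) are not complemented here; they would need extra characters in both `tr` sets.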
## 3 Split a multifasta file into single ones with csplit
csplit -z -q -n 4 -f sequence_ sequences.fasta '/>/' '{*}'
I may be wrong, but I haven't found such a list on Biostars.
So, what comes to your mind? I hope this post will yield some gold nuggets ;-)
I still cannot help but find the existence of a list of 'favourite one-liners' a bit paradoxical.
If you use them routinely for your day-to-day tasks, maybe it is time to 'graduate' them to a utility script, which may be easier to maintain and might benefit others?
One-liners are like old-time Lego(tm) bricks: you can build lots of fun and different stuff with them.
If you pack them into tools, they become like current-day Legos: you attach a head, two arms and two legs to a body and voilà, you have "constructed" a robot.
written 9.7 years ago by h.mon, updated 2.0 years ago by Ram
srand() sets the seed for the random number generator. Passing it a fixed value keeps the subsampling the same when the script is run multiple times (with no argument, awk seeds from the current time). 0.01 is the fraction of reads to output, i.e. 1%.
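The command this comment refers to is not shown in the thread, so the following is only a sketch of how such a subsampling one-liner might look, not the original. The function name `subsample_fastq`, its arguments, and the seed value 42 are all assumptions made for illustration.

```shell
# Hypothetical helper: keep each FASTQ record with a given probability.
#   $1 = input FASTQ (uncompressed, 4 lines per record)
#   $2 = fraction of reads to keep, e.g. 0.01
#   $3 = seed passed to srand() so repeated runs give the same subsample
subsample_fastq() {
    # paste joins every 4 lines into one tab-separated line (one per record),
    # awk prints a record when rand() falls below the fraction,
    # tr turns the tabs back into newlines to restore the 4-line layout.
    paste - - - - < "$1" \
      | awk -v frac="$2" -v seed="$3" 'BEGIN { srand(seed) } rand() < frac' \
      | tr '\t' '\n'
}

# Example call (reads.fq is a placeholder file name):
#   subsample_fastq reads.fq 0.01 42 > subsample.fq
```

This relies on no tab characters appearing inside the records, and the number of reads kept is only approximately the requested fraction because each read is drawn independently.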
Not exactly a one-liner, but I find this a very useful way to make files immutable, and most people probably won't have come across it. An immutable file cannot be modified or written to, deleted, renamed, or moved - even by the superuser. When I receive data or download it, this is the first thing I do, and it's saved a lot of heartache over the years.
sudo chattr +i file.fq # make the file immutable ("archive" it)
sudo chattr -i file.fq # remove the immutable flag again
There are quite a lot of useful one-liners. This is just a drop in the ocean :).
## Number of reads in a fastq file
echo $(( $(wc -l < file.fq) / 4 ))
## Single-line fasta file to multi-line fasta with 60 characters per line
awk -v FS= '/^>/{print;next}{for (i=0;i<NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file
## Sequence length of every entry in a multifasta file
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
Running fastqc for all the fastq files in multiple sample folders in parallel mode.
for i in Sample_*/*.fastq.gz ; do echo echo fastqc $i\|qsub -cwd; done # only prints the commands, for checking
for i in Sample_*/*.fastq.gz ; do echo echo fastqc $i\|qsub -cwd; done|sh # pipes them to sh, which launches the jobs on the cluster
Parallelize time-consuming processes on Unix systems
Running mpileup on a whole genome can take forever. Handling each chromosome separately and running the jobs in parallel on several cores will speed up your pipeline. You can do this easily with xargs.
Example usage of xargs (-P is the number of parallel processes started - don't use more than the number of cores you have available):
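The example itself is missing from the thread, so here is a minimal sketch of the pattern rather than the original command. The chromosome names, the temporary output directory, and the `echo` that stands in for the real work are all assumptions; in practice the `sh -c` body would be the slow per-chromosome step, for instance a `samtools mpileup -r` call restricted to one chromosome.

```shell
# Placeholder output location for this sketch.
outdir=$(mktemp -d)

# One chromosome name per input line; xargs starts up to 2 jobs at a time (-P 2)
# and substitutes each name for {} (-I {}). Inside sh -c, $1 is the chromosome
# and $2 is the output directory.
printf '%s\n' chr1 chr2 chr3 chr4 \
  | xargs -P 2 -I {} sh -c 'echo "pileup for $1" > "$2/$1.txt"' _ {} "$outdir"

ls "$outdir"
```

Writing one output file per chromosome avoids interleaved output from the parallel jobs; the per-chromosome files can be concatenated afterwards.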
Paging umer.zeeshan.ijaz
I remember I saw this: Useful Bash Commands To Handle Fasta Files