9.6 years ago
jon.brate
I have many files with single-gene alignments in FASTA format, and many of the sequences in them consist of gaps only. How should I go about removing these?
Thanks, Jon
How?
Not obvious after RTFM of bioawk
Can you post some example data and the expected output? @gbl1
Minimal input:
I think there is a formatting problem in this post: I see only Ns in the first sequences and I do not see any gaps. What is the requirement? Based on your input and output, try the following using seqkit:
PS: note that the input is not clear. This script works on the example data posted here, and its output matches the example output. It can be improved once you give some feedback. The script looks for N's and does an inverse grep.
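For readers without seqkit, the same idea (find records whose sequence is nothing but N's or gap characters, then keep everything else) can be sketched in plain Python. This is not the seqkit command from the post above. The names `read_fasta`, `is_gap_only`, `drop_gap_only` and the `GAP_CHARS` set are illustrative assumptions, and treating `N` as a gap only makes sense for nucleotide alignments.

```python
import io

# Assumption: characters that count as "gap"; adjust for your data.
# Treating N/n as a gap is only appropriate for nucleotide alignments.
GAP_CHARS = set("-.?Nn")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def is_gap_only(seq, gap_chars=GAP_CHARS):
    """True if every character is a gap character (also true for an empty sequence)."""
    return all(ch in gap_chars for ch in seq)


def drop_gap_only(handle, gap_chars=GAP_CHARS):
    """Return the records whose sequence contains at least one non-gap character."""
    return [(h, s) for h, s in read_fasta(handle) if not is_gap_only(s, gap_chars)]


# Small in-memory example; seq1 and seq3 are gap-only and get dropped.
example = io.StringIO(">seq1\n----\n----\n>seq2\nAC-T\nNNGA\n>seq3\nNNNN\n")
for header, seq in drop_gap_only(example):
    print(f">{header}\n{seq}")
```

To process many alignment files, the same function could be called in a loop over opened files, writing the kept records to a new file per alignment.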
Indeed, a formatting issue...
So, I tried it and got: bash: seqkit: command not found
Above, I linked seqkit to its GitHub page, where a GNU/Linux binary is available for download. If you have conda or brew installed, try
conda install seqkit
or
brew install seqkit
Actually, I've got another way… I need to ask the computer service… I do not have admin rights on my university computer.
You don't need admin rights to install tools using conda.
Just to add to this: admin rights are not needed to install conda or brew; both give you ways to bypass that requirement. Once the tools themselves are installed, they (at least brew does this) work to ensure you don't use root access.
seqkit doesn't need installation. Download the binary, keep it somewhere in your home directory, and add that directory to your PATH. In the meantime, try this awk script (modified from sayuj.koyyappurath's script in https://www.biostars.org/p/9262/):
output: