I am trying to parse a tabular text file generated by blastp using awk. Previously I used this somewhat ugly code, because it worked, to go to the right columns and cull out values below my thresholds.
#!/bin/bash
#$ -cwd
#$ -pe mpi 16
awk '$4 > 80.0' blastoutput.txt > StepOne.txt
awk '$5 > 70.0' StepOne.txt > Culled.txt
Using it on a new blast result, however, the file size stays at about 300k kb, with only a slight decrease after step one and none after step two. My best guess is that awk is recognizing the whole blast output file as a single line and therefore not removing anything more. I would suspect Unix/Windows line endings not being recognized, as I saw in other answers, but I haven't changed the way I generate the blast results and it was working before, so I don't know why the tabular results would suddenly be written differently.
I've also tried using some parsing options I saw in other answers like the following:
perl -lane 'print $_ if ($F[4] >80.0)' blastp_output_8_26.txt > StepOne.txt
but the results seem to be the same.
Does anyone know what I could do to the blastp output file to make it work with my code? I am convinced something is amiss there, but all my attempts to fix it so far have been for naught.
Thanks.
Does it work if you use $4 > 80 instead of $4 > 80.0? I know it makes no sense, but just try it out once. Also, can you compare an old blast result and a new one side by side and make sure the columns line up?
I tried that (changing to just 80), and it didn't seem to make any difference.
Well, I have only two to compare so far. They were generated with different column options but the same output format 6. To view them I opened each text file in Chrome, and apart from having different columns they don't look any different to me. Each has tabs separating the columns. Perhaps you are right, though, and changing the output columns somehow introduced an issue with the line endings?
Text files should be looked at in a text editor, not Chrome. What is the output of the following command for each of the two files:
The first file I don't have access to at the moment, but I will try it as soon as I can. For the file that doesn't work, here is the result:
This is properly tab separated, all right. It should work.
EDIT: Works for me:
(Nothing over 80)
(3 rows over 40)
Your script indeed seems to work (I still need to test it on the full file), but I'm not 100% sure what your code does. It looks to me like you are replacing ASCII tabs with Linux tabs? So I suppose if I were to run this together, I only need to run the sed command once, since the tabs will be proper now.
Don't worry about the sed - it exists to remove the symbols introduced by cat -te. It's the awk that's relevant.
First, the default column order for blast tabular output (outfmt 6) is [qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore]. I am not sure what you are filtering on, but awk numbers fields starting at 1, so if your file has the standard columns, you are filtering by length and then by mismatch. You can combine the filters too:
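A minimal sketch of a combined filter, assuming tab-separated input and the same column numbers and thresholds as in the question. The file names and sample rows here are hypothetical, made up only so the snippet runs on its own:

```shell
# Hypothetical 5-column, tab-separated sample (not real blast output).
printf 'qA\tsA\tx\t95.0\t88.0\n' >  sample_blast.txt
printf 'qB\tsB\tx\t95.0\t60.0\n' >> sample_blast.txt
printf 'qC\tsC\tx\t50.0\t90.0\n' >> sample_blast.txt
printf 'qD\tsD\tx\t81.5\t70.5\n' >> sample_blast.txt

# One pass instead of two: keep a row only if column 4 > 80 AND column 5 > 70.
# -F'\t' tells awk to split fields on tabs only.
awk -F'\t' '$4 > 80.0 && $5 > 70.0' sample_blast.txt > sample_culled.txt

cat sample_culled.txt
```

Doing both tests in one awk call avoids the intermediate StepOne.txt file and reads the input only once.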
My output format is this:
Then you need to check whether your output file is corrupted or contains stray characters such as \r. There is nothing wrong with your filtering.
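One way to check for carriage returns and strip them, as a rough sketch. The file names are hypothetical and the sample is generated with Windows-style line endings on purpose; substitute the real blast output file:

```shell
# Hypothetical sample written with CRLF line endings, for illustration only.
printf 'qA\tsA\t95.0\t88.0\r\nqB\tsB\t50.0\t90.0\r\n' > crlf_sample.txt

# Count the lines that contain a carriage return; 0 means none were found.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" crlf_sample.txt)
echo "lines with CR before cleaning: $cr_lines"

# Write a cleaned copy with every carriage return removed.
tr -d '\r' < crlf_sample.txt > crlf_sample_clean.txt

# grep -c exits non-zero when the count is 0, hence the "|| true".
clean_cr_lines=$(grep -c "$cr" crlf_sample_clean.txt || true)
echo "lines with CR after cleaning: $clean_cr_lines"
```

If the count on the real file is 0, line endings are probably not the cause and the column layout is the next thing to compare.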