I have just completed a blastx on some fasta sequences against the uniprot protein database. I have obtained the BLAST tabular output format 6. This is my 1st time running blastx. Here is an example of my output.
$head blastx_uniprotsprot_spl_m40a001.outfmt6
NS500162:92:H2L22BGXX:1:11112:5797:6378 APEX1_DANRE 42.857 42 22 1 125 6 166 207 0.36 30.4
NS500162:92:H2L22BGXX:1:11112:8897:6378 LIPA2_SYNY3 79.592 49 10 0 150 4 226 274 2.39e-18 78.2
NS500162:92:H2L22BGXX:1:11112:9991:6378 SERC_METBF 68.000 50 16 0 150 1 66 115 1.23e-17 76.6
A few specifics on my data:
My sample: M40A, has a total of 10,913,966 nucleotide sequences in fasta format
and: wc -l M40A_blastx_uniprot.txt = 34,809,040 (raw output of blastx)
Thinking I have a lot of duplicate reads aligning to multiple places in the database, I decided to pull out all unique sequences:
awk '{print$1}' M40A_blastx_uniprot.txt| sort | uniq | wc -l = 33,969,282
The next thing I did was to see if I could grep one of these sequences from my fasta file - which produced zero results!
$grep -A 1 'NS500162:92:H2L22BGXX:1:11112:5797:6378' ../m40c.r1.fa
$grep
Randomly choosing a bunch of these sequence IDs (from blastx column 1 output) and trying to grep the original sequences from my fasta file gives me the same results each time (zero results).
I'm beginning to think my results are wrong, but I thought I would ask anyone here if this is "normal" for a blastx outfmt6 file? My questions are this:
1) How can I have 33 million unique sequence IDs from my blastx output file when I've put in 10 million fasta reads? Is this a standard output - to have this many duplicate hits? For my data, this means each sequence has > than 3 hits.
2) How can I pull out the original sequences by the seqid given in blastx results when grep gives me zero results? Is there a separate script that I am missing that allows for this? Looking into this answer at other websites, other researchers just use standard grep.
Any insight into this rabbit hole I'm suddenly in would be greatly appreciated!
Many Thank-you's!
BLAST doesn't give you a single exact match for a given query. It is a local aligner, so it will have found subregions of your queries that match other things. For some of your queries it might only find one, thus you end up with results > the number of query sequences. We would need to see some example headers and content from your fasta file too to be able to say why the
grep
is failing.jrj.healey:
Thank you very much for your explanation. This makes cogent sense! As for my fasta file: Here is my header for my fa file. I am not exactly sure what you mean by content. I will say that the sequences have been quality filtered with Q > 20 and they range from 20bp - 150pb (due to quality filtering; before filtering they are 150bp).
I think you may need to upload the whole input file somewhere. The
grep
command you provided, and that sample data, will obviously not work, because the strings don't match after the first colon, your sample data has 172 and the original command looks forgrep -A 1 'NS500162:
92... but I would do you the service of presuming that the files with the substring 92 just appear later in the file and not that you're just trying togrep
for the wrong thing.Have you tried searching manually in the file? E.g. open the file with
nano
and look for particular strings withCtrl+W
. This would tell us whether yourgrep
is malformed in some way, or your fasta file is maybe not what you think it is.Unlikely. See this response: C: Can't grep qseqid (column 1) from blastx outfmt6
Yeah it wasn't high on my list of likely culprits. It looked OK by eye anyway.
See (remove
-exclude
): C: How to remove sequences from a fasta file based on ID list?Why is
grep
not working? Have you tried the command without the quotes?Genomax2:
Thank you for your reply. I have also tried the following with grep (all have given me zero results):
By the way, this third grep command is incredibly memory intensive (>10gb) to run on 30million + sequence IDs in a file; where seqIDs.txt is all of my seqIDs from my blastx results (all of column 1) ex:
$head seqIDs.txt
I then switched to seqtk seqtk which I have used numerous times in the past (it's super fast) to pull out fasta/fastq seqs from a list of sequence IDs with this command:
$seqtk subseq m40a.r1.fa seqIds.txt
which only gave me a blank file, as well.My fasta file looks like this:
I am able to get results instantly using your example data and the same
grep
command. Not sure why this is not working for you.Me either :( Yes - I have tested the grep command and get results, all the same. For example, if I were to grep the first seq ID from my .fa file, out of that same .fa file, I get a result. If I were to grep the 1st seqID from the blastx result, from the blastx file, I get a result. It's when I try to grep the first seqID from my blastx result (from column 1) against my .fa file, I get nothing. Quite the paradox.