One possible piped solution to report inexact matches in a genome to fasta formatted queries =
BBmap (from BBtools) | sam2bed (from bedops)
See https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/
Steps and syntax that might be applicable for my case:
Step 1. Build searchable index for reference genome to be scanned for matches
bbmap.sh ref=Reference.fasta
Step 2. Scan reference genome index with
bbmap.sh minid=0.8 idfilter=0.8 secondary=T maxsites=500 in=Query.fasta out=Query_Vs_Ref_BBmap_minid_idfilter_80per_multi.sam -noheader
Explanation of the flags
minid
= is approximate, but can help accelerate BBmap (paraphrasing Brian Bushnell, BBmap's author)
idfilter
= "to filter out alignments with under exactly x% identity" (quoting Brian Bushnell)
secondary
= enables reporting match details for not just the first or best match
maxsites
= provides a ceiling for the number of matching genomic regions for which details will be saved to the output file
noheader
= I didn't want bulkier output files than required, therefore suppressing SAM file format header info
Step 3. Convert output SAM file to BED format
sam2bed < BBmap_out.sam > Coords_file.bed
Step 4. Parse BED format for match coords report
Use awk, Perl or Python script, or BBmap reporting option in step 2 itself
AFAIK, NGS mappers like BBmap were not designed with such a goal in mind, so I've still got to QC my suggested little pipeline with control cases. So this is a placeholder, and I will edit details if and when necessary.
Software versions : BBTools - BBMap version 38.35, bedops - version: 2.4.35 (typical)
Downloaded from : https://sourceforge.net/projects/bbmap/ and https://bedops.readthedocs.io/en/latest/index.html
PS. Thanks to Brian Bushnell for advising use of idfilter
flag during the BBmap step.
Perhaps Brian can render the final verdict on validity of this syntax?! Though I don't see any posts from him in > 1 year now...
note: This is the solution for the problem I described in the now closed thread at BLAST run parameters and parsing advice
I would use bwa mem followed by a pysam script to apply the requested filters.
I've used neither BWA not pysam, so could you share any code snippet, or links to any examples that will help me understand your suggestion correctly and quickly?
Also, am I right in interpreting your response as BWA MEM reporting not just PERFECT matches, but other matches with < 100% identity and < 100% query sequence coverage in step 1
And piping this to PYSAM for the >= 80% filters, as step 2, correct? Will there be some more parsing to report genomic coords as step 3, or will that be taken care of by the PYSAM step itself? Thank you!
Yes. BWA-MEM has myriad parameters which you can tune, including how many mismatches you allow.
Don't think about equivalence between BLAST and NGS alignments, With blast your are looking for local alignments. WIth NGS aligners the alignments are global/glocal due to the small size of the query sequences.
Many NGS alingers can take fasta formatted reads as query (BBMap does).
I have to do the equivalent of
-perc_identity 0.8 -qcov_hsp_perc 0.8
because it has been done in past studies, though strangely using BLAST, not accounting for its local SW algo. I agree with your observation that I need a global aligner that uses the Needleman-Wunsch algo.However, I still need to apply those 2 filters for consistency and comparison with other studies. Problem is I can't find the appropriate equivalent flags in any NGS mapper yet...
But I think,
minid=0.80
in BBmap ~-perc_identity 0.8
in BLASTn, not sure yet. But I still have not found the equivalent of-qcov_hsp_perc 0.8
in BBmap, or in EMBOSS needle or seq-align, yet.I am hoping some forum user may have some syntax pointers about how to apply these 2 filters with some global aligner...