First of all, I want to thank @RamRS and @Pierre Lindenbaum.
You have given me very useful suggestions several times here, thank you.
I ran blastx with my query file (more than 140,000 nucleotide sequences) against my db file (more than 1,400 polypeptide sequences) and got my result, but the generated file is around 1.4 GB. I checked the BLAST result, and it shows that most of the query sequences aligned to at least one of the db sequences (but not all of the hits have a good E-value and score):
Query= comp1896_c0_seq1 len=2039 path=[0:0-259 2272:260-284 285:285-2038]
Length=2039
                                                                    Score   E
Sequences producing significant alignments:                         (Bits)  Value
FBpp0083843 FBgn0028684 symbol:Tbp-1 family:Transcription Cofac...  152     5e-41
FBpp0081704 FBgn0040078 symbol:pont family:Chromatin Remodeling...  44.3    1e-05
FBpp0074756 FBgn0040075 symbol:rept family:Chromatin Remodeling...  33.5    0.037
FBpp0099511 FBgn0004913 symbol:Gnf1 family:Transcription Cofact...  31.2    0.22
Tribolium_TF472                                                     26.9    2.9
Tribolium_TF80                                                      26.9    3.7
I think the first two hits might be the results I want, right? If so, how can I separate those results from the less ideal ones?
kurban@kurban-X550VC:~/Desktop/tf$ blastx -help
USAGE
blastx [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-entrez_query entrez_query]
[-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
[-subject subject_input_file] [-subject_loc range] [-query input_file]
[-out output_file] [-evalue evalue] [-word_size int_value]
[-gapopen open_penalty] [-gapextend extend_penalty]
[-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-max_hsps_per_subject int_value] [-max_intron_length length]
[-seg SEG_options] [-soft_masking soft_masking] [-matrix matrix_name]
[-threshold float_value] [-culling_limit int_value]
[-best_hit_overhang float_value] [-best_hit_score_edge float_value]
[-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
[-strand strand] [-parse_deflines] [-query_gencode int_value]
[-outfmt format] [-show_gis] [-num_descriptions int_value]
[-num_alignments int_value] [-html] [-max_target_seqs num_sequences]
[-num_threads int_value] [-remote] [-comp_based_stats compo]
[-use_sw_tback] [-version]
If it can be done by changing this command:
blastx -query gene.fa -out tf.blastx -db TFs.fasta
how should I change it?
You can set a threshold for the E-value with -evalue (e.g. 1e-3) and/or filter on the bit score. That should leave only the stronger hits for each of your query sequences.
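A minimal sketch of that thresholding, assuming the search is re-run with tabular output (-outfmt 6, which appears in the help listing above) so that each hit is one tab-separated line with the E-value in column 11 and the bit score in column 12. The function name filter_hits, the output file name, the cutoffs, and the demo rows are illustrative only; the columns not shown in the question (identity, coordinates, and so on) are filled with 0 as placeholders.

```python
# Assumed way to produce the input (file name tf.blastx.tsv is hypothetical):
#   blastx -query gene.fa -db TFs.fasta -evalue 1e-3 -outfmt 6 -out tf.blastx.tsv

def filter_hits(lines, max_evalue=1e-3, min_bitscore=0.0):
    """Yield tabular BLAST lines whose E-value and bit score pass the cutoffs."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])    # column 11: E-value
        bitscore = float(fields[11])  # column 12: bit score
        if evalue <= max_evalue and bitscore >= min_bitscore:
            yield line


def row(query, subject, evalue, bits):
    """Build a 12-column demo line; the 8 middle columns are placeholders."""
    return "\t".join([query, subject] + ["0"] * 8 + [evalue, bits])


# Scores and E-values taken from the hit list in the question.
demo = [
    row("comp1896_c0_seq1", "FBpp0083843", "5e-41", "152"),
    row("comp1896_c0_seq1", "FBpp0081704", "1e-05", "44.3"),
    row("comp1896_c0_seq1", "FBpp0074756", "0.037", "33.5"),
    row("comp1896_c0_seq1", "Tribolium_TF472", "2.9", "26.9"),
]

kept = list(filter_hits(demo))
for line in kept:
    print(line)
```

With a real file, the same function could be fed an open file handle instead of the demo list. Passing -evalue to blastx itself also keeps the weaker hits out of the output in the first place, which should shrink the file.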
Mr. RamRS,
Is there any way I can extract the aligned query sequences and their best hits from the generated blastx file? It is still pretty big.
Use the option -num_alignments?
Thanks. But I have more than 140,000 query sequences, and I only want to see the ones that aligned. The result lists every query and then reports either "aligned..." or "no hit was found" for each one. If I could extract only the aligned query sequences and the db sequences they aligned to, it would simplify my job a lot.
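One possible way to get only the aligned queries, sketched under the assumption that the search is re-run with -outfmt 6 (listed in the help output above). In that tabular format, queries with no hits produce no lines at all, so only aligned queries remain. The sketch keeps the highest-scoring hit per query; the function name best_hit_per_query, the second query ID, and the placeholder columns (filled with 0) are made up for illustration.

```python
def best_hit_per_query(lines):
    """Return {query_id: (subject_id, evalue, bitscore)} for the top-scoring hit."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column tabular line
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Keep the hit with the highest bit score seen so far for this query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def row(query, subject, evalue, bits):
    """Build a 12-column demo line; the 8 middle columns are placeholders."""
    return "\t".join([query, subject] + ["0"] * 8 + [evalue, bits])


demo = [
    # Values from the hit list in the question.
    row("comp1896_c0_seq1", "FBpp0081704", "1e-05", "44.3"),
    row("comp1896_c0_seq1", "FBpp0083843", "5e-41", "152"),
    # Hypothetical second query, only to show that queries are kept apart.
    row("hypothetical_query_2", "Tribolium_TF80", "3.7", "26.9"),
]

best = best_hit_per_query(demo)
for query, (subject, evalue, bitscore) in best.items():
    print(query, subject, evalue, bitscore, sep="\t")
```

Combining this with an E-value cutoff would also drop weak best hits such as the hypothetical one above.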