Aligning multiple nucleotide sequences to multiple protein sequences
10.0 years ago
Kurban ▴ 230

I have two FASTA files: one contains nucleotide sequences (more than 100,000) and the other contains polypeptide sequences (more than 1,000). I want to find the nucleotide sequences that can be aligned to the protein sequences in the protein file.
I am new to this, so any suggestions with a bit of detail would be appreciated.

alignment • 3.0k views
10.0 years ago
Ram 44k

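You could create a database with makeblastdb from one file and BLAST the other against it, using blastx or tblastn depending on which file is the database and which is the query (blastx for a nucleotide query against a protein database, tblastn for a protein query against a nucleotide database).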

10.0 years ago

Use BLAST: http://www.ncbi.nlm.nih.gov/books/NBK1763/

Build a database from your protein sequence file with makeblastdb.

Search the new database with blastx ("The 'blastx' application translates a nucleotide query in six frames and searches it against a protein database.").
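
The two steps above could look roughly like the sketch below. The function names `build_protein_db` and `run_blastx` are made up for illustration; the file names are the ones used elsewhere in this thread. It assumes BLAST+ is installed and that the database keeps the default name (the input file name), so adjust as needed.

```shell
# Step 1: index the protein FASTA as a BLAST protein database.
# Without -out, the database is named after the input file.
build_protein_db() {  # $1 = protein FASTA
  makeblastdb -in "$1" -dbtype prot
}

# Step 2: translate each nucleotide query in six frames and
# search it against the protein database.
run_blastx() {  # $1 = nucleotide FASTA, $2 = database name, $3 = output file
  blastx -query "$1" -db "$2" -out "$3"
}
```

For example: `build_protein_db TFs.fasta` followed by `run_blastx gene.fa TFs.fasta tf.blastx`.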

10.0 years ago
Kurban ▴ 230

First of all, I want to thank @RamRS and @Pierre Lindenbaum.

You have given me very useful suggestions several times here, thank you.

I ran blastx with my query file (more than 140,000 nucleotide sequences) against the db file (more than 1,400 polypeptide sequences) and got a result, but the output file is around 1.4 GB. Looking through it, most of the query sequences align to at least one of the db sequences, though not all of them with a significant E-value and a high score:

Query= comp1896_c0_seq1 len=2039 path=[0:0-259 2272:260-284 285:285-2038]

Length=2039
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  FBpp0083843 FBgn0028684 symbol:Tbp-1 family:Transcription Cofac...   152    5e-41
  FBpp0081704 FBgn0040078 symbol:pont family:Chromatin Remodeling...  44.3    1e-05
  FBpp0074756 FBgn0040075 symbol:rept family:Chromatin Remodeling...  33.5    0.037
  FBpp0099511 FBgn0004913 symbol:Gnf1 family:Transcription Cofact...  31.2    0.22
  Tribolium_TF472                                                     26.9    2.9
  Tribolium_TF80                                                      26.9    3.7

I think the first two hits might be the results I want, right? If so, how can I separate hits like these from the weaker ones?

kurban@kurban-X550VC:~/Desktop/tf$ blastx -help
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-max_hsps_per_subject int_value] [-max_intron_length length]
    [-seg SEG_options] [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-html] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-comp_based_stats compo]
    [-use_sw_tback] [-version]

If this can be done by changing the command I ran:

blastx -query gene.fa -out tf.blastx -db TFs.fasta

how should I change it?


You can set a threshold on the E-value (around 1e-3) and/or on the bit score. That should leave only the meaningful hits for each of your query sequences.
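
A minimal sketch of both thresholds, assuming BLAST+ and a POSIX awk. The `-evalue` and `-outfmt` options are in the help output above; `-outfmt 6` writes one tab-separated line per hit, with the E-value in column 11 and the bit score in column 12 of the default layout. The function names and the bit-score cutoff in the usage line are illustrative choices, not recommendations.

```shell
# Search with an E-value cutoff and write tabular output (one line per hit).
run_blastx_tabular() {  # $1 = query FASTA, $2 = database, $3 = output, $4 = max E-value
  blastx -query "$1" -db "$2" -evalue "${4:-1e-3}" -outfmt 6 -out "$3"
}

# Optionally tighten afterwards on bit score (column 12 of default -outfmt 6).
filter_by_bitscore() {  # $1 = tabular BLAST output, $2 = minimum bit score
  awk -F '\t' -v min="$2" '($12 + 0) >= min' "$1"
}
```

For example: `run_blastx_tabular gene.fa TFs.fasta tf.blastx.tsv 1e-3`, then `filter_by_bitscore tf.blastx.tsv 50 > tf.strong.tsv`. Tabular output is also much smaller than the default report, since it omits the alignments themselves.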


Mr. RamRS,

Is there any way I can extract the aligned query sequences and their best hits from the generated blastx file? It is still pretty big.


Use the -num_alignments option?
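
As a rough sketch of that suggestion: `-num_alignments` limits how many pairwise alignments are printed per query in the default report, and `-num_descriptions` (also in the help output above) limits the one-line summaries. The function name is hypothetical, and reporting a single hit per query is just an example value.

```shell
# Default (pairwise) report, trimmed to the top N hits per query.
run_blastx_top() {  # $1 = query FASTA, $2 = database, $3 = output, $4 = hits to report (default 1)
  blastx -query "$1" -db "$2" \
    -num_descriptions "${4:-1}" -num_alignments "${4:-1}" \
    -out "$3"
}
```

For example: `run_blastx_top gene.fa TFs.fasta tf.top1.blastx`. Queries without any hit still get an entry in this report format.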


Thanks. But I have more than 140,000 query sequences, and I only want to see the ones that aligned. The result lists every query and then reports either its alignments or that "no hit was found". If I could extract only the aligned query sequences and the db sequences they aligned to, it would simplify my job a lot.
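
One possible way to get exactly that is sketched below. It assumes the search was rerun with `-outfmt 6` (tabular output), where queries without hits produce no lines at all. It also assumes the default column layout (query ID in column 1, subject ID in column 2, bit score in column 12) and that the query ID is the first word of each FASTA header. Both function names are made up for this example.

```shell
# Keep, for each query ID, the line with the highest bit score.
best_hits() {  # $1 = tabular BLAST output (default -outfmt 6 columns assumed)
  awk -F '\t' '
    !($1 in best) { order[++n] = $1 }   # remember queries in first-seen order
    !($1 in best) || ($12 + 0) > score[$1] { best[$1] = $0; score[$1] = $12 + 0 }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  ' "$1"
}

# Print only the FASTA records whose ID appears in column 1 of the hit table.
extract_queries() {  # $1 = tabular BLAST output, $2 = query FASTA
  awk -F '\t' '
    NR == FNR { keep[$1] = 1; next }    # first file: collect query IDs with hits
    /^>/ { id = substr($1, 2); sub(/[ \t].*/, "", id); show = (id in keep) }
    show                                # second file: print records that were kept
  ' "$1" "$2"
}
```

For example: `best_hits tf.blastx.tsv > tf.best.tsv` gives one line per aligned query with its best db sequence, and `extract_queries tf.best.tsv gene.fa > gene.aligned.fa` pulls out the matching nucleotide sequences.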
