Question

Subject Sequence Id from Blast results

0

Entering edit mode

9.7 years ago

zgayk ▴ 90

Hello,

I'm just learning how to use blast from the command line. I used blastn to search contigs in our genome assembly against a local database of the chicken genome. I ran two separate runs one with -outfmt 0 (pairwise) and -outfmt 7 (tabular with comment lines).

What I would like to do is extract the dna alignments of each query and subject sequence and put these in a file with query and subject sequence ids. Because I am very new to blast from the command line I have two questions:

Instead of extracting the pairwise alignments from the outfmt 0 and sequence ids from outfmt 7, would it be easier to make some custom blast command that returns both pieces of information in one database? If so any suggestions on how to do this?
Results from both blast runs only show numbers (e.g. 20) for the subject sequence ids matching to our assembly in the chicken genome. The fasta formatted chicken assembly I am using does have a separate .gff file that does have subject sequence ids, but I can't figure out how to add the .gff file to the chicken blast database I made. I don't think the makeblastdb command supports .gff files so I don't know how to extract the subject ids from this. Any ideas you have are greatly appreciated.

Thanks,
Zach

blast • 5.2k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 9.7 years ago by zgayk ▴ 90

0

Entering edit mode

Sorry zgayk, but I am not sure of want are you trying to do.

You want to recover information which is in the outfmt 0 OR outfmt 7 but not both? What is the purpose of your data extraction? (to understand better) :)

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.7 years ago by glihm ▴ 660

0

Entering edit mode

Hello,

On thinking about it, I am only concerned with the second question in my original post. I have blast results in the outmt 0 that I want to extract all of the aligned query and subject sequences from. I also need the sequence IDS so that I know what each query and subject sequences refers to. Ideally, I would like to get the gene name for the subject sequences because the main point of this blast run is to retrieve orthologous gene sequences matching between our genome assembly of a different bird species and the chicken genome (subject match). The main problem has been the way the subject matches in the chicken genome assembly are labelled. In the outfmt 0 results, subject matches are simply labelled with a number (chromosome), making them not useful at all to see which chicken gene my scaffolds are matching to (in this case the subject match is 26):

Query= scaffold_0
Length=1506
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value
26                                                                    300     5e-79

>26
Length=5102438
 Score = 300 bits (332),  Expect = 5e-79
 Identities = 447/634 (71%), Gaps = 65/634 (10%)
 Strand=Plus/Minus
Query  304     TGTCTTAAATAGGCACTTGGTATGGAAGAAACTTGTGCATACACCCTAGTGTGGCAGGGG  363
               ||||  |||| |||| | || |||||||| ||||||||||||| |||||    |||||||
Sbjct  256300  TGTCCCAAATGGGCAATGGGCATGGAAGACACTTGTGCATACAGCCTAGCA--GCAGGGG  256243
Query  364     AAGAGCTGGAAACAGGACTTACGTCTTCAGACATGCTCAAACACATTCCTCCTCCTGGGC  423
               ||||  |||||| ||| | |||||||||   |||||||||||||||||||| |||||||
Sbjct  256242  AAGAATTGGAAATAGGCCCTACGTCTTCCAGCATGCTCAAACACATTCCTCATCCTGGGT  256183
Query  424     CAACGCATGCCAGAGAGAGTGCATAGCACAGTGGAGGTCTTTCCCACATTGGTGCCCTGG  483
               || ||||||||||           ||||||  |   | || ||||||||  ||||    |
Sbjct  256182  CAGCGCATGCCAG-----------AGCACACAGACAGCCTCTCCCACATCAGTGCATCAG  256134
Query  484     AAGGGGAGTGTGGAGGGGAGAGGTGGGTGTCTCACCAGGAAGGTTGCAGGTGCAGGTGGT  543
                || |||  |||| |||||||||| |||||| |||||||              ||||| |
Sbjct  256133  GAGAGGAACGTGG-GGGGAGAGGT-GGTGTCCCACCAGG--------------AGGTGCT  256090
Query  544     TCTGCGCTGTGCAGGACAAATGATCCAGGGCATGTTGTTTCTCCAGCGCCTCTCTTGTGC  603
               ||||  ||| | ||||||||  |||| |||||     ||||  ||||  | |||||||||

A more useful sequence id for the chicken genome seems to be in the associated .gff file, where for example the numer identifier on the left also has a sequence identifier such as GALT00000025000 which I can actually search for. My main question is how to parse the blast output to get meaningful sequence identifiers if blast can't search a .gff file. Is there an easier way to get the chicken subject sequence ids, or will I have to extract the information from the .gff file and then match it to the number sequence identifiers in the chicken blast output?

2    GeneWise        mRNA    118032794    118035699    98.24   +    .    ID=GALT00000025000;
2    GeneWise        CDS     118032794    118032874    .    +    0    Parent=GALT00000025000;
2    GeneWise        CDS     118034101    118035699    .    +    0    Parent=GALT00000025000;

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.7 years ago by zgayk ▴ 90