Question

Biopython hmmscan parser

0

Entering edit mode

9.7 years ago

nvijay.1991 • 0

Hello everyone,

I have a hmmscan output in hmmscan3-domtab format with more then 10000 queries.

I want to filter each profile hit based on hit_coverage (hmm profile coverage).

I was scanning queryresult--> Hits -->Hsps --> HSPfragment this iterations and calculating hit_coverage then writing to another file (filtered file) using using SearchIO in biopython write method but while writing I found that all the hits were going to filtered file.

Below is the code I have used:

for qresult in SearchIO.parse(file_path,'hmmscan3-domtab'):
       pass
       for each_hit in qresult:
           pass
           for each_hsp in each_hit:
               pass
               if re.match("^gfam",each_hit.id) and (100*(each_hsp.hit_end  - each_hsp.hit_start)/each_hit.seq_len) >= 75:
                   SearchIO.write(qresult,filtered_file_handler,'hmmscan3-domtab')
               elif not re.match("^gfam",each_hit.id) and (100*(each_hsp.hit_end - each_hsp.hit_start)/each_hit.seq_len) >= 50:
                    SearchIO.write(qresult,filtered_file_handler,'hmmscan3-domtab')

Any suggestions for sending only filtered hits to the filtered file.

PS - I have noticed that I am writing qresult. I need some suggestions for refining on this.

Thanks,
Vijay N

hmmscan biopython SearchIO • 3.5k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.7 years ago by nvijay.1991 • 0

score 0 · Answer 1 · 2017-04-17

This is the one I wrote. I need to do some error handling for when there are no hits b/c it will break. I put stuff like this in my Jupyter profile and it speeds stuff up... a lot.

def read_hmmscan(path, set_index="query_name", sort_values="sequence|_|e-value"):
    # Build pd.DataFrame
    df_tmp = pd.read_table(path, header=None, index_col=0)
    mask_idx = df_tmp.index.map(lambda x: x.startswith("#") == False)
    tmp_list = df_tmp.loc[mask_idx,:].index.map(lambda x: [y for y in x.split(" ") if (len(y) > 0) and (y != "-")]).tolist()
    data = list()
    for line in tmp_list:
        data.append(pd.Series(line))
    df_hmmscan = pd.DataFrame(data)
    # Labels
    id_cols = ["target_name", "query_name"]
    sequence_cols = ["sequence|_|e-value", "sequence|_|score", "sequence|_|bias"]
    domain_cols = ["domain|_|e-value", "domain|_|score", "domain|_|bias"]
    misc_cols = ["exp", "reg", "clu", "ov", "env", "dom", "rep", "inc"]
    df_hmmscan.columns  = id_cols + sequence_cols + domain_cols + misc_cols
    # Force types
    df_hmmscan.loc[:,sequence_cols + domain_cols] = df_hmmscan.loc[:,sequence_cols + domain_cols].astype(float)
    df_hmmscan.loc[:,"exp"] = df_hmmscan.loc[:,"exp"].astype(float)
    df_hmmscan.loc[:,misc_cols[1:]] = df_hmmscan.loc[:,misc_cols[1:]].astype(int)
    # Reformat structure
    if set_index:
        df_hmmscan = df_hmmscan.set_index(set_index)
    if sort_values:
        df_hmmscan = df_hmmscan.sort_values(sort_values, ascending=True)
    return df_hmmscan