Question

Struggling with Meme-suite, looking for troubleshooting tips

0

Entering edit mode

6 months ago

Ronin ▴ 10

Hi all, I'm working with some ATAC-seq data and am trying to use Meme to get motifs and ontology work done. I'm going to do my best to describe what I did and share some of my output files.

First, I took my peak data from MACS2 and ran it through dust after converting to fasta. I did this by using:

dust filtered_sequences.fa 20 > dust_filtered_sequences.fa

A snippet of the resulting file:

>CM020909.1:74579-74733:11
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNg
>CM020909.1:87264-87332
agcatgtagtcttacctagctactgagcatgtagtccttacctagctact
gagcgtgtagtcttacct
>CM020909.1:87947-87997
actgagcgtgtagtcttacctagctactgagcatatagtccttacctagc
>CM020909.1:88137-88178
tccttacctagctactgagcatgtagtccttacctagctac

Then, I ran meme. I used:

meme dust_filtered_sequences.fa -o ./meme-output -dna -mod anr -nmotifs 25 -minw 6 -maxw 30

Which seemed to work, because I get the output like so:

********************************************************************************
MOTIF AGTATAGTATGTCGAAAAAAGTSATARWA MEME-1  width =  29  sites = 501  llr = 17677  E-value = 3.2e-026
********************************************************************************
--------------------------------------------------------------------------------
    Motif AGTATAGTATGTCGAAAAAAGTSATARWA MEME-1 Description
--------------------------------------------------------------------------------
Simplified        A  a::a:a::a:::::aaaa9a::191966a
pos.-specific     C  :::::::1::::a::::::::25::::::
probability       G  :a::::a:::a::a::::1:9:4::14::
matrix            T  ::a:a::9:a:a::::::::18::9::4:

         bits    2.1 * ****  **    ****           
                 1.9 ******* ********** *        *
                 1.6 ******* ********** **  ***  *
                 1.4 ********************** ***  *
Relative         1.2 ********************** ***  *
Entropy          1.0 ********************** ******
(50.9 bits)      0.8 ********************** ******
                 0.6 *****************************
                 0.4 *****************************
                 0.2 *****************************
                 0.0 -----------------------------

Multilevel           AGTATAGTATGTCGAAAAAAGTCATAAAA
consensus                                  G   GT 
sequence                                          

--------------------------------------------------------------------------------

Then, I ran fasta-get-markov:

fasta-get-markov dust_filtered_sequences.fa > macs2.model

Which gets me these results:

# 0-order Markov frequencies from file dust_filtered_sequences_filtered.fa
# seqs: 340667    min: 8    max: 2019    avg: 123.1    sum: 41927288    alph: DNA
# order 0
A 2.406e-01
C 2.594e-01
G 2.594e-01
T 2.406e-01

Then, I converted my output meme.txt file from earlier into cisML:

ama ./meme-output/meme.txt dust_filtered_sequences_filtered.fa macs2.model --pvalues --o macs2_ama_output

Which then leads me to gomo. I downloaded the file (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Perca_flavescens/annotation_releases/current/100/GCF_004354835.1_PFLA_1.0/GCF_004354835.1_PFLA_1.0_gene_ontology.gaf.gz) which would give me the annotated file needed for my organism of choice (Perca fluviatilis), and I de-duplicated it so it could be run. I used a python script for that:

import pandas as pd

# Load your data into a DataFrame with specified dtypes or low_memory option
data = pd.read_csv('GCF_004354835.1_PFLA_1.0_gene_ontology.gaf', sep='\t', header=None, low_memory=False)
# Alternatively, specify dtypes for the problematic columns
# data = pd.read_csv('GCF_004354835.1_PFLA_1.0_gene_ontology.gaf', sep='\t', header=None, dtype={0: str, 8: str, 9: str, 12: str, 14: str, 15: str})

# Deduplicate based on the first column (gene ID)
deduplicated_data = data.drop_duplicates(subset=[0])

# Save the deduplicated data
deduplicated_data.to_csv('dedup_GCF_004354835.1_PFLA_1.0_gene_ontology.gaf', sep='\t', index=False, header=False)

And then I run gomo, which I did like so:

gomo --verbosity 1 --oc gomo_out --motifs ./meme-output/meme.xml dedup_GCF_004354835.1_PFLA_1.0_gene_ontology.gaf ./macs2_ama_output/ama.xml

So this is where I am at. The output from gomo is nothing, just an empty .tsv file. Sadness.

I am looking for some tips or advice for how I should proceed. It seems like I can run all of the commands using the files I have, so my guess is that it may be something wrong with either my .gaf file for gomo, or something with my processing of the sequence data (or both!). For whatever it is worth, Homer worked for finding motifs as well, but I'm not sure if I can use the output from Homer as an input for gomo (nor have I seen any literature where anyone is doing that, but maybe I'm wrong there).

Thanks for reading and I appreciate any advice. I'm happy to add any further information if it would help solve this problem.

ATAC-seq Meme Motifs • 274 views

ADD COMMENT • link 6 months ago by Ronin ▴ 10