Hi all, I'm working with some ATAC-seq data and am trying to use Meme to get motifs and ontology work done. I'm going to do my best to describe what I did and share some of my output files.
First, I took my peak data from MACS2 and ran it through dust after converting to fasta. I did this by using:
dust filtered_sequences.fa 20 > dust_filtered_sequences.fa
A snippet of the resulting file:
>CM020909.1:74579-74733:11
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNg
>CM020909.1:87264-87332
agcatgtagtcttacctagctactgagcatgtagtccttacctagctact
gagcgtgtagtcttacct
>CM020909.1:87947-87997
actgagcgtgtagtcttacctagctactgagcatatagtccttacctagc
>CM020909.1:88137-88178
tccttacctagctactgagcatgtagtccttacctagctac
Then, I ran meme. I used:
meme dust_filtered_sequences.fa -o ./meme-output -dna -mod anr -nmotifs 25 -minw 6 -maxw 30
Which seemed to work, because I get the output like so:
********************************************************************************
MOTIF AGTATAGTATGTCGAAAAAAGTSATARWA MEME-1 width = 29 sites = 501 llr = 17677 E-value = 3.2e-026
********************************************************************************
--------------------------------------------------------------------------------
Motif AGTATAGTATGTCGAAAAAAGTSATARWA MEME-1 Description
--------------------------------------------------------------------------------
Simplified A a::a:a::a:::::aaaa9a::191966a
pos.-specific C :::::::1::::a::::::::25::::::
probability G :a::::a:::a::a::::1:9:4::14::
matrix T ::a:a::9:a:a::::::::18::9::4:
bits 2.1 * **** ** ****
1.9 ******* ********** * *
1.6 ******* ********** ** *** *
1.4 ********************** *** *
Relative 1.2 ********************** *** *
Entropy 1.0 ********************** ******
(50.9 bits) 0.8 ********************** ******
0.6 *****************************
0.4 *****************************
0.2 *****************************
0.0 -----------------------------
Multilevel AGTATAGTATGTCGAAAAAAGTCATAAAA
consensus G GT
sequence
--------------------------------------------------------------------------------
Then, I ran fasta-get-markov:
fasta-get-markov dust_filtered_sequences.fa > macs2.model
Which gets me these results:
# 0-order Markov frequencies from file dust_filtered_sequences_filtered.fa
# seqs: 340667 min: 8 max: 2019 avg: 123.1 sum: 41927288 alph: DNA
# order 0
A 2.406e-01
C 2.594e-01
G 2.594e-01
T 2.406e-01
Then, I converted my output meme.txt file from earlier into cisML:
ama ./meme-output/meme.txt dust_filtered_sequences_filtered.fa macs2.model --pvalues --o macs2_ama_output
Which then leads me to gomo. I downloaded the file (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Perca_flavescens/annotation_releases/current/100/GCF_004354835.1_PFLA_1.0/GCF_004354835.1_PFLA_1.0_gene_ontology.gaf.gz) which would give me the annotated file needed for my organism of choice (Perca fluviatilis), and I de-duplicated it so it could be run. I used a python script for that:
import pandas as pd
# Load your data into a DataFrame with specified dtypes or low_memory option
data = pd.read_csv('GCF_004354835.1_PFLA_1.0_gene_ontology.gaf', sep='\t', header=None, low_memory=False)
# Alternatively, specify dtypes for the problematic columns
# data = pd.read_csv('GCF_004354835.1_PFLA_1.0_gene_ontology.gaf', sep='\t', header=None, dtype={0: str, 8: str, 9: str, 12: str, 14: str, 15: str})
# Deduplicate based on the first column (gene ID)
deduplicated_data = data.drop_duplicates(subset=[0])
# Save the deduplicated data
deduplicated_data.to_csv('dedup_GCF_004354835.1_PFLA_1.0_gene_ontology.gaf', sep='\t', index=False, header=False)
And then I run gomo, which I did like so:
gomo --verbosity 1 --oc gomo_out --motifs ./meme-output/meme.xml dedup_GCF_004354835.1_PFLA_1.0_gene_ontology.gaf ./macs2_ama_output/ama.xml
So this is where I am at. The output from gomo is nothing, just an empty .tsv file. Sadness.
I am looking for some tips or advice for how I should proceed. It seems like I can run all of the commands using the files I have, so my guess is that it may be something wrong with either my .gaf file for gomo, or something with my processing of the sequence data (or both!). For whatever it is worth, Homer worked for finding motifs as well, but I'm not sure if I can use the output from Homer as an input for gomo (nor have I seen any literature where anyone is doing that, but maybe I'm wrong there).
Thanks for reading and I appreciate any advice. I'm happy to add any further information if it would help solve this problem.