Hi there,
I am working with the MEME Suite tools in R. I started by using the web-based MEME Suite motif discovery tools to generate a matrix of motifs found in sequences in fasta format. I then used this matrix with a fasta file to run fimo to get the motif locations, and loaded the output .tsv file into R with importFimo(). I have been following this tutorial , but I can't figure out how to view the motifs.
My GRanges object looks like this:
> fimo_tsv <- "/home/full_fimo_ps2.tsv"
> fimo <- importFimo(fimo_tsv)
> fimo
GRanges object with 6266 ranges and 6 metadata columns:
seqnames ranges strand | motif_id motif_alt_id score pvalue qvalue matched_sequence
<Rle> <IRanges> <Rle> | <character> <character> <numeric> <numeric> <numeric> <character>
[1] chr8:1555744:1556744 364-378 + | MOTIF AGGCGGCAGCGAGGC 29.1697 3.57e-10 1.17e-06 aggcggcagcgaggc
[2] chr4:133865601:13386.. 377-391 + | MOTIF AGGCGGCAGCGAGGC 29.1697 3.57e-10 1.17e-06 aggcggcagcgaggc
[3] chr3:1429245:1430245 403-417 + | MOTIF AGGCGGCAGCGAGGC 29.1697 3.57e-10 1.17e-06 aggcggcagcgaggc
[4] chr4:171709153:17171.. 405-419 + | MOTIF AGGCGGCAGCGAGGC 29.1697 3.57e-10 1.17e-06 aggcggcagcgaggc
[5] chr8:77220352:77221352 406-420 + | MOTIF AGGCGGCAGCGAGGC 29.1697 3.57e-10 1.17e-06 aggcggcagcgaggc
... ... ... ... . ... ... ... ... ... ...
[6262] chr5:91494312:91495312 427-441 - | MOTIF CCATTGCCCAGGCTT 5.58788 9.92e-05 0.0949 CCCTCCCCCAGCCTC
[6263] chr12:73546425:73547.. 427-441 - | MOTIF CCATTGCCCAGGCTT 5.58788 9.92e-05 0.0949 CCCTCACCCAGCCTG
[6264] chr4:112258100:11225.. 427-441 - | MOTIF CCATTGCCCAGGCTT 5.58788 9.92e-05 0.0949 CCCTCCCCCAGCCTG
[6265] chr2:152560403:15256.. 429-443 - | MOTIF CCATTGCCCAGGCTT 5.58788 9.92e-05 0.0949 CCCTCCCCCAGCCTC
[6266] chr14:27147580:27148.. 454-468 - | MOTIF CCATTGCCCAGGCTT 5.58788 9.92e-05 0.0949 CCCTCCCCCAGCCTC
-------
seqinfo: 104 sequences from an unspecified genome; no seqlengths
>
The next step in the linked tutorial splits data based on a "parameter of interest" involving some peak metadata that I don't have. I want to skip this step, but everything falls apart when I try to move forward. This is the code from the tutorial:
# Returning the sequence of the matched regions can be used to re-derive PWMs
# from different match categories as follows (here done for different binding categories):
motifs_by_binding <- fimo_results_with_seq %>%
# Split on parameter of interest
split(mcols(.)$peak_binding_description) %>%
# Convert GRangesList to regular list() to use `purrr`
as.list() %>%
# imap passes the list entry as .x and the name of that object to .y
purrr::imap(~{
# Pass the sequence column to create_motif to generate a PCM
create_motif(.x$sequence,
# Append the binding description to the motif name
name = paste0("E93_", .y))
})
# Motifs from each category can be visualized with universalmotif::view_motifs()
motifs_by_binding %>%
view_motifs()
I have been working on this for two days and can't figure out how to view my motifs. My question for you all is: How can I view motifs from a GRanges object like in the tutorial? Here are some things I have tried:
- Omitting the split() line entirely gives the error: Error in getListElement(x, i, ...) : GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment
- Changing split() line to: split(mcols(.)) , split(mcols)), and mcols() gives various errors such as: Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.list': error in evaluating the argument 'f' in selecting a method for function 'split': unable to find an inherited method for function ‘mcols’ for signature ‘"missing"’ and Error in h(simpleError(msg, call)) : error in evaluating the argument 'input' in selecting a method for function 'create_motif': $ operator is invalid for atomic vectors
- Breaking apart pipe like:
gives the error: Error in h(simpleError(msg, call)) : error in evaluating the argument 'input' in selecting a method for function 'create_motif': $ operator is invalid for atomic vectorstest <- mcols(fimo) test <- as.list(test) test <-test %>% purrr::imap(~{ # Pass the sequence column to create_motif to generate a PCM create_motif(.x$matched_sequence, # Append the binding description to the motif name name = paste0("E93_", .y)) })
I don't quite understand how imap works or how to break that part of the pipe apart, so perhaps there is something that can be done there.
- Limiting data to 1 motif. I thought perhaps there was a problem with trying to plot multiple motifs, so I subset my data to only include data from the first motif. This gave the same "$ operator is invalid for atomic vectors" error.
I would like to view the motifs using view_motif() and plot_sequence_heatmap(). Any suggestions, please and thank you!
This is a non-answer but i dont have time at present to download the stuff, get as far as you have etc. (sorry). What I do have time to do is tell you that my next step would be to download some peak data from somewhere and see if that solves the problem.
Since that's where its breaking down, just humor the vignette and provide it what it wants. Most vignettes should provide everything you need for a basic tutorial. have they not done that?
I appreciate your non-answer! I totally understand that downloading all the stuff would be time consuming. I guess I was hoping that perhaps someone who has worked with the R memes (specifically fimo) package might have an idea. Your suggestion makes sense, but I don't know anything about peak data and am concerned it would alter my analysis. I agree that most vignettes should provide everything you need for a basic tutorial, but it seems simply looking at the motif output from fimo is more challenging than I thought.
i 100% agree. in fact, if it were possible to more than 100% agree - say 145% or even 200% - i would go with that instead. But the idea is that you will figure out what is happening by running it more easily than you will figure it out by not running it.
the other thing is, as far as worry it will alter your analysis, thats not hard to test right? you just download 2 tracks from UCSC or wherever ( https://hgdownload.soe.ucsc.edu/downloads.html ). Do the two generate different results? If so, since only difference between the two is the peak data, you have your answer - Yes, clearly it is affecting your analysis. If not you are now done with this question. If yes, you don't naively use that data, you chunk both sets of results in a folder called 'prep' or 'benchmarking' or what have you, but then you study the results generated to see if it helps you understand where you are going wrong.
I am building a tool at present that leverages over 4,000 data modalities from diverse fields, including toxicology, pharmacology, genetics, protein biochemistry, systems biology, etc. I am not an expert in every one of those fields, but I can build a tool that functions in a highly reproducible way irrespective of that by thinking scientifically about the problems. the suggestion above is along those lines. You don't know if it will affect your results? well cool, that sounds to me like it can be tested. similarly, i routinely verify whether or not any assumptions made or knowledge I lack appears to interfere with the results I generate under a variety of circumstances...
to me, more than the specific issue that is generating the error, THIS is the problem. in the long term definitely, in the short term, also definitely. if you dont understand peak data, and the package you are running may or may not require its use, but we dont know because we haven't read the docs or studied peak data, THAT is what needs to change first.
I may have time tonight to get it running. no promises but will try.
VAL
Thanks for your thoughts. Since the only requirements to use runFimo() are a fasta file and motifs, I feel like what I am doing should work and I am probably making a simple error in syntax or something. I will study peak data today and follow the rabbit hole, per your suggestion.
ok so i looked at this - it looks like the peak data are just being used to grab the genomic ranges using Granges ... which makes sense because peak data is a metric of tf binding occupancy (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3799470/).
so if you have valid ranges thats all you need from the peak data. do you have the fastas formatted exactly as they say a the top of the page?
also, which problem do you want solved the most:
1) to run fimo 2) to get the sequences 3) ???
becacuse you dont need this software to do 2., and you can make a heatmap without it too. so what exactly is it that it is most important we solve?
VAL
what version of R are you on? I might as well synchronize.
You're so cool for looking into this further with me. I am using R version 4.2.1. I arrived at the same conclusion, that peak data are being used to include interesting metadata, but aren't necessary. My fastas are formatted correctly, but that isn't the issue because I am using importFimo() to import a .tsv generated from the MEME suite's fimo test on the web browser. Honestly, I think the first problem is figuring out how to turn the GRanges object into a list because the create_motif function works with a list but not with a GRanges object. In the tutorial, they use split() with a metadata column and this changes the object type from GRanges object to a CompressedGRangesList. Then, they use as.list() and move on to create_motif. If you try to skip the split(), and just say as.list(), you get the error that "GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment". If I try to use split() on a different metadata quality, like motif_id, then I end up with a list of GRanges objects which gives an error when I try to create the motifs because "unable to find an inherited method for function ‘create_motif’ for signature ‘"GRanges"’". In summary, I believe the list formatting is the first problem!
did you run the tutorial to completion? it allllllways helps. will start in a sec i need install 4.2 and dependencies
No, I am still stuck in the same place outlined in my original question. The split() line.