I think the goal is clear and good - to abstract away the problem of different file formatting to something that users understand: I want --> FASTA for --> bowtie. I want --> Bedgraph for --> bedtools. etc.
However, I can see this abstraction having three possibly difficult issues to resolve:
1) Tools change. Right now STAR takes only pair-split FASTQ files, not a single interleaved FASTQ file. This might change in the future, meaning that today's "--> STAR" format might not be tomorrow's "--> STAR" format.
2) Where two programs support the same format (e.g., in the future perhaps both STAR and TopHat support an interleaved FASTQ), but "--> STAR" actually means read-pair-split and "--> TopHat" means interleaved for legacy reasons, you'll get people downloading twice as much data from your site. It isn't a 1:1 mapping.
3) "My boss was very specific and told me to get him a half-open half-closed 0-based bedgraph format with integers not floats, binned in 250bp regions -- is that bedgraph or bedops formatting?" 😵🔫
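To make the interleaved versus pair-split distinction in point 1 concrete, here is a minimal sketch of converting one layout into the other. It is not something STAR or Ensembl provides; the function name `deinterleave` is made up for illustration, and it assumes strict 4-line FASTQ records with mates alternating R1, R2, R1, R2.

```python
from itertools import islice


def deinterleave(lines):
    """Split interleaved FASTQ lines into (mate1_lines, mate2_lines).

    Assumption: every record is exactly 4 lines and mates alternate
    R1, R2, R1, R2, ... (no orphan reads, no wrapped sequence lines).
    """
    it = iter(lines)
    mate1, mate2 = [], []
    while True:
        record1 = list(islice(it, 4))  # next R1 record
        record2 = list(islice(it, 4))  # its R2 partner
        if not record1:
            break  # clean end of input
        if len(record1) != 4 or len(record2) != 4:
            raise ValueError("truncated or unpaired FASTQ record")
        mate1.extend(record1)
        mate2.extend(record2)
    return mate1, mate2
```

Either layout carries the same reads, which is why offering both as separate per-tool downloads risks people fetching the same data twice.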
The idea of mapping formats to tools that support them is a fantastic idea -- however, it would be nice if Ensembl gave you the option to choose your data format very specifically like in example 3), but if you don't know what you want, take you to a handy look-up page that can stay updated - perhaps a grid of tools and the formats they currently support. Clicking on a tick mark in such a table could autofill the more detailed form out for you as per example 3).
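For what it's worth, the very specific request in example 3) is easy to state in code, which is part of why a detailed form makes sense. A minimal sketch, assuming the input is a list of 0-based positions (e.g. read starts) on one chromosome; `bin_positions` and its parameters are hypothetical names, not an Ensembl or bedtools API.

```python
from collections import Counter


def bin_positions(chrom, positions, bin_size=250):
    """Count 0-based positions in fixed-width bins, as bedGraph lines.

    Intervals are 0-based and half-open: [start, end).
    Values are integer counts, not floats.
    Simplification: empty bins are omitted, and the last bin is not
    clipped to the chromosome length.
    """
    counts = Counter(pos // bin_size for pos in positions)
    lines = []
    for bin_index in sorted(counts):
        start = bin_index * bin_size
        end = start + bin_size
        lines.append(f"{chrom}\t{start}\t{end}\t{counts[bin_index]}")
    return lines
```

Position 249 falls in the bin 0-250 and position 250 in the bin 250-500, which is exactly the half-open convention the hypothetical boss is asking for.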
It's not easy balancing the highly technical needs of some users with the ease and simplicity that others expect from non-technical software, but I'm really happy to see that Ensembl is making efforts in this area :)
One can't really post a single tool name as an answer, since it would be too short.
In general I have observed that scientists like to use meaningful gene names. One of the most common needs is "how do I go from an Ensembl gene/transcript identifier to a descriptive name".
A second important need is high-quality, reliable information rather than comprehensive, all-inclusive information. At the same time, it is important to understand the rules by which the data were created. For example, the annotation tracks in IGV are very informative, but I don't know on what basis they were selected.
It would be nice to be able to download (FTP) mapping files for mapping between common identifiers. I realize there is BioMart, but I've not had the best experience with it and ended up using the Ensembl MySQL database when trying to get the whole gene/transcript/protein set for a species.
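As a rough illustration of that identifier-to-name lookup: Ensembl GTF files carry `gene_id` and `gene_name` in the attribute column of each gene row, so a mapping can be pulled out with a few lines. This is only a sketch; `gene_names` is a made-up helper, and it assumes the usual 9-column, tab-separated GTF layout with `key "value";` attributes.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTR = re.compile(r'(\w+) "([^"]*)"')


def gene_names(gtf_lines):
    """Map gene_id -> gene_name using the 'gene' rows of a GTF.

    Genes without a gene_name attribute fall back to their own ID.
    """
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # only gene-level rows are needed
        attrs = dict(ATTR.findall(fields[8]))
        if "gene_id" in attrs:
            mapping[attrs["gene_id"]] = attrs.get("gene_name", attrs["gene_id"])
    return mapping
```

A ready-made two-column file with this content on the FTP site would save everyone from writing their own version of this.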
@Emily: You should clarify your question to indicate if you are looking to get a list of all NGS tools or just those which have a strict dependency on reference data (which will be provided by Ensembl).
I thought that was self-evident.
Perhaps an edit to the title to clarify, "popular tools that require reference data"? Current title sounds like any NGS tool is acceptable.
At a minimum, Ensembl should provide a concatenated reference genome. I know there are "toplevel" FASTAs, but for human they contain ALT contigs, which most users wouldn't want to use for general mapping. It is interesting that no official database (I am talking about UCSC/NCBI/Ensembl) provides a concatenated GRCh37, which is partly why we see so many variants of GRCh37. GRC now provides a concatenated GRCh38. I hope Ensembl can do the same, as Ensembl and GRC/UCSC use different sequence naming.
Would these be offered as bundles (like what iGenomes provides)?
I don't understand the "GTF/GFF/FASTA in different formats" part.
Not bundles, you would choose whether you needed FASTA, GFF or GTF (potentially to expand out to more file types if there is a need), then which tool you intended to use, then it would spit out a GFF (or whatever) file with the chromosome names formatted how you need them, info fields filled in how you need them etc. Even though these are standard formats, it seems that everybody actually makes them differently, so we're trying to make it so that you can get them in the style you need for the tool you're using, hence wanting to know what tools people use.
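As an example of the kind of per-tool styling meant here, below is a minimal sketch of rewriting chromosome names in a GFF/GTF from Ensembl style ("1", "MT") to UCSC style ("chr1", "chrM"). The function names are invented for illustration and this is not how Ensembl would necessarily implement it. The prefix rule is an assumption that only holds for primary chromosomes; unplaced scaffolds are named differently and would need a real lookup table, and GFF3 `##sequence-region` directives are passed through unchanged.

```python
def to_ucsc_name(seqid):
    """Naive Ensembl -> UCSC style rename for primary chromosomes only."""
    if seqid == "MT":
        return "chrM"
    if seqid.startswith("chr"):
        return seqid  # already in UCSC style
    return "chr" + seqid


def rename_seqids(gff_lines, rename=to_ucsc_name):
    """Apply `rename` to column 1 of every feature line.

    Comment/directive lines (starting with '#') and blank lines are
    returned unchanged.
    """
    out = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        fields = line.split("\t")
        fields[0] = rename(fields[0])
        out.append("\t".join(fields))
    return out
```

The same renaming would have to be applied to the matching FASTA headers, otherwise the annotation and the genome no longer agree.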
In order to save a bunch of separate answers let me get some common tools out of the way (in no particular order).
BWA
Bowtie 1/2
BBMap
GSNAP
HISAT2
STAR
TopHat
Salmon
Kallisto
Bedtools
Bedops
Can it be set up such that the FASTA and GFF files have the same names for entries? NCBI will sometimes give only the RefSeq ID in the GFF, but in the FASTA file it will give the whole title (gbk/refseq/name/etc). Not a huge issue, but it is a little annoying.
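The usual workaround, sketched below under the assumption that the ID used in the GFF is the first whitespace-delimited token of the FASTA title, is to trim each header down to that token and then check that nothing in the GFF is left unmatched. `trim_fasta_headers` and `missing_seqids` are hypothetical helper names; pipe-delimited titles would need different parsing.

```python
def trim_fasta_headers(fasta_lines):
    """Keep only the first whitespace-delimited token of each '>' header."""
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            out.append((">" + tokens[0]) if tokens else line)
        else:
            out.append(line)  # sequence lines are left untouched
    return out


def missing_seqids(gff_lines, fasta_lines):
    """Return GFF column-1 names that have no matching FASTA header."""
    fasta_ids = {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    }
    gff_ids = {
        line.split("\t")[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }
    return sorted(gff_ids - fasta_ids)
```

If the files were distributed with matching names in the first place, neither step would be needed.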