Forum: List Of File Formats Used In Bioinformatics?
12.2 years ago

I have looked and looked through Biostars (and googled crazily), and for some reason I cannot find a complete list of file formats used in bioinformatics. There are some lists, like http://bioinf.comav.upv.es/courses/sequence_analysis/sequence_file_formats.html or http://www.molecularevolution.org/resources/fileformats, but they are rather incomplete or too narrow.

I was expecting that someone had compiled a file format database, but I was very disappointed. Do you know of more complete lists?

Thanks

bioinformatics • 13k views

A new program = a new format :-)


That's for a reason. Existing file formats are ridiculous! Come on, 'FASTA'? Is there a bigger mistake than this format? Then they replaced it with a 'much better' format: FASTQ. So now they store (large) BINARY data in plain text files! No wonder there are so many FASTQ 'formats'. I don't know why bioinformaticians are so afraid of binary files! In the time wasted scanning a single line of text in a FASTQ file to find its true end (LF, CRLF, etc.), a program could process over 100 entries in a binary file.

:)
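
To make the line-ending point concrete, here is a minimal Python sketch of a FASTQ reader that has to tolerate both LF and CRLF. It assumes the simple four-lines-per-record layout (no wrapped sequences), and the function name `read_fastq` is made up for this example, not taken from any library.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a text-mode FASTQ stream.

    Assumption: each record is exactly four lines (no line-wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between or after records
        # Strip either LF or CRLF, whichever the file happens to use.
        header = header.rstrip("\r\n")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# One record with CRLF endings and one with LF endings in the same stream.
mixed = io.StringIO("@r1\r\nACGT\r\n+\r\nIIII\r\n@r2\nGG\n+\n!!\n")
records = list(read_fastq(mixed))
```

Every field here needs a scan for the line terminator before it can be used, which is the overhead being complained about; a fixed-width or length-prefixed binary record would not need that scan.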


Also: http://bioinformatics.roslin.ac.uk/lawslaws/ "The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats."

12.2 years ago
Michael 55k

Have a look at the EDAM ontology. You can browse its format concept here: http://bioportal.bioontology.org/ontologies/47814/?p=terms&conceptid=EDAM_format:1915

EDAM (EMBRACE Data and Methods) is an ontology of common bioinformatics operations, topics, types of data including identifiers, and formats.

12.2 years ago
Mitch Skinner ▴ 660

I often end up working from the UCSC file format description page. I don't think there can be a comprehensive list, because people are coming up with new formats all the time.

12.2 years ago

There are many, but some of the prominent ones, especially in next-generation sequencing analysis, are:
1) FASTQ and FASTA
2) SAM/BAM
3) VCF
4) WIG
5) BED
6) GTF/GFF3
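
Several of these formats can be told apart from their first line, so a rough format sniffer is easy to sketch in Python. This is only a heuristic under stated assumptions: the function name `guess_format` and its return labels are invented for this example, BED and GTF have no reliable signature and fall through to "unknown", and BAM is only recognised as gzip-compressed data (BAM files are BGZF, which starts with the gzip magic bytes).

```python
def guess_format(head: bytes) -> str:
    """Guess a file format from the first bytes of a file (heuristic only)."""
    # gzip magic number; BAM (BGZF) and any .gz-compressed text file start like this.
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    text = head.decode("ascii", errors="replace")
    lines = text.splitlines()
    first = lines[0] if lines else ""
    if first.startswith("##fileformat=VCF"):
        return "VCF"
    if first.startswith("##gff-version"):
        return "GFF"
    if first.startswith(">"):
        return "FASTA"
    # SAM header lines use two-letter record types followed by a tab.
    # This must be checked before FASTQ, since both start with "@".
    if first.startswith(("@HD\t", "@SQ\t", "@RG\t", "@PG\t", "@CO\t")):
        return "SAM"
    if first.startswith("@"):
        return "FASTQ"
    if first.startswith(("fixedStep", "variableStep", "track type=wiggle")):
        return "WIG"
    return "unknown"


examples = {
    "FASTA": guess_format(b">seq1\nACGT\n"),
    "FASTQ": guess_format(b"@r1\nACGT\n+\nIIII\n"),
    "SAM": guess_format(b"@HD\tVN:1.6\tSO:coordinate\n"),
    "VCF": guess_format(b"##fileformat=VCFv4.2\n"),
}
```

In practice you would pass in the first few hundred bytes read with `open(path, "rb")`. A headerless SAM file or a bare BED file will come back as "unknown", so treat the result as a hint rather than a verdict.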
