EDAM Ontology

Bioinformatics operations, types of data, formats, and topics

  1. Introduction
  2. Concepts
  3. Relations
  4. Rules
  5. Sources
  6. Guidelines for annotators
  7. Existing implementations and annotations with EDAM



Introduction



Motivation

Bioinformaticians handle an increasingly large and diverse set of tools and data. Meanwhile, researchers demand ever more powerful and convenient means to organise, find, compare, select, reuse and connect the available resources. These tasks often rely on consistent, machine-understandable descriptions of the underlying components, but these have been generally lacking in documentation and metadata developed ad hoc. There is therefore an urgent need for an ontology that unifies semantically the bioinformatics concepts in common use and, for the annotator, provides a comprehensive controlled vocabulary that is broadly applicable.



What is EDAM?

EDAM (EMBRACE Data and Methods) is an ontology of common bioinformatics operations, topics, types of data including identifiers, and formats. EDAM comprises common concepts (shared within the bioinformatics community) that apply to semantic annotation of resources such as:



Scope

As a rule (with a few exceptions) EDAM only includes concepts strictly in domain of bioinformatics/computational biology. General biology, medicine, or informatics concepts are (typically) not included in the EDAM Operation, Data, or Format sub-ontologies. However, broader inter-disciplinary topics are included in the EDAM Topic sub-ontology.

For example within drug design, the concepts related to sequence, structure, and molecular interactions analyses are part of EDAM, but ones related to pre-clinical and clinical trials are not. As another example, text mining in patient records and statistical analysis with respect to knowledge about diseases are in scope, but management of patient records or statistical analysis with respect to economy or administration are not.

EDAM includes 4 main sub-ontologies of concepts organised into simple hierarchies:

Noteworthy within the Data sub-ontology is:

These provide different semantic 'axes' for annotation. For example, annotation of a Web service might include:



Architecture

EDAM has 4 components:

Concepts - These are well established and familiar bioinformatics concepts. Concepts have a name (term), a definition and one or more simple relations to other concepts defined in EDAM. Each concept has one or more intrinsic properties (reflected in the definition and relations).

Hierarchy - Every concept (excluding top-level concepts) is related to (typically) one other concept within the same sub-ontology by an is a (generalisation) relation. These relations define the sub-ontology hierarchies. All "child" concepts must share the intrinsic property of their "parent", in addition to having their own intrinsic properties.

Relations - Concepts are related by defined relation types, but these types of relation apply also between other entities outside of EDAM (for example artifacts semantically annotated by EDAM or another ontology)

Rules - There are simple rules dictating how different types of concepts are related within EDAM. They define which relations may be specified for which concepts. They reflect well established or self-evident principles.



EDAM sub-ontologies

Figure 1. The EDAM architecture is intentionally simple. Boxes indicate top-level concepts (sub-ontologies), and lines indicate types of relations that are maintained between concepts in EDAM.



Download and Status

Locations for download in OWL format:

http://edamontology.org/EDAM.owl (Always points to the last stable version)

Locations for download in OBO format:

http://edamontology.org/EDAM.obo (Always points to the last stable version. OBO-format version lacks certain details.)

All versions:

http://sourceforge.net/projects/edamontology/files/

Version 1.2 of EDAM has been released . It uses the concepts, relations and rules below. Contributions and suggestions are welcome.

EDAM is being actively developed:

For further information see the EDAM Wiki:

https://sourceforge.net/apps/mediawiki/edamontology/index.php?title=Main_Page

See the EDAM presentation at the BioOntologies SIG, ISMB 2011:

http://bio-ontologies.knowledgeblog.org/224



Viewing

EDAM is available in the following Web-based ontology browsers:



Licence

EDAM is made available for everyone to use, with the following constraints on its use or redistribution (including online access):

The intellectual content of EDAM or of any of its parts cannot be included in other projects and artifacts unless agreed with the authors of EDAM.



Publication

If you use EDAM or its parts, please reference:

Ison, J., Kalaš, M., Jonassen, I., Bolser, D., Uludag, M., McWilliam, H., Malone, J., Lopez, R., Pettifer, S. and Rice, P. (2013). EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats. Bioinformatics, 29, 1325-1332.
doi: 10.1093/bioinformatics/btt113   PMID: 23479348

This article is freely available (Open Access).



Contact

All enquiries to Jon Ison (jison@ebi.ac.uk) CC'ing Matúš Kalaš (matus.kalas@uib.no)

Thanks for valuable discussions and contributions to Peter Rice, Inge Jonassen, Dan Bolser, Rodrigo Lopez, Gert Vriend, James Malone, Steve Pettifer, Hamish McWilliam, Alan Bleasby, Mahmut Uludag, László Kaján and others.



Mailing lists

Feel free to subscribe to the mailing lists:

Once subscribed, you can mail the user and developer lists:

edamontology-announce is for announcements (very minimal traffic!) edamontology-developers is for technical discussions between EDAM developers / contributors. edamontology-users is for general discussions and announcements.




Concepts



Operation

"A function or process performed by a tool; what is done, but not (typically) how or in what context."

e.g. "Sequence alignment", "Pairwise sequence alignment", "Sequence database search".

"Operation" concepts provide mostly fine-grained concepts for annotation of tool functions.

The top-level concepts are:

The top-level operations are necessarily coarse-grained (abstract) providing a navigable top-level. They serve as placeholders for other, more specific concepts lower down in the tree.



Data

"A type of data in common use in bioinformatics."

e.g. "Sequence alignment", "Comparison matrix", "Phylogenetic tree" etc.

Data concepts:

The top-level concepts are:

Their meaning is:

Concepts within "Core data" are:



Topic

"A general bioinformatics subject or category, such as a field of study, data, processing, analysis or technology."

e.g. "Sequence analysis", "Alignment", "Sequencing", "Microarrays".

"Topic" concepts provide coarse-grained categories for annotation of diverse bioinformatics resources. They do not cover biology or computer science exhaustively.

The top-level concepts are:



Format

"A specific layout for encoding a specific type of data in a computer file or memory."

e.g. "FASTA format", "PDB format", "mmCIF" etc.

"Format" concepts:

The top-level concepts are:

All concepts are nested under "Binary format", "Textual format" and "XML", with exception of pure "HTML" or "RDF" (and "BioPAX"). The "Format (typed)" branch arranges formats by type of data and provides an additional axis over (the same set of) concepts under "Binary format", "Textual format" and "XML".



Identifier

"A label that identifies (typically uniquely) something such as data, a resource or a biological entity."

e.g. "UniProt accession", "EC number", "Gene symbol" etc.

"Identifier" concepts:

The top-level concepts are:

As for "Format", the "Identifier (typed)" branch provides an additional axis over (the same set of) concepts under "Accession" and "Name".




Relations



is a

Defines a concept as a specialisation of another concept. If A is a B, then A is a specialisation of B, and B is a generalisation of A.

The is a relation is transitive: if A is a B and B is a C then A is a C.

All relations are transitive over is a: e.g. if A has input B and B is a C then A has input C, and if A is a B and B has input C then A has input C.

e.g. "Pairwise sequence alignment" is a "Sequence alignment"



has input

Defines an "Operation" concept as reading (inputting) a "Data" concept.

e.g. "Sequence alignment construction" has input "Sequence"



has output

Defines an "Operation" concept as writing (outputting) a "Data" concept.

e.g. "Sequence alignment construction" has output "Sequence alignment"



has topic

Defines a "Data" or "Operation" concept as being within the scope of a "Topic" concept.

e.g. "PolyA signal identification" has topic "Nucleic acid sequence analysis"



is identifier of

Defines that an "Identifier" concept identifies a "Data" concept.

e.g. "Sequence accession number" is identifier of "Sequence"



is format of

Defines that a "Format" concept is the format of a "Data" concept.

e.g. "Sequence format" is format of "Sequence record"




Rules

Rules define how concepts are related.



Rules by concept type

"Topic"

"Operation"

"Data"

"Format"

"Identifier"



Rules by relation type

is a

has input

has output

has topic

is identifier of

is format of




Sources

Various resources were analysed while constructing EDAM and were used as sources listing common bioinformatics concepts in scope.

Web services and applications

Domain ontologies, taxonomies, data models

For database-related concepts

  1. dbxref.txt (databases cross-referenced in UniProtKB/Swiss-Prot)
  2. List of databases collated by the ELIXIR project
  3. Lists of databases from the Web

Other resources




Guidelines for annotators

Annotators may email Jon Ison (jison@ebi.ac.uk) and Matúš Kalaš (matus.kalas@bccs.uib.no) for help.



General guidelines

Which EDAM sub-ontology to use?

  1. "Topic" for coarse-grained annotation of diverse entities
  2. "Operation" for fine-grained annotation of tool functions
  3. "Data" for annotation of data in semantic terms
  4. "Format" for annotation of the syntax or format of data
  5. "Identifier" (as a special type of "Data") for annotation of identifiers (names and accessions) of data or other entities

Use of other ontologies

The expectation is for EDAM to be used alongside other ontologies for annotation where possible and desirable. For example, an operation that predicts specific features of a molecular sequence could be annotated with concepts from SO (Sequence Ontology) for the features.

Picking concepts

If you have many annotations to do, it will help to familiarise yourself with EDAM first using a browser (see Viewing).

  1. Identify the correct sub-ontology ("Operation", "Data" etc.) of concepts considering what is being annotated (see above)
  2. Search EDAM using keywords to find candidate concepts. Multiple searches using synonyms, alternative spellings and so are preferable.
  3. Pick the most specific concept(s) available, bearing in mind some concepts are necessarily overlapping or general.
  4. Only pick a correct concept. If it doesn't exist, request it's added to EDAM


Annotation of Web services

Model of a Web service

A Web service is considered as an arbitrary (but usually related) set of one or more operations, reducing the problem of Web service interoperation to one of compatibility between operations.

Operation

Input

Output

XML elements

Levels of annotation

Annotation of a WSDL file or associated XSD schema is possible at several levels. Assuming SAWSDL annotation (http://www.w3.org/TR/sawsdl/), the XML elements that may be annotated by EDAM concepts are:

  1. Web service (as a whole) (<wsdl:portType>)
  2. Operation (<wsdl:operation> inside <wsdl:portType>)
  3. Input parameters and their sub-parts (<xs:element>, <xs:complexType>, <xs:simpleType>, <xs:attribute>)
  4. Output parameters and their sub-parts (<xs:element>, <xs:complexType>, <xs:simpleType>, <xs:attribute>)

NB. The input and output parameters should be annotated inside the XML Schema that defines them. In case of services that are not following the highly recommended document/literal wrapped SOAP-binding style, the <wsdl:part> inside <wsdl:message> can be annotated (the same applies to faults, but meanings of faults are not modelled by EDAM)

The following annotations might be useful but are not directly recommended by SAWSDL:

  1. Enumerated values of input/output parameters (<xs:enumeration>)

For details of incorporating the SAWSDL annotations into WSDLs and XSDs, see EDAM URIs and SAWSDL annotation.



EDAM URIs and SAWSDL annotation

SAWSDL mandates the use of sawsdl:modelReference attributes for annotation. The format of EDAM URIs used inside this attribute includes the ontology name (http://edamontology.org), main sub-ontology, and the unique identifier (ID) of the particular concept:

 
<xs:element name="elementName" sawsdl:modelReference="http://edamontology.org/subontology_id">

Where ...

The value of the sawsdl:modelReference attribute is a URI pointing to the concept definition. The URI to use is in case of EDAM includes the concept's sub-ontology:

So for these 3 concepts:


EDAM_topic:0182

EDAM_operation:0292

EDAM_data:0863

We'd have

http://edamontology.org/topic_0182
http://edamontology.org/operation_0292
http://edamontology.org/data_0863

Which can be used in SAWSDL annotation, e.g.

<wsdl:portType name="myService" sawsdl:modelReference="http://edamontology.org/topic_0182">
<sawsdl:attrExtension sawsdl:modelReference="http://edamontology.org/operation_0292>
<xs:element name="outfile" sawsdl:modelReference="http://edamontology.org/data_0863>

If more than one annotation of an element is required, these can be given in the sawsdl:modelReference attribute delimited by space characters:

<wsdl:portType name="myService" sawsdl:modelReference="http://edamontology.org/topic_0182 http://edamontology.org/operation_0292">

NB. Such multiple annotations need not be in the same namespace, and need not at all to refer to the same ontology.

SAWSDL guidelines for annotating operations

One peculiarity of the SAWSDL specification is that annotations on <wsdl:operation> element inside <wsdl:portType> should be handled using a <sawsdl:attrExtensions> element. This is not a requirement for other elements.

Importantly, the <sawsdl:attrExtension> element inside the wsdl:operation must be before <wsdl:input>, <wsdl:output> and <wsdl:fault> elements (so typically after the <wsdl:documentation> element).

For example:

 <wsdl:portType name="Clustalw2PortType" sawsdl:modelReference="http://edamontology.org/topic_0186 http://edamontology.org/operation_0496">
         <wsdl:operation name="submitClustalw2">
                 <wsdl:documentation>Submit a sequence and get a jobID</wsdl:documentation>
                 <sawsdl:attrExtensions sawsdl:modelReference="http://edamontology.org/operation_0496"/>
                 <wsdl:input message="submitClustalw2Msg"/>
                 <wsdl:output message="submitClustalw2ResponseMsg"/>
         </wsdl:operation>

Some WSDL/XSD validators or SOAP libraries do not check for it, but some do require the strict order of these elements.




Existing implementations and annotations with EDAM

EMBOSS

EMBOSS applications have been annotated using EDAM and these annotations appear in corresponding Web services.

Annotated WSDL files (and associated XSD data schema) are available from:

You will see a list of service end-points with WSDL URLs. For example:

To see the data schema associated with a WSDL, you must replace "?wsdl" with "?xsd=1", "?xsd=2" or "?xsd=3". For example:

BioXSD

The BioXSD XML schema (XSD) defines exchange formats of everyday bioinformatics data types. BioXSD aims to serve as the common, canonical data model for bioinformatics Web services. It includes commonly used types including sequences, sequence annotations, alignments and references to resources:

BioXSD has been annotated with EDAM concepts.

DRCAT biological resource catalogue

A catalogue of data resources (DRCAT) is being compiled as part of the EMBOSS project. Each entry in DRCAT gives metadata on a data resource available on the Web. The metadata includes "Query" lines describe the type(s) of data available, the data format, data identifier (used to query) and a URL from which data can be retrieved. The "Query" lines and the resources themselves are annotated with EDAM concepts.

A typical entry is shown below:

(NB. The format of EDAM ids has not been upgraded to version 1.0 yet. Will be done asap.)

ID      PDB
Acc     DB-0070
Name    The RCSB Protein Data Bank
Desc    A repository for 3D biological macromolecular structure data.
URL     http://www.rcsb.org/pdb/
Cat     3D structure databases
EDAMres EDAM:0000693 | Tertiary structure
EDAMdat EDAM:0000883 | Tertiary structure
EDAMdat EDAM:0002085 | Structure annotation
EDAMfmt EDAM:0001476 | pdb
EDAMfmt EDAM:0001478 | pdbml
EDAMfmt EDAM:0001477 | mmCIF
EDAMfmt EDAM:0002331 | HTML 
EDAMid  EDAM:0001127 | PDB ID
Xref    SP_explicit | None
Xref    SP_FT | None
Xref    EMBL_explicit | None
Query   EDAM:0002085 | EDAM:0002331 | EDAM:0001127 | http://www.pdb.org/pdb/explore/explore.do?structureId=%s
Query   EDAM:0000693 | EDAM:0001476 | EDAM:0001127 | http://www.pdb.org/pdb/files/%s.pdb
Query   EDAM:0000693 | EDAM:0001477 | EDAM:0001127 | http://www.pdb.org/pdb/files/%s.cif
Query   EDAM:0000693 | EDAM:0001478 | EDAM:0001127 | http://www.pdb.org/pdb/files/%s.xml
Example EDAM:0001127 | 1rbp
Email   deposit@deposit.rcsb.org
CCmisc  EMBL DR line example "1OSN", /dbxref="PDB:12GS"
Status  Referenced

DRCAT development will proceed in harmony with bioDBCore, which proposes a community-defined, uniform, generic description of the core attributes of biological databases:

bioDBCore is under the auspices of the International Society for Biocuration:

All enquiries to Jon Ison (jison@ebi.ac.uk)

Bio-jETI

Bio-jETI allows automatic composition of functional units into software systems according to higher-level specifications using EDAM:

iHOP Web service

The iHOP Web service is annotated with EDAM concepts, either directly or via its use of BioXSD:

CBU Web services

The Web services provided by the Computational Biology Unit (CBU) of the University of Bergen and its affiliated Uni Computing are annotated with EDAM concepts:

eSysbio

The eSysbio workbench for sharing and analysing bioinformatics data using public or private Web services and R scripts. eSysbio uses EDAM to annotate and denote the type and format of data items submitted to the system.

SEQwiki

The SEQanswers wiki is an open catalogue of bioinformatics software tools, non-exclusively focussed on sequencing data analysis. SEQanswers tool wiki uses EDAM for annotation of the listed tools where applicable.



Last update: 2013-July-01