How does IdentifyLife actually work?

This page is a relatively technical description of IdentifyLife for users who want to know the nuts and bolts of the IdentifyLife system. For more general information about using IdentifyLife, see the other topics on the Home page.

IdentifyLife comprises a set of web-based databases and applications for capturing, storing, managing and deploying descriptive information about taxa. It uses HBase, an open-source, distributed, versioned, column-oriented data store modeled after Google's Bigtable architecture. HBase can host very large tables (billions of rows by millions of columns) and provides fast searching across records.

IdentifyLife tables

The three core tables in IdentifyLife are a taxon table, a characters table and a descriptlets table.

The taxon table used in IdentifyLife is based on the Catalogue of Life, a growing catalogue intended to include all known organisms, and uses the taxonomic hierarchy captured in that catalogue. By using a hierarchy, coding in IdentifyLife can be done with maximum efficiency - features can be coded at the highest appropriate taxonomic level. IdentifyLife has been built to handle multiple alternative hierarchies; at the moment, for convenience, it deals only with the Catalogue of Life hierarchy.
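
As an illustration of coding at the highest appropriate taxonomic level, the following minimal sketch looks up a score by walking up a parent chain until a coded value is found. The dictionaries PARENT and CODED, the example scores and the function inherited_value are hypothetical and are not part of IdentifyLife; the sketch simply assumes that a value coded for a higher taxon applies to the taxa below it.

```python
from typing import Optional

# Hypothetical child -> parent links for a tiny hierarchy (illustrative only).
PARENT = {
    "Asplenium nidus": "Asplenium",
    "Asplenium": "Aspleniaceae",
    "Aspleniaceae": "Polypodiopsida",
}

# Hypothetical scores: (taxon, character state) -> value.
# The feature is coded once, at the highest taxon where it holds.
CODED = {
    ("Polypodiopsida", "reproduction: by spores"): "Present",
}


def inherited_value(taxon: str, state: str) -> Optional[str]:
    """Return the value coded for `taxon` or for its nearest coded ancestor."""
    current: Optional[str] = taxon
    while current is not None:
        if (current, state) in CODED:
            return CODED[(current, state)]
        current = PARENT.get(current)  # becomes None once we pass the root
    return None  # nothing coded anywhere on the path to the root
```

With this layout a single coded entry serves every taxon below it, which is the efficiency gain described above.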

The characters table in IdentifyLife stores characters and character states. For each character and state, IdentifyLife stores a name and definition, and optionally notes on the usage of the character and one or more links to images illustrating the character or state. All characters are stored in a single table, with any given character associated with one or more IdentifyLife projects.

The descriptlets table stores the actual descriptive data at the heart of IdentifyLife. A descriptlet is a unitary statement about a feature of a particular taxon, and has the form "taxon x, for character state y, has the value z with modifiers m1, m2 etc". Taxon x and character state y in the descriptlet are references to a taxon in the taxon table and a character state in the characters table. The value in the descriptlet is either one of Present, Absent or Don't know (for multistate characters) or a range of numbers (for numeric characters). The modifiers qualify the descriptlet's value. Modifiers are of four types - certainty modifiers, inference modifiers, frequency modifiers and misinterpretation modifiers. These allow a score to be set as certain or possible, given or by inference, always or sometimes, and true or by misinterpretation.
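
To make the shape of a descriptlet concrete, here is a minimal sketch of one as a data structure. The class and field names (Score, Modifiers, Descriptlet, taxon_id, state_id) and the default modifier values are assumptions made for illustration; they do not describe IdentifyLife's actual schema or its HBase layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Tuple, Union


class Score(Enum):
    """Values available for a state of a multistate character."""
    PRESENT = "Present"
    ABSENT = "Absent"
    DONT_KNOW = "Don't know"


@dataclass(frozen=True)
class Modifiers:
    """One flag per modifier type; the defaults here are assumed."""
    certain: bool = True             # certain vs possible
    inferred: bool = False           # given vs by inference
    always: bool = True              # always vs sometimes
    misinterpretation: bool = False  # true vs by misinterpretation


@dataclass(frozen=True)
class Descriptlet:
    """Taxon x, for character state y, has the value z with modifiers."""
    taxon_id: str   # reference into the taxon table
    state_id: str   # reference into the characters table
    value: Union[Score, Tuple[float, float]]  # a score or a numeric range
    modifiers: Modifiers = field(default_factory=Modifiers)

    def __post_init__(self) -> None:
        # A numeric range must be ordered (low, high).
        if isinstance(self.value, tuple) and self.value[0] > self.value[1]:
            raise ValueError("numeric range must be (low, high)")
```

For example, Descriptlet("taxon:1", "state:7", Score.PRESENT, Modifiers(always=False)) would record a state that is sometimes present in a taxon, using made-up identifiers.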

Two types of projects are currently supported in IdentifyLife - standard list projects and key projects.

One or more users create a standard list project when they plan to use IdentifyLife to create an agreed, standardised character list (ontology) for a given set of organisms. For example, one group of users may create a standardised list of characters for crustaceans, another for ferns. Standard lists may also be created for subtaxa of these, such as crabs (a subgroup of crustaceans) or the genus Asplenium (a genus of ferns). Characters may be common to two or more standard lists, so the list for crabs may be a mix of special characters designed just for crabs and more general characters borrowed from the standard list for crustaceans.

Key projects are created when one or more users plan to build a key to a given set of organisms, such as a key to the crabs of Hawaii or the ferns of Thailand. A key comprises a set of characters chosen from a standard list, a set of taxa chosen from the Catalogue of Life taxa stored in the taxon table, and a set of descriptlets provided to IdentifyLife through a key project. Building a key involves choosing which character states are present and which are absent in any given taxon. Whenever a character state is scored as present, absent or "don't know" in a key project, a descriptlet is created in the descriptlets table.

Since all descriptlets in IdentifyLife are stored in a single large table, different key projects may share descriptlets, just as they may share characters, states and taxa. This means that whenever a user working in one project provides some information to IdentifyLife, other users working on other projects may benefit. IdentifyLife's core principle is sharing data rather than reinventing data.
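
The sharing described here can be sketched with a single in-memory table that every key project reads from and writes to. Everything in the sketch is hypothetical (the DESCRIPTLETS dictionary, the functions score and visible_to, and the taxon and state labels), and it deliberately ignores attribution, modifiers and conflicting scores.

```python
from typing import Dict, Set, Tuple

# One shared table for all projects: (taxon, character state) -> value.
DESCRIPTLETS: Dict[Tuple[str, str], str] = {}

ALLOWED = {"Present", "Absent", "Don't know"}


def score(taxon: str, state: str, value: str) -> None:
    """Scoring a state in any key project writes to the shared table."""
    if value not in ALLOWED:
        raise ValueError(f"unsupported value: {value!r}")
    DESCRIPTLETS[(taxon, state)] = value


def visible_to(taxa: Set[str], states: Set[str]) -> Dict[Tuple[str, str], str]:
    """Descriptlets relevant to a key, whichever project created them."""
    return {
        key: value
        for key, value in DESCRIPTLETS.items()
        if key[0] in taxa and key[1] in states
    }
```

If one project scores a taxon and a second project later includes the same taxon and character state, visible_to returns that score for the second project without it being entered again.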

The implications of sharing - normativity, attribution and relationships

The core principle of IdentifyLife - sharing data - has brought many challenges to its design. Three features are used to help manage IdentifyLife data in a shared collaboration space. These are normativity of characters and states, attribution of data, and relationships between items.

Normativity is at the heart of IdentifyLife's collaboration model, and allows users to have confidence when using shared items. An item in IdentifyLife, such as a character or character state, may be declared to be normative when it is complete and agreed and its creator(s) agree that it should not need to change in the future. For example, when a character is created in a standard list project, provided with agreed names, definitions, usage notes and images and is regarded by the project's collaborators as complete, it may be marked as normative. If another user then includes that character in their own project, they can be confident that it will not be edited by another user or by the character's originator.

Normativity does not lock characters and character states completely, as instances will arise when characters need to be revised. But it allows for some control over the revision process. For example, once a character or state is declared normative, if the originator of the character tries to change it and the character or state is already in use in another project, IdentifyLife will raise a warning and request that the edit be discussed with the other users of the character or state before the edit proceeds. Above all, the meaning of a character or state declared normative should not be changed, as other users will have proceeded to use it under its original meaning.
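
The warning rule just described can be expressed as a small check. This is a sketch under stated assumptions, not IdentifyLife's implementation: the Character class, its fields and the function edit_warnings are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Character:
    name: str
    owner_project: str                               # project that created it
    normative: bool = False
    used_in: Set[str] = field(default_factory=set)   # projects that include it


def edit_warnings(character: Character) -> List[str]:
    """Return warnings to show before an edit; an empty list means none."""
    # Only use by projects other than the owner triggers a warning.
    others = sorted(character.used_in - {character.owner_project})
    if character.normative and others:
        return [
            f"'{character.name}' is normative and in use in: "
            f"{', '.join(others)}. Discuss the edit with those users first."
        ]
    return []
```

Note that the check warns rather than blocks, matching the point that normativity does not lock an item completely.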

Attribution is IdentifyLife's way of tracking the ownership of items and the contributions made by IdentifyLife's members. All core items in IdentifyLife other than taxa - characters, states, their definitions and images, and descriptlets - are attributed to the user who provided or created them. These attributions are then used to report, on the My IdentifyLife page, the extent of a user's contributions.

Relationships between items are used to help manage information in IdentifyLife and to maximize the potential for sharing information across individual projects while at the same time maximizing flexibility for contributors. If one contributor uses IdentifyLife to build a key to, say, ferns of Thailand and another uses it to build a key to ferns of Vietnam, both will ideally base their characters on those in a standard list for ferns. IdentifyLife will maintain three sets of characters (one for each key and one for the standard list) and establish a reference between each key character and its normative equivalent in the standard list. If a character in the key to ferns of Thailand and one in the key to ferns of Vietnam both reference the same normative character, then IdentifyLife can treat them as the same character. In future, when contributors are able to upload pre-existing, legacy keys into IdentifyLife, this feature will allow the characters in the uploaded key to be associated with characters in a standard list, and through that to other characters elsewhere in IdentifyLife. This will allow data to flow from one key to another accurately without mismatch or misinterpretation.
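
A minimal sketch of this matching follows. The NORMATIVE_REF mapping, the character names and the identifiers in it are invented for illustration, as is the function same_character; the only rule taken from the text above is that two key characters are treated as the same when both reference the same normative character.

```python
from typing import Dict, Tuple

CharKey = Tuple[str, str]  # (key project, character name within that project)

# Hypothetical references from key characters to standard-list characters.
NORMATIVE_REF: Dict[CharKey, str] = {
    ("Ferns of Thailand", "frond division"): "ferns:frond-division",
    ("Ferns of Vietnam", "leaf dissection"): "ferns:frond-division",
    ("Ferns of Vietnam", "sorus shape"): "ferns:sorus-shape",
}


def same_character(a: CharKey, b: CharKey) -> bool:
    """True only if both characters reference the same normative character."""
    ref_a = NORMATIVE_REF.get(a)
    ref_b = NORMATIVE_REF.get(b)
    # A character with no reference cannot be matched to anything.
    return ref_a is not None and ref_a == ref_b
```

In this sketch the two keys may name a character differently and still be matched, because the comparison is made on the shared reference rather than on the local name.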

The IdentifyLife community

IdentifyLife has a large vision: to provide a space where descriptive information on all the world's organisms can be captured, managed and coordinated. For this vision to be achieved, IdentifyLife relies on a community of users and contributors. However, IdentifyLife will fail if contributors find that contributing takes up time and resources that would be better spent on other projects. For this reason, we seek to make IdentifyLife a time- and labour-saving space. By sharing information between projects, each project should over time benefit compared with individuals working on disconnected, isolated projects.

We also value feedback, discussion and engagement by our users on the direction of IdentifyLife and its design. Throughout IdentifyLife are blogs and community forums where discussions can take place and ideas and feedback can be provided, ranging from IdentifyLife-wide blogs and discussions to project-specific ones. Over time we also intend to provide RSS feeds, email alerts and active polling to help promote engagement with the IdentifyLife project.

IdentifyLife's open architecture - what else can be done with its data?

One principal use case for IdentifyLife's descriptive data is to allow identifications of organisms, ranging from the Key to All Life project to tailored keys for individual groups of organisms in specific regions. However, descriptive data can be used for much more than identification, and IdentifyLife is being built with an open, web service architecture to encourage development of more uses for its data. Many IdentifyLife functions can be queried using open web services to return, for example, all descriptive information for a given set of taxa, all taxa that match a given set of character states, or the differences or similarities between any two or more taxa. These web services can be used by third parties to build their own identification systems or annotation services for phylogenetic trees, or to answer novel questions such as where in the world particular character states are most common. We expect that IdentifyLife's open architecture will allow questions to be asked of its data that no-one has considered before, and will support a broad range of research and activity beyond simple identification.
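
The kinds of query listed above can be illustrated over a small in-memory data set. This sketch does not call or describe IdentifyLife's web services, whose endpoints and formats are not given here; the SCORES data and the functions taxa_matching, differences and similarities are hypothetical stand-ins for the logic such queries might apply.

```python
from typing import Dict, Iterable, Set

# Hypothetical data: taxon -> set of character states scored as present.
SCORES: Dict[str, Set[str]] = {
    "Taxon A": {"claws: present", "shell: hard"},
    "Taxon B": {"claws: present", "shell: soft"},
    "Taxon C": {"claws: absent", "shell: hard"},
}


def taxa_matching(states: Iterable[str]) -> Set[str]:
    """All taxa that have every one of the given character states."""
    wanted = set(states)
    return {taxon for taxon, present in SCORES.items() if wanted <= present}


def differences(a: str, b: str) -> Set[str]:
    """States present in exactly one of the two taxa."""
    return SCORES[a] ^ SCORES[b]


def similarities(a: str, b: str) -> Set[str]:
    """States present in both taxa."""
    return SCORES[a] & SCORES[b]
```

An identification system built on such queries would narrow the candidate set by calling taxa_matching with each additional state the user observes.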