
Re: Phase Identifiers.



Dear Colleagues

Once the scientific discussion of phase identifiers begins in earnest, you
will hear very little from me, for I am no expert in such matters. But as
consultant I wanted to start the ball rolling with a few background ideas
to point out some of the technical considerations that we need to bear
in mind.

IDENTIFY - v.t. - to make, reckon, ascertain or prove to be the same [Late
Latin identificare - idem, the same; facere, to make] (Chambers English
Dictionary)

David's position paper accurately highlights the purpose of an identifier as
a label, name or designation that demonstrates that two instances of, or
references to, a thing are the same. In practice one recognises that proving
that two things are the same can be difficult, and usually involves
comparing limited information in a particular context. For example, I have
two correspondents called "David Brown", but I have no problem
distinguishing between their messages, because they have different
associated email addresses and my mail tool can group according to address.
On the other hand, the email identifier is useless if I try to tell someone
at a meeting "Please give this book to David Brown in the next room - you'll
recognise him by his email address". But I might be able to use information
on the person's name badge ("He's the David Brown from McMaster in Canada"),
or I might use some additional external clue ("He's the David Brown with the
beard"). In this example, I have never actually met the "other" David Brown,
so I don't know a priori whether the beard test is a good enough
discriminator.

The fact that different amounts of information may be known by different
people (or at different times) argues for an identifier - or series of
identifiers - that can be applied piecewise to match the specificity of a
search. In this respect the CCN nomenclature system for phase transitions is
quite far-sighted: it provides a number of fields, not all of which need
to be populated, and database searches could be conducted using a subset of
the available fields. Where appropriate, matching hits can be compared using
some of the other fields in an attempt to confirm or refute the identity of
the matches.
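
As a minimal sketch of such a piecewise search (in Python; the field names
and the three records are chosen purely for illustration and are not taken
from the CCN nomenclature), a query can be matched on whichever fields
happen to be known, and the hits then narrowed with a further field:

```python
# Hypothetical phase records; the field names are illustrative only.
RECORDS = [
    {"formula": "SiO2", "space_group": "P3221", "mineral_name": "quartz low"},
    {"formula": "SiO2", "space_group": "P6222", "mineral_name": "quartz high"},
    {"formula": "TiO2", "mineral_name": "rutile"},  # space group not recorded
]

def search(records, **known):
    """Return records whose populated fields agree with every known field.

    A record that lacks a queried field is neither confirmed nor refuted
    by it, so it stays in the hit list as a candidate.
    """
    return [
        r for r in records
        if all(r.get(k, v) == v for k, v in known.items())
    ]

# First pass on the one field we know, then confirm with a second field.
hits = search(RECORDS, formula="SiO2")
confirmed = search(hits, space_group="P6222")
```

Treating an unpopulated field as "not refuted" is one possible design choice;
a stricter search could equally well discard such records.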

My quibble with the scheme as it presently stands is that the fields are
identified only by their position, and certain fields may serve different
purposes according to the type of material or phase concerned. A far stronger
scheme would tag each field in some unambiguous way, so that its purpose
was clear. Ordering or absence of individual fields would then be
irrelevant.

One approach to defining an identifier would be to draw up a set of tags
representing desirable fields. A natural way (for at least some of the
members of this Working Group) would be to construct a small CIF dictionary
containing the new tags. This would have the incidental benefit of being
expandable to define other data names characterising or related to phases,
and becoming in time a fully-fledged CIF extension dictionary (or, if small
enough, being incorporated into the core dictionary).
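
To illustrate how tagging removes the dependence on field order, here is a
small Python sketch that writes and reads tag-value pairs in a simplified
CIF-like layout. The data names (_phase_formula and so on) are hypothetical
and not drawn from any existing CIF dictionary, and the reader handles only
simple one-line values, not the full CIF syntax.

```python
def write_tags(fields):
    """Render tag-value pairs, one per line, skipping unknown (None) values."""
    return "\n".join(
        f"{tag:<28} {value}" for tag, value in fields.items() if value is not None
    )

def read_tags(text):
    """Parse lines of the form '_tag value' into a dict (simplified syntax)."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("_"):
            tag, value = line.split(None, 1)
            fields[tag] = value.strip()
    return fields

# The same information supplied in a different order, with one field unknown.
a = {"_phase_formula": "SiO2", "_phase_space_group": "P6222"}
b = {"_phase_space_group": "P6222", "_phase_temperature_note": None,
     "_phase_formula": "SiO2"}

same = read_tags(write_tags(a)) == read_tags(write_tags(b))
```

Because each value travels with its tag, the two descriptions compare as
equal regardless of ordering or of which optional fields are absent.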

This does not imply that the final form of an identifier need be CIF-like; but it
would allow us to concentrate initially on rigorously defining the
characteristics that should be incorporated into the identifier. Once
the content is established, a more printer-friendly notation can be
generated if desired.

Note also that a dictionary established to describe phase identifiers could
equally well accommodate additional identifiers, suitably tagged. That is,
"external identifiers" in David's nomenclature - e.g. database reference
codes - could be added to the dictionary and serve the purpose of (a)
locating records in a specific database; (b) confirming an identification by
allowing comparison of external identifiers where they are available. As
David notes, assignment of such "external identifiers" is really the
preserve of a registration authority, and at present such authorities would
seem naturally to be managers of databases of the objects of interest. If
there are no existing databases of phases, members of this working group
with contacts in the crystallographic database community might wish to
pursue the possibility of establishing one or more such databases. (This is,
of course, not a direct charge on the Working Group!)

From David's paper it seems quite possible that more than one database
maintainer might wish to establish a database of phases; in which case
"external identifiers" could follow either or both of two routes. One is the
assignment of database-specific identifiers, e.g. _phase_identifier_ccdc and
_phase_identifier_nist; the other is participation in a common labelling
scheme based on a digital object identifier (DOI: see http://www.doi.org).
For such a common scheme to work effectively, a resolver service would be
needed to divert attempts to access a DOI to the particular database
provider responsible for generating that DOI. At present this is purely a
hypothetical (and possibly Utopian) ideal of database interoperability, but
it's an idea I would like to pursue in more general terms with the IUCr
database committee and with interested parties in CODATA.
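
The resolver idea can be sketched as a lookup from the prefix of a DOI to the
database responsible for it. Everything specific below is made up for
illustration: the prefixes, the provider URLs and the function name. A real
service would rely on the DOI system's own resolution infrastructure rather
than a local table.

```python
# Hypothetical mapping from DOI prefix to a database's record-lookup URL.
PROVIDERS = {
    "10.9991": "https://phases.example.org/record/",
    "10.9992": "https://phases.example.net/entry/",
}

def resolve(doi):
    """Return the provider URL for a DOI, or None if the prefix is unknown."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not suffix:
        raise ValueError(f"not a DOI of the form prefix/suffix: {doi!r}")
    base = PROVIDERS.get(prefix)
    return None if base is None else base + suffix

url = resolve("10.9991/phase.000123")
```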


The IUPAC Chemical Identifier (IChI)
====================================
IUPAC has been working on an identifier for chemical structures that has
also had to address many of the same issues. Some background information on
the IChI project and information on how to obtain a copy of the beta test
software is at
    http://www.iupac.org/symposia/conferences/CIandXML_jul02/index.html
I attach below the IChI for guanine, along with the following observations:

(1) The current representation is XML, but does not specify a DTD. This is
    a working convenience, and again abstracts the content from any specific
    typographic or other concrete realisation. It may well be that the
    final release version will indeed use XML (in which case a specific public
    DTD would then be needed). Equally, different representations may be used
    depending on whether the identifier is to be typeset, stored in a
    machine-readable file or otherwise displayed.

(2) The full identifier is built of layers (basic structure, stereochemistry,
    isotopy, tautomerism and overall charge) which allow partial or
    tentative identification, or database searches with selective information.

(3) The identifier is what David calls "internal": it can be assigned
    strictly from knowledge of the chemical structure itself. What remains
    to be demonstrated is the extent to which it is unique and unambiguous.
    The project leaders are confident that it has both these properties, at
    least for a very wide range of "normal" compounds. The test will come
    as they seek to encompass the "abnormal" ones.

(4) Versioning information is carried along as part of the identifier
    "metadata".

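Point (2) can be illustrated with a short Python sketch that compares two
identifiers layer by layer and reports how far they agree. The layer names
follow the list in point (2); the dictionary representation, the function
name and the placeholder values are assumptions for illustration and do not
reflect how the IChI software itself works.

```python
LAYERS = ["basic", "stereo", "isotopic", "tautomeric", "charge"]

def agreement(id_a, id_b):
    """Return the layers on which both identifiers are populated and equal,
    stopping at the first layer where they disagree.

    A layer missing from either identifier is skipped: it neither
    confirms nor refutes the match.
    """
    matched = []
    for layer in LAYERS:
        a, b = id_a.get(layer), id_b.get(layer)
        if a is None or b is None:
            continue
        if a != b:
            break
        matched.append(layer)
    return matched

# Placeholder layer values, not real IChI strings.
full = {"basic": "X1", "stereo": "S1", "charge": "0"}
partial = {"basic": "X1", "charge": "0"}
other = {"basic": "X1", "stereo": "S2", "charge": "0"}
```

Here agreement(full, partial) reports a match on the basic and charge layers,
whereas agreement(full, other) stops after the basic layer because the
stereo layers differ.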

guanine
-------
 <structure number="1" id.name="" id.value="">

  <identifier version="0.9Beta" tautomeric="0">
   <basic>N2OC1NNN1N1C*4, 4-3 6-3 8-1-5-7 9-2-7 10-4-5 11-6-9-10</basic>
   <charge></charge>
   <stereo>
    <dbond>4-3- 8-5- 11-10-</dbond>
    <sp3></sp3>
   </stereo>
  </identifier>

  <identifier.auxiliary-info version="0.9Beta" tautomeric="0">
   <!-- Auxiliary info is not a part of the identifier, it is not unique -->
   <atom.orig-nbr>4 7 11 10 3 9 2 1 5 8 6</atom.orig-nbr>
   <atom.equivalence></atom.equivalence>
  </identifier.auxiliary-info>

  <identifier version="0.9Beta" tautomeric="1">
   <basic>NOC1N*4C*4, 4-3 5-3 8-1-6-7 9-2-6 10-4-7 11-5-9-10, (H4 1 2 4 5 6 7)</basic>
   <charge></charge>
   <stereo>
    <dbond>8-6- 8-7- 9-6- 10-7- 11-9- 11-10-</dbond>
    <sp3></sp3>
   </stereo>
  </identifier>

  <identifier.auxiliary-info version="0.9Beta" tautomeric="1">
   <!-- Auxiliary info is not a part of the identifier, it is not unique -->
   <atom.orig-nbr>4 7 11 10 9 2 3 1 5 8 6</atom.orig-nbr>
   <atom.equivalence></atom.equivalence>
   <tgroup.equivalence></tgroup.equivalence>
  </identifier.auxiliary-info>
 </structure>

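Because the record is well-formed XML, its layers can be pulled out with a
standard parser even without a DTD. The sketch below uses Python's
xml.etree.ElementTree on a copy of the guanine record above with the
auxiliary information omitted; the helper name read_identifiers is
hypothetical.

```python
import xml.etree.ElementTree as ET

GUANINE = """
<structure number="1" id.name="" id.value="">
  <identifier version="0.9Beta" tautomeric="0">
   <basic>N2OC1NNN1N1C*4, 4-3 6-3 8-1-5-7 9-2-7 10-4-5 11-6-9-10</basic>
   <charge></charge>
   <stereo>
    <dbond>4-3- 8-5- 11-10-</dbond>
    <sp3></sp3>
   </stereo>
  </identifier>
  <identifier version="0.9Beta" tautomeric="1">
   <basic>NOC1N*4C*4, 4-3 5-3 8-1-6-7 9-2-6 10-4-7 11-5-9-10, (H4 1 2 4 5 6 7)</basic>
   <charge></charge>
   <stereo>
    <dbond>8-6- 8-7- 9-6- 10-7- 11-9- 11-10-</dbond>
    <sp3></sp3>
   </stereo>
  </identifier>
</structure>
"""

def read_identifiers(xml_text):
    """Map each identifier's 'tautomeric' attribute to a dict of its layers."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for ident in root.findall("identifier"):
        result[ident.get("tautomeric")] = {
            "version": ident.get("version"),
            "basic": ident.findtext("basic", default=""),
            "charge": ident.findtext("charge", default=""),
            "dbond": ident.findtext("stereo/dbond", default=""),
            "sp3": ident.findtext("stereo/sp3", default=""),
        }
    return result

ids = read_identifiers(GUANINE)
```

Empty elements such as the charge layer come back as empty strings, so a
caller can tell an unpopulated layer from a missing one.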

Regards
Brian
_______________________________________________________________________________
Brian McMahon                                             tel: +44 1244 342878
Research and Development Officer                          fax: +44 1244 314888
International Union of Crystallography                  e-mail:  bm@iucr.ac.uk
5 Abbey Square, Chester CH1 2HU, England                         bm@iucr.org


