Note on XML in Chemistry and Chemical Identifiers

Dear David:

The following note in the current issue of "Chemistry International"
published by IUPAC will be of interest to our Working Group:

XML in Chemistry and Chemical Identifiers

Antony (Tony) N. Davies

Steve Stein of the National Institute of Standards and Technology (NIST) in
Gaithersburg, Maryland, USA, and Alan McNaught of the Royal Society of
Chemistry, Cambridge, UK, jointly hosted a three-day meeting to discuss
IUPAC projects on XML in Chemistry and the Chemical Identifier Project. The
meeting was held at NIST from 12 to 14 November 2003.

The meeting was exceptionally well attended with over 50 attendees from
governmental and regulatory bodies, research and academic institutes, and
industry. A wide range of experts in the field were brought together for a
lively exchange of views on many of the topics covered.

XML in Chemistry
Numerous speakers related tales of XML initiatives involving chemistry in
their respective organizations, including the European Patent Office, the
International Union of Crystallography, and the U.S. Food and Drug
Administration’s Center for Drug Evaluation and Research. Various projects
within NIST itself were also discussed, such as UnitsML for scientific units
and ThermoML for thermodynamic properties. ToxML was described for
toxicology data. Despite the range of speakers’ views on the issue of XML in
chemistry, one thing became clear: the decision of IUPAC to take a leading
role to avoid duplication of effort was correct.

Some very detailed technical discussions were held on the mechanisms
surrounding the generation of controlled ontologies or data dictionaries
that highlighted the speed at which the field is moving. The number of XML
initiatives that have been born, flourished briefly, and then vanished into
obscurity was also discussed.

These arguments underlined the essential nature of the problem: research
effort would be better spent on novel ways of handling information that
enhance productivity, and on better, more advanced tools for data mining,
than on repeatedly discussing how best to move the data from A to B. With
luck, the IUPAC initiative will bring a
certain degree of stability to the information technology base in chemistry
and allow teams working in this area to concentrate on their core business
without having to worry whether their underlying technology is about to be
made obsolete!

IUPAC/NIST Chemical Identifiers (INChI)
Alan McNaught introduced the project, the aim of which is to produce a
public Chemical Identifier to uniquely identify compounds. The current
version is available for testing and has been expanded to cover organic,
inorganic, and organometallic chemistry. It should be noted that the project
acronym IChI (for IUPAC Chemical Identifiers) has been changed to INChI,
where N stands for NIST. This change was made to recognize the immense
contribution of NIST to the project.

But how does INChI work? Well, INChI starts off by looking at the chemistry
of the structure to be assigned an “Identifier.” The structure is normalized
and a number of chemical rules applied. Next, some mathematics
“canonicalises” the structure (labels the atoms), with equivalent atoms
receiving the same numbers. Finally, the labelled structure is “serialized”
and the output is a character string. Sound simple? Well, as they say in
Germany, the devil hides in the details!
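
The three steps just described (normalize, canonicalise, serialize) can be
illustrated with a toy sketch. This is not the INChI algorithm, whose details
are not given in this note; it uses a simple iterative neighbour-refinement
scheme chosen purely for illustration. The function names rank,
canonical_classes and serialize, the input format (a list of element symbols
plus a list of bonded index pairs), and the output string layout are all
assumptions. Normalization is omitted, and this refinement cannot separate
atoms in every symmetric graph.

```python
def rank(keys):
    """Map each key to a 1-based rank among the distinct keys."""
    order = {k: i + 1 for i, k in enumerate(sorted(set(keys)))}
    return [order[k] for k in keys]


def canonical_classes(elements, bonds):
    """Label atoms so that equivalent atoms share a number (toy scheme)."""
    n = len(elements)
    nbrs = [[] for _ in range(n)]
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    # Start from an ordering-independent invariant: element and degree.
    classes = rank([(elements[i], len(nbrs[i])) for i in range(n)])
    while True:
        # Refine each label using the sorted labels of the neighbours.
        keys = [(classes[i], tuple(sorted(classes[j] for j in nbrs[i])))
                for i in range(n)]
        refined = rank(keys)
        if refined == classes:  # no class was split, so we are done
            return classes
        classes = refined


def serialize(elements, bonds):
    """Write the labelled structure as a character string (made-up layout)."""
    classes = canonical_classes(elements, bonds)
    atoms = sorted(f"{c}:{e}" for c, e in zip(classes, elements))
    links = sorted("-".join(map(str, sorted((classes[a], classes[b]))))
                   for a, b in bonds)
    return ",".join(atoms) + "/" + ",".join(links)
```

For example, the heavy-atom skeleton of ethanol entered as C, C, O or as
O, C, C (each with bonds between atoms 0-1 and 1-2) yields the same string
from serialize, which is the property an identifier needs: the result does
not depend on how the atoms happened to be numbered on input.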

The normalization of the structure involves a series of layers for the raw
chemical substance, the molecular formula, and a connectivity layer, followed
where necessary by stereochemistry and isotopic layers. The connectivity
layer consists of four “sub layers,” with increasing amounts of detail,
generated as follows:

1.   disconnect all H and metal atoms to create a “skeleton”
2.   reconnect fixed hydrogen atoms to reveal tautomers
3.   optionally reconnect all mobile hydrogen atoms
4.   optionally reconnect all metal atoms
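
The first step in the list, and the optional reconnections that follow, can be
pictured as sorting the bonds of a structure into groups. The sketch below is
only an illustration under stated assumptions: the function name split_bonds,
the METALS set (a small hand-picked subset) and the input format are
hypothetical, and it lumps all hydrogen bonds together because deciding which
hydrogens are fixed and which are mobile requires tautomer analysis that is
beyond a short example.

```python
# Illustrative subset of metal symbols; an assumption, not a complete list.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn"}


def split_bonds(elements, bonds):
    """Sort bonds into skeleton, hydrogen and metal groups."""
    groups = {"skeleton": [], "hydrogen": [], "metal": []}
    for a, b in bonds:
        pair = {elements[a], elements[b]}
        if pair & METALS:
            groups["metal"].append((a, b))      # step 4 would reconnect these
        elif "H" in pair:
            groups["hydrogen"].append((a, b))   # steps 2 and 3 would reconnect these
        else:
            groups["skeleton"].append((a, b))   # step 1: what remains
    return groups


# Sodium methoxide, CH3-O-Na, with explicit hydrogens.
elements = ["C", "H", "H", "H", "O", "Na"]
bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5)]
layers = split_bonds(elements, bonds)
```

Here only the C-O bond survives in the skeleton; the three C-H bonds and the
O-Na bond are held back for the later, more detailed sub layers.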

As you would expect, this very simple approach came in for some heavy
discussion, but “the proof of the pudding is in the eating,” as they say. So
far, with some very large structural databases being analyzed in this way,
no insurmountable problems have arisen. The developers are looking for beta
testers so please get in touch through the IUPAC Web site if you are

Prof. S. C. Abrahams
Physics Department
Southern Oregon University
Ashland, OR 97520

Fax: (541) 552-6415    Tel. (541) 482-7942
Email: sca@mind.net

phase-identifiers mailing list


Copyright © International Union of Crystallography
