If you like the basic approach, thank Phil Bourne.
He did the real work of creating
pdb2cif. If you have problems with the adaptation to
cif_mm.dic or any other aspects of pdb2cif, send
Philip E. Bourne,
San Diego Supercomputer Center,
PO Box 85608, San Diego, CA 92186-9785 USA
Herbert J. Bernstein,
P.O. Box 177, Bellport, NY 11713
Frances C. Bernstein,
P.O. Box 177, Bellport, NY 11713
Current versions are available via http from:
It is available as a compressed shar pdb2cif.shar.Z (2.8 megabytes), a compressed C-shell shar pdb2cif.cshar.Z (2.8 megabytes) or as individual files, as given in the MANIFEST.
If your system cannot handle a Unix-style compressed file, you may wish to download an uncompressed shar pdb2cif.shar or an uncompressed cshar pdb2cif.cshar.
If you need a later version, and are willing to work with code that is changing, you may wish to try the next_test_version (not always present).
Release 2.3.6 corrected some comments and documentation.
Release 2.3.5 corrected the handling of long references and JRNL PUBL records. Residue names which had been quoted with a single quote mark are now quoted with a double quote mark.
Release 2.3.4 corrected the handling of two blank fields in SEQADV and some typos in STRUCT_MON_PROT tags.
Release 2.3.3.2 corrected a spurious header generated in the CIF when a PDB entry has SSBOND records and no secondary structure.
Release 2.3.3.1 was a minor revision to the web pages of version 2.3.3. URLs in comments in the program were also updated. Changes were made in the m4 script for the gnu m4 handling of format.
Release 2.3.3 was an interim revision to pdb2cif to support the changes in tokens introduced with the mmCIF dictionary 0.8.10. The only change done at this stage was to remap the names currently in use. Additional changes will be needed in the future to support parsing to make use of the additional tokens.
Release 2.3.2 has several changes for compliance with the mmCIF dictionary version 0.8.02, in response to some problems discovered by John Westbrook and the checking provided by his ciflib routines. The most visible changes are the listing of the standard residues used in an entry in the CHEM_COMP category, changing use of a quoted blank field as a value for _atom_site.auth_asym_id to a period, and moving some data items common to a loop into the loop itself.
Release 2.3.1 corrects some minor problems in release 2.3.0. In particular a problem with a bad item count and a bad date on machines running some older versions of perl has been corrected. Extra warnings for NMR entries with unusual uses of B-values or occupancies have been added.
Release 2.3.0 was an update to Release 2.2.7 correcting some minor problems with data item types, long publication names, and a failure to report CSD codens.
Release 2.2.7 was the first pdb2cif release after PDB entries compliant with the February 1996 V2.0 PDB format became available. The format of data items in ATOM_SITE lists derived from V2.0 entries was corrected, and the mapping of HETNAM and HETSYN moved from ENTITY_NAME_SYS to ENTITY_NAME_COM.
For more information and prior revisions, see CHANGES .
The following definitions would have to be appended to the mmCIF dictionary for validation of pdb2cif output:
save__struct_conn.ptnr1_atom_site_id
    _item_description.description
;              The id of an atom site for the first partner in a bond

               This data item is a pointer to _atom_site.id in the ATOM_SITE
               category.
;
    _item.name                  '_struct_conn.ptnr1_atom_site_id'
    _item.mandatory_code          no
    _item.category_id             struct_conn
    _item_linked.child_name     '_struct_conn.ptnr1_atom_site_id'
    _item_linked.parent_name    '_atom_site.id'
     save_

save__struct_conn.ptnr2_atom_site_id
    _item_description.description
;              The id of an atom site for the second partner in a bond

               This data item is a pointer to _atom_site.id in the ATOM_SITE
               category.
;
    _item.name                  '_struct_conn.ptnr2_atom_site_id'
    _item.mandatory_code          no
    _item.category_id             struct_conn
    _item_linked.child_name     '_struct_conn.ptnr2_atom_site_id'
    _item_linked.parent_name    '_atom_site.id'
     save_

save__atom_site.label_model_id
    _item_description.description
;              A component of the macromolecular identifier for this atom site.

               The value of _atom_site.label_model_id associates the atom site
               with a particular nmr model.
;
    _item.name                  '_atom_site.label_model_id'
    _item.mandatory_code          no
    _item.category_id           'atom_site'
    _item_type.code               code
     loop_
    _item_linked.child_name
    _item_linked.parent_name
              '_struct_mon_prot.label_model_id'      '_atom_site.label_model_id'
              '_struct_mon_prot_cis.label_model_id'  '_atom_site.label_model_id'
     save_

save__struct_mon_prot.label_model_id
    _item_description.description
;              This data item is a pointer to _atom_site.label_model_id in the
               ATOM_SITE category.
;
    _item.name                  '_struct_mon_prot.label_model_id'
    _item.mandatory_code          no
    _item.category_id             struct_mon_prot
     save_

save__struct_mon_prot_cis.label_model_id
    _item_description.description
;              This data item is a pointer to _atom_site.label_model_id in the
               ATOM_SITE category.
;
    _item.name                  '_struct_mon_prot_cis.label_model_id'
    _item.mandatory_code          no
    _item.category_id             struct_mon_prot_cis
     save_

save__struct_ref_seq_dif.db_seq_num
    _item_description.description
;              The sequence position in the referenced database entry
               corresponding to this point difference position.

               The use of . for _struct_ref_seq_dif.db_seq_num when a value
               has been given for _struct_ref_seq_dif.seq_num indicates that
               there has been an insertion at this position.

               The use of . for _struct_ref_seq_dif.seq_num when a value is
               given for _struct_ref_seq_dif.db_seq_num indicates that there
               has been a deletion at this position.
;
    _item.name                  '_struct_ref_seq_dif.db_seq_num'
    _item.mandatory_code          no
    _item.category_id             struct_ref_seq_dif
     loop_
    _item_range.maximum
    _item_range.minimum            .    1
                                   1    1
    _item_type.code               int
     save_
This program produces summary warnings as comments at the end of each output CIF. Each diagnostic begins with the string "#=#", so that a summary may be extracted using grep. Unconverted records are captured in the AUDIT category. Warnings and unconverted records should be examined carefully.
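As a minimal sketch of the grep extraction mentioned above, the following Bourne-shell function prints only the lines containing the "#=#" marker. The function name and the sample warning text are hypothetical and are not taken from the distribution.

```shell
# Print only the pdb2cif summary diagnostics from a CIF.
# Relies only on the "#=#" marker described above.
cif_diagnostics() {
    grep '#=#' "$1"
}

# Hypothetical sample: the warning text below is invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
data_demo
_entry.id   demo
#=# hypothetical warning text
# an ordinary comment
EOF

cif_diagnostics "$sample"
rm -f "$sample"
```

In practice you would call cif_diagnostics on a file produced by pdb2cif, such as entry.cif.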
COMPND, SOURCE, TITLE and CAVEAT
are merged into _struct.title without further parsing. A great deal of information
could be derived from the entries which use the PDB 1995 format description when
sufficient information for mapping of MOL_ID to entities is available.
REMARK records currently are mapped without parsing. There is a great deal of information in these records which can be parsed in more recent entries. It should be noted that only columns 12-70 of REMARKs are mapped to mmCIF.
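To see which part of each REMARK survives the mapping, you can cut the same columns yourself. This is only an illustration using standard tools, not the code used by pdb2cif, and the function name and sample records are hypothetical.

```shell
# Print the part of each REMARK record that is carried into mmCIF,
# i.e. columns 12-70; anything outside those columns is dropped.
remark_text() {
    grep '^REMARK' "$1" | cut -c12-70
}

# Hypothetical two-line sample file for illustration.
sample=$(mktemp)
printf 'REMARK   1 REFERENCE 1\nATOM\n' > "$sample"
remark_text "$sample"
rm -f "$sample"
```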
EXPDTA records use values which do not have a direct mapping to enumerated values for _exptl.method.
ATOM/HETATM records in newer PDB entries have a field for the XPLOR segment id. The field is mapped to _atom_site.auth_asym_id, but the data type used in the dictionary does not permit embedded blanks, which may occur in the field. The problem is side-stepped for totally blank fields by mapping them to a period.
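The blank-field rule can be sketched as follows. The sketch assumes the segment id occupies columns 73-76 of the record, as in the PDB format description. The function name is hypothetical, and this is not the code used by pdb2cif.

```shell
# Print the value that would stand in for the segment id of one
# ATOM/HETATM record: the field itself, or a period if it is all blank.
segid_value() {
    seg=$(printf '%s\n' "$1" | cut -c73-76)
    case "$seg" in
        *[![:space:]]*) printf '%s\n' "$seg" ;;  # at least one non-blank
        *)              printf '.\n' ;;          # empty or totally blank
    esac
}

# A record padded to 80 columns with a blank segment id field.
segid_value "$(printf '%-80s' 'ATOM')"
```

Fields with embedded blanks are passed through unchanged here, which is exactly the case the dictionary data type does not permit.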
Additional data items for categories like _struct_topol will
need to be added as they evolve.
The output produced is in fairly close compliance with mmCIF 0.8.2. However, we have introduced a few additional tokens via the PUBL_MANUSCRIPT_INCL category. Though we have done so in a manner which conforms to the example in the dictionary, the result is not, strictly speaking, proper syntax, since the entry_id, which is the category key, is given outside the loop.
The definitive documentation of the program is, of course, the program itself. However, for those interested in the background relationship between the PDB format and mmCIF, we have included a partial concordance.
This program is distributed as an m4 macro script
"pdb2cif.m4" from which three executable
scripts have been made:
A makefile is provided
to show how the executable scripts were made, but you need not rebuild
them. They are current. If you attempt to
rebuild the perl script you may have difficulty
with the awk to perl conversion program a2p,
which fails for this script on many
systems. A properly configured a2p is
provided in the distribution directory.
If your system is sufficiently similar to ours, then you may be able to install the program simply by making one of the three versions executable:
On most unix systems, you can make the script
into an executable program by executing
one of the following sets of commands, depending
on whether you want the perl, awk,
or old-awk version to be pdb2cif:
    chmod 755 pdb2cif.pl
    ln -s pdb2cif.pl pdb2cif

or

    chmod 755 pdb2cif.oawk
    ln -s pdb2cif.oawk pdb2cif

or

    chmod 755 pdb2cif.awk
    ln -s pdb2cif.awk pdb2cif

after which pdb2cif may be executed directly.
On some systems, you may need to use "gawk"
instead of "awk". pdb2cif.awk uses
features which are _not_ found in the original
Aho, Kernighan, Weinberger, "Awk -
a pattern scanning and processing language,"
but which have since been added on most
systems: functions and the call to "system".
If the use of function or system generates a
syntax error, you may wish to obtain the gnu version of
awk, "gawk", to be able to
run pdb2cif. The other system dependency you may have
is in the use of a system call
to "date". Some systems do not support
the 4-digit year format code %Y, and others do not
support format codes at all. In the first case,
you can change the %Y to 19%y (just
remember to fix this in the year 2000), but
in the second case, you should just comment out the offending call.
The call is marked with a WARNING comment in the m4 script.
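Before installing, you may wish to check both system dependencies described above. The following is a rough sketch with hypothetical function names. It only probes your awk and date commands and does not touch pdb2cif itself.

```shell
# Succeed if the given awk (default: awk) supports user-defined
# functions and the system() call.
awk_is_new_enough() {
    "${1:-awk}" 'function inc(x) { return x + 1 }
        BEGIN { if (inc(1) != 2) exit 1; exit (system("true") != 0) }' \
        </dev/null 2>/dev/null
}

# Succeed if "date" accepts the 4-digit year format code %Y.
date_has_four_digit_year() {
    case "$(date +%Y 2>/dev/null)" in
        [0-9][0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

awk_is_new_enough awk || echo "awk lacks function or system; try gawk"
date_has_four_digit_year || echo "date lacks %Y; see the WARNING comment"
```

Pass a different program name, for example awk_is_new_enough gawk, to probe another awk.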
If your system is different, you may have to rebuild from the pdb2cif.m4. You do this with the program make and Makefile. The first thing you need to know is where you have a working version of perl or gnu-awk. Edit Makefile to show the correct path to at least one of them. Be warned that rebuilding the perl version from a standard perl release may fail. Before you do so, you may wish to save pdb2cif.pl elsewhere. If you have a good version of perl with a version of the utility a2p built with a very large OPSMAX, then execute the command
If you have a good version of gnu-awk, then execute the command
You can test your installation with
The operation of this program is controlled
by the following flags, which may be set
by statements of the form
#define variable value
in the entry or by including header files with definitions in the list of arguments
before the entry.
The following flag is used to produce a more complete CIF entry, i.e. data items are
given, but with the value "?".
#define verbose [yes|no]
where "yes" implies verbose output.
The following flag controls conversion of text fields using the type-setting codes
used in some PDB entries
#define convtext [yes|no]
where "yes" implies the use of the 1992 PDB format description typesetting conventions.
The following flags control conversion of author and editor names
#define auth_convtext [yes|conditional|no]
#define junior_on_last [yes|no]
where "yes" for auth_convtext implies
the conversion of names independent of the setting
of convtext, "conditional" implies
"yes" only if convtext is "yes" and "no" means
to pass through the PDB style name unchanged.
If conversion is done, then "yes" for
junior_on_last will follow the COMCIFs convention
of keeping "dynastic" modifiers, such
as "Junior," "Senior,"
"II," etc., with the family name. The typesetting used differs
slightly from the 1992 PDB format description,
by forcing capitalization after "'"
and "-". If the translations done
are not satisfactory, special cases may be handled
by statements of the form
#define name PDB_form name_value
where the PDB_form is the form of the name expected in the PDB and name_value is the
form to be used by this program. All blanks in either form must be replaced by "_".
For example, you can give the following
#define name E.F.MEYER_JUNIOR Meyer_Junior,_E.F.
If the same name is defined multiple times, only the last translation given will be
used. The PDB_form is not case-sensitive, but the name_value is.
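Because every blank must become "_", it can be convenient to generate these statements rather than type them. A minimal sketch follows. The helper name is hypothetical and not part of the distribution.

```shell
# Print a "#define name" statement, replacing blanks in both the
# PDB form ($1) and the desired form ($2) with underscores.
name_define() {
    pdb_form=$(printf '%s' "$1" | tr ' ' '_')
    name_value=$(printf '%s' "$2" | tr ' ' '_')
    printf '#define name %s %s\n' "$pdb_form" "$name_value"
}

name_define 'E.F.MEYER JUNIOR' 'Meyer Junior, E.F.'
```

The output can be appended to the header file for the entry concerned.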
The following flag controls the distribution of label_seq_id
to all atom site lines. Select the value "yes"
if you do _not_ want this distribution done, but want denser atom lists
#define dense_list [yes|no]
The following flag controls the printing of TER records
#define print_ter [yes|no]
You should put any flag definitions that will
be used for most entries into a file
(a sample is included in the distribution directory),
and any definitions required
by a particular entry into a file with the name of the
entry and the extension "pdbh".
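As an illustration only, a header file for general use might be written as below. It uses just the flags described in this document. The values are arbitrary choices, not the defaults and not the contents of the sample in the distribution, and the function name is hypothetical.

```shell
# Write a header file containing one "#define" statement per flag.
# The values here are illustrative; choose your own.
write_sample_header() {
    cat > "$1" <<'EOF'
#define verbose no
#define convtext yes
#define auth_convtext conditional
#define junior_on_last yes
#define dense_list no
#define print_ter no
EOF
}

# Usage (writes into the current directory):
#   write_sample_header default.pdbh
```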
The program and the header files should be
in your current working directory. If
you wish, you may put the program into
another directory and modify your path to point
to it, but the header files must be local,
or you will need to give rooted paths
for each of them.
Then you can convert a single file named entry.ent by executing
pdb2cif default.pdbh entry.pdbh entry.ent > entry.cif
pdb2cif default.pdbh 4ins.pdbh 4ins.ent > 4ins.cif
To run with a directory of pdb files such that *.ent -> *.cif:
    foreach i (*.ent)
        set head = ($i:r)
        pdb2cif default.pdbh $head.pdbh $i > $head.cif
    end
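The loop above is written for the C-shell. If you use a Bourne-style shell instead, an equivalent may be sketched as follows. The function name and the PDB2CIF override variable are hypothetical conveniences and are not part of pdb2cif.

```shell
# Convert every *.ent file in the current directory to *.cif.
# Set PDB2CIF to use a converter other than "pdb2cif" on your path.
convert_all() {
    for i in *.ent; do
        [ -e "$i" ] || continue   # no matches: the pattern stays literal
        head=${i%.ent}            # plays the role of csh $i:r
        "${PDB2CIF:-pdb2cif}" default.pdbh "$head.pdbh" "$i" > "$head.cif"
    done
}

# Usage, from the directory holding the entries and header files:
#   convert_all
```

As with the C-shell loop, each entry is expected to have its own header file, such as 4ins.pdbh next to 4ins.ent.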
If you are reading the m4 script,
please note the macro definitions
used for the build. If you modify this program, please note the following:
You cannot use the m4 substr or index
The quotation marks used are: \036 and \037
The version for PERL is obtained by defining "PERL"
Do not use "split" directly; use "dosplit"
Defining "NOLOWER" replaces calls to the built-in "tolower" or "toupper" with loops
Defining "NOFUNCS" causes the functions we define to be expanded in-line
Defining "BADSPLIT" includes code to correct for a PERL field miscount