Information on 3D structures of proteins originally came formatted in `PDB <http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html>`_ files. Although the specification for this format had some real restrictions like a mandatory HEADER and CRYST line, many programs implemented this very poorly often writing out only ATOM records. And users became used to this.
The PDB format has some severe limitations rendering it useless for all but very small protein structures. A new format called `mmCIF <https://mmcif.wwpdb.org/>`_ has been around for decades and now is the default format for the Protein Data Bank.
The legacy PDB format has some severe limitations rendering it useless for all but very small protein structures. A new format called `mmCIF <https://mmcif.wwpdb.org/>`_ has been around for decades and now is the default format for the Protein Data Bank.
The software developed in the `PDB-REDO <https://pdb-redo.eu/>`_ project aims at improving 3D models based on original experimental data. For this, the tools need to be able to work with both PDB and mmCIF files. A decision was made to make mmCIF leading internally in all programs and convert PDB directly into mmCIF before processing the data. A robust conversion had to be developed to make this possible since, as noted above, files can come with more or less information making it sometimes needed to do a sequence alignment to find out the exact residue numbers.
The software developed in the `PDB-REDO <https://pdb-redo.eu/>`_ project aims at improving 3D models based on original experimental data. For this, the tools need to be able to work with both legacy PDB and mmCIF files. A decision was made to make mmCIF leading internally in all programs and convert legacy PDB directly into mmCIF before processing the data. A robust conversion had to be developed to make this possible since, as noted above, files can come with more or less information making it sometimes needed to do a sequence alignment to find out the exact residue numbers.
And so libcif++ came to life, a library to work with mmCIF files. Work on this library started early 2017 and has developed quite a bit since then. To reduce dependency on other libraries, some functionality was added that is not strictly related to reading and writing mmCIF files but may be useful nonetheless. This is mostly code that is used in 3D calculations and symmetry operations.
...
...
@@ -18,16 +18,26 @@ The main part of the library is a set of classes that work with mmCIF files. The
* :cpp:class:`cif::datablock`
* :cpp:class:`cif::category`
The :cpp:class:`cif::file` class encapsulates, you guessed it, the contents of a mmCIF file. In such a file there are one or more :cpp:class:`cif::datablock`s and each datablock contains one or more :cpp:class:`cif::category`s.
The :cpp:class:`cif::file` class encapsulates the contents of a mmCIF file. In such a file there are one or more :cpp:class:`cif::datablock` objects and each datablock contains one or more :cpp:class:`cif::category` objects.
Synopsis
--------
Using *libcifpp* is easy, if you are familiar with modern C++:
Although not really a core *CIF* functionality, when working with *mmCIF* files you often need to work with symmetry information. And symmetry works on points in a certain space and thus geometry calculations are also something you need often. Former versions of *libcifpp* used to use `clipper <http://www.ysbl.york.ac.uk/~cowtan/clipper/doc/index.html>`_ to do many of these calculations, but that introduces a dependency and besides, the way clipper numbers symmetry operations is not completely compatible with the way this is done in the PDB.
Points
------
The most basic type in use is :cpp:type:`cif::point`. It can be thought of as a point in space with three coordinates, but it is also often used as a vector in 3d space. To keep the interface simple there's no separate vector type.
Many functions are available in :ref:`file_cif++_point.hpp` that work on points. There are functions to calculate the :cpp:func:`cif::distance` between two points and also function to calculate dot products, cross products and dihedral angles between sets of points.
Quaternions
-----------
All operations inside *libcifpp* that perform some kind of rotation use :cpp:type:`cif::quaternion`. The reason to use Quaternions is not only that they are cool, they are faster than multiplying with a matrix and the results also suffer less from numerical instability.
Matrix
------
Although Quaternions are the preferred way of doing rotations, not every manipulation is a rotation and thus we need a matrix class as well. Matrices and their operations are encoded as matrix_expressions in *libcifpp* allowing the compiler to generate very fast code. See the :ref:`file_cif++_matrix.hpp` for what is on offer.
Symmetry operations
-------------------
Each basic symmetry operation in the crystallographic world consists of a matrix multiplication followed by a translation. To apply such an operation on a carthesian coordinate you first have to convert the point into a fractional coordinate with respect to the unit cell of the crystal, then apply the matrix and translation operations and then convert the result back into carthesian coordinates. This is all done by the proper routines in *libcifpp*.
Symmetry operations are encoded as a string in *mmCIF* PDBx files. The format is a string with the rotational number followed by an underscore and then the encoded translation in each direction where 5 means no translation. So, the identity operator is ``1_555`` meaning that we have rotational number 1 (which is always the identity rotation, point multiplied with the identity matrix) and a translation of zero in each direction.
To give an idea how this works, here's a piece of code copied from one of the unit tests in *libcifpp*. It takes the *struct_conn* records in a certain PDB file and checks wether the distances in each row correspond to what we can calculate.