This library, *libcifpp*, is a generic *CIF* library with some specific additions to work with *mmCIF* files. The main focus of this library is to make sure that files read or written are valid. That is, they are syntactically valid *and* their content is valid with respect to a CIF dictionary, if such a dictionary is available and specified.
Reading a file is as simple as:
.. code-block:: cpp
#include <cif++.hpp>
cif::file f("/path/to/file.cif");
The file may also be compressed using *gzip* which is detected automatically.
Writing out the file again is also simple, to write out the terminal you can do:
.. code-block:: cpp
std::cout << f;
// or
f.save(std::cout);
// or write a compressed file using gzip compression:
f.save("/tmp/f.cif.gz");
CIF files contain one or more datablocks. To print out the names of all datablocks in our file:
.. code-block:: cpp
for (auto &db : f)
std::cout << db.name() << '\n';
Most often *libcifpp* is used to read in structure files in mmCIF format. These files only contain one datablock and so you can safely use code like this:
.. code-block:: cpp
// get a reference to the first datablock in f
auto &db = f.front();
But if you know the name of the datablock, this also works:
.. code-block:: cpp
// get a reference to the datablock name '1CBS'
auto &db = f["1CBS"];
Now, each datablock contains categories. To print out all their names:
.. code-block:: cpp
for (auto &cat : db)
std::cout << cat.name() << '\n';
But you probably know what category you need to use, so lets fetch it by name:
.. _atom_site-label:
.. code-block:: cpp
// get a reference to the atom_site category in db
auto &atom_site = db["atom_site"];
// and make sure there's some data in it:
assert(not atom_site.empty());
.. note::
Note that we omit the leading underscore in the name of the category here.
Categories contain rows of data and each row has fields or items. Referencing a row in a category results in a :cpp:class:`cif::row_handle` object which you can use to request or manipulate item data.
.. code-block:: cpp
// Get the first row in atom_site
auto rh = atom_site.front();
// Get the label_atom_id value from this row handle as a std::string
// Get the x, y and z coordinates using structered binding
const auto &[x, y, z] = rh.get<float,float,float>("Cartn_x", "Cartn_y", "Cartn_z");
// Assign a new value to the x coordinate or our atom
rh["Cartn_x"] = x + 1;
Querying
--------
Walking over the rows in a category is often not very useful. More often you are interested in specific rows in a category. The function :cpp:func:`cif::category::find` and friends are here to help.
What these functions have in common is that they return data based on a query implemented by :cpp:class:`cif::condition`. These condition objects are built in code using regular C++ syntax. The most basic example of a query is:
.. code-block:: cpp
cif::condition c = cif::key("id") == 1;
Here the condition is that all rows returned should have a value of 1 in there item named *id*. Likewise you can use other data types and even combine those. Oh, and I said we use regular C++ syntax for conditions, so you may as well use other operators to compare values:
.. code-block:: cpp
// condition for C-alpha atoms having an occupancy less than 1.0
cif::condition c = cif::key("occupancy") < 1.0f and cif::key("label_atom_id") == "CA";
Using the namespace *cif::literals* that code becomes a little less verbose:
.. code-block:: cpp
using namespace cif::literals;
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";
Conditions can also be combined:
.. code-block:: cpp
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";
// extend the condition by requiring the compound ID to be unequal to PRO
c = std::move(c) and "label_comp_id"_key != "PRO";
.. note::
Note the use of std::move here.
Using queries constructed in this way is simple:
.. code-block:: cpp
cif::condition c = ...
auto result = atom_site.find(std::move(c));
// or construct a condition inline:
auto result = atom_site.find("label_atom_id"_key == "CA");
In the example above the result is a range of :cpp:class:`cif::row_handle` objects. Often, using individual field values is more useful:
.. code-block:: cpp
// Requesting a single item:
for (auto id : atom_site.find<std::string>("label_atom_id"_key == "CA", "id"))
std::cout << "ID for CA: " << id << '\n';
// Requesting multiple items:
for (const auto &[id, x, y, z] : atom_site.find<std::string,float,float,float>("label_atom_id"_key == "CA",
"id", "Cartn_x", "Cartn_y", "Cartn_z"))
{
std::cout << "Atom " << id << " is at [" << x << ", " << y << ", " z << "]\n";
}
Returning a complete set if often not required, if you only want to have the first you can use :cpp:func:`cif::category::find_first` as shown here:
// If you're not sure the row exists, use std::optional
auto v2 = atom_site.find_first<std::optional<std::string>>("label_atom_id"_key == "CA", "id");
if (v2.has_value())
...
There are cases when you really need exactly one result. The :cpp:func:`cif::category::find1` can be used in that case, it will throw an exception if the query does not result in exactly one row.
Validation
----------
CIF files can have a dictionary attached. And based on such a dictionary a :cpp:class:`cif::validator` object can be constructed which in turn can be used to validate the content of the file.
A simple case:
.. code-block:: cpp
#include <cif++.hpp>
cif::file f("1cbs.cif.gz");
f.load_dictionary("mmcif_pdbx");
if (not f.is_valid())
std::cout << "This file is not valid\n";
If you want to know why it is not valid, you should set the global variable :cpp:var:`cif::VERBOSE` to something higer than zero. Depending on the value more or less diagnostic output is sent to std::cerr.
In the case above we load a dictionary based on its name. You can of course also load dictionaries based on a specific file, that's a bit more work:
auto &validator = cif::parse_dictionary("my-dictionary", dictFile);
cif::file f("1cbs.cif.gz");
// assign the validator
f.set_validator(&validator);
// alternatively, load it by name
f.load_dictionary("my-dictionary");
if (not f.is_valid())
std::cout << "This file is not valid\n";
Creating your own dictionary is a lot of work, especially if you are only extending an existing dictionary with a couple of new categories or items. So, what you can do is extend a loaded validator like this (code taken from DSSP):
.. code-block:: cpp
// db is a cif::datablock reference containing an mmCIF file with DSSP annotations
auto &validator = const_cast<cif::validator &>(*db.get_validator());
if (validator.get_validator_for_category("dssp_struct_summary") == nullptr)
{
auto dssp_extension = cif::load_resource("dssp-extension.dic");
In the example above we're loading the data using :doc:`/resources`. See the documentation on that for more information.
If a validator has been assigned to a file, assignments to items are checked for valid data. So the following code will throw an exception (see: :ref:`_atom_site-label`):
.. code-block:: cpp
auto rh = atom_site.front();
rh["Cartn_x"] = "foo";
Linking
-------
Based on information recorded in dictionary files (see :ref:`Validation`) you can locate linked records in parent or child categories.
To make this example not too complex, lets assume the following example file:
.. code-block:: cif
data_test
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aap Aap
2 noot Noot
3 mies Mies
loop_
_cat_2.id
_cat_2.name
_cat_2.num
_cat_2.desc
1 aap 1 'Een dier'
2 aap 2 'Een andere aap'
3 noot 1 'walnoot bijvoorbeeld'
And we have a dictionary containing the following link definition:
.. code-block:: cif
loop_
_pdbx_item_linked_group_list.parent_category_id
_pdbx_item_linked_group_list.link_group_id
_pdbx_item_linked_group_list.parent_name
_pdbx_item_linked_group_list.child_name
_pdbx_item_linked_group_list.child_category_id
cat_1 1 '_cat_1.name' '_cat_2.name' cat_2
So, there are links between *cat_1* and *cat_2* based on the value in items named *name*. Using this information, we can now locate children and parents:
.. code-block:: cpp
// Assuming the file was loaded in f:
auto &cat1 = f.front()["cat_1"];
auto &cat2 = f.front()["cat_2"];
auto &cat3 = f.front()["cat_3"];
// Loop over all ape's in cat2
for (auto r : cat1.get_children(cat1.find1("name"_key == "aap"), cat2))
std::cout << r.get<std::string>("desc") << '\n';
Updating a value in an item in a parent category will update the corresponding value in all related children:
.. code-block:: cpp
auto r1 = cat1.find1("id"_key == 1);
r1["name"] = "aapje";
auto rs1 = cat2.find("name"_key == "aapje");
assert(rs1.size() == 2);
However, changing a value in a child record will not update the parent. This may result in an invalid file since you may then have a child that has no parent:
.. code-block:: cpp
auto r2 = cat2.find1("id"_key == 3);
r2["name"] = "wim";
assert(f.is_valid() == false);
So you have to fix this yourself by inserting a new item in cat1 with the new value.
.. _splitting-rows:
Another situation is when you change a value in a parent and updating children might introduce a situation where you need to split a child. To give an example, consider this:
.. code-block:: cif
data_test
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aap Aap
2 noot Noot
3 mies Mies
loop_
_cat_2.id
_cat_2.name
_cat_2.num
_cat_2.desc
1 aap 1 'Een dier'
2 aap 2 'Een andere aap'
3 noot 1 'walnoot bijvoorbeeld'
loop_
_cat_3.id
_cat_3.name
_cat_3.num
1 aap 1
2 aap 2
And we have a dictionary containing the following link definition (reversed compared to the previous example):
.. code-block:: cif
loop_
_pdbx_item_linked_group_list.parent_category_id
_pdbx_item_linked_group_list.link_group_id
_pdbx_item_linked_group_list.parent_name
_pdbx_item_linked_group_list.child_name
_pdbx_item_linked_group_list.child_category_id
cat_2 1 '_cat_2.name' '_cat_1.name' cat_1
cat_3 1 '_cat_3.name' '_cat_2.name' cat_2
cat_3 1 '_cat_3.num' '_cat_2.num' cat_2
So *cat3* is a parent of *cat2* and *cat2* is a parent of *cat1*. Now, if you change the *name* value of the first row of *cat3* to 'aapje', the corresponding row in *cat2* is updated as well. But when you update *cat2* you have to update *cat1* too. And simply changing the name field in row 1 of *cat1* is wrong. The default behaviour in libcifpp is to split the record in *cat1* and have a new child with the new name whereas the other remains as is.
Information on 3D structures of proteins originally came formatted in `PDB <http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html>`_ files. Although the specification for this format had some real restrictions like a mandatory HEADER and CRYST line, many programs implemented this very poorly often writing out only ATOM records. And users became used to this.
The PDB format has some severe limitations rendering it useless for all but very small protein structures. A new format called `mmCIF <https://mmcif.wwpdb.org/>`_ has been around for decades and now is the default format for the Protein Data Bank.
The legacy PDB format has some severe limitations rendering it useless for all but very small protein structures. A new format called `mmCIF <https://mmcif.wwpdb.org/>`_ has been around for decades and now is the default format for the Protein Data Bank.
The software developed in the `PDB-REDO <https://pdb-redo.eu/>`_ project aims at improving 3D models based on original experimental data. For this, the tools need to be able to work with both PDB and mmCIF files. A decision was made to make mmCIF leading internally in all programs and convert PDB directly into mmCIF before processing the data. A robust conversion had to be developed to make this possible since, as noted above, files can come with more or less information making it sometimes needed to do a sequence alignment to find out the exact residue numbers.
The software developed in the `PDB-REDO <https://pdb-redo.eu/>`_ project aims at improving 3D models based on original experimental data. For this, the tools need to be able to work with both legacy PDB and mmCIF files. A decision was made to make mmCIF leading internally in all programs and convert legacy PDB directly into mmCIF before processing the data. A robust conversion had to be developed to make this possible since, as noted above, files can come with more or less information making it sometimes needed to do a sequence alignment to find out the exact residue numbers.
And so libcif++ came to life, a library to work with mmCIF files. Work on this library started early 2017 and has developed quite a bit since then. To reduce dependency on other libraries, some functionality was added that is not strictly related to reading and writing mmCIF files but may be useful nonetheless. This is mostly code that is used in 3D calculations and symmetry operations.
...
...
@@ -18,16 +18,26 @@ The main part of the library is a set of classes that work with mmCIF files. The
* :cpp:class:`cif::datablock`
* :cpp:class:`cif::category`
The :cpp:class:`cif::file` class encapsulates, you guessed it, the contents of a mmCIF file. In such a file there are one or more :cpp:class:`cif::datablock`s and each datablock contains one or more :cpp:class:`cif::category`s.
The :cpp:class:`cif::file` class encapsulates the contents of a mmCIF file. In such a file there are one or more :cpp:class:`cif::datablock` objects and each datablock contains one or more :cpp:class:`cif::category` objects.
Synopsis
--------
Using *libcifpp* is easy, if you are familiar with modern C++:
Although not really a core *CIF* functionality, when working with *mmCIF* files you often need to work with symmetry information. And symmetry works on points in a certain space and thus geometry calculations are also something you need often. Former versions of *libcifpp* used to use `clipper <http://www.ysbl.york.ac.uk/~cowtan/clipper/doc/index.html>`_ to do many of these calculations, but that introduces a dependency and besides, the way clipper numbers symmetry operations is not completely compatible with the way this is done in the PDB.
Points
------
The most basic type in use is :cpp:type:`cif::point`. It can be thought of as a point in space with three coordinates, but it is also often used as a vector in 3d space. To keep the interface simple there's no separate vector type.
Many functions are available in :ref:`file_cif++_point.hpp` that work on points. There are functions to calculate the :cpp:func:`cif::distance` between two points and also function to calculate dot products, cross products and dihedral angles between sets of points.
Quaternions
-----------
All operations inside *libcifpp* that perform some kind of rotation use :cpp:type:`cif::quaternion`. The reason to use Quaternions is not only that they are cool, they are faster than multiplying with a matrix and the results also suffer less from numerical instability.
Matrix
------
Although Quaternions are the preferred way of doing rotations, not every manipulation is a rotation and thus we need a matrix class as well. Matrices and their operations are encoded as matrix_expressions in *libcifpp* allowing the compiler to generate very fast code. See the :ref:`file_cif++_matrix.hpp` for what is on offer.
Symmetry operations
-------------------
Each basic symmetry operation in the crystallographic world consists of a matrix multiplication followed by a translation. To apply such an operation on a carthesian coordinate you first have to convert the point into a fractional coordinate with respect to the unit cell of the crystal, then apply the matrix and translation operations and then convert the result back into carthesian coordinates. This is all done by the proper routines in *libcifpp*.
Symmetry operations are encoded as a string in *mmCIF* PDBx files. The format is a string with the rotational number followed by an underscore and then the encoded translation in each direction where 5 means no translation. So, the identity operator is ``1_555`` meaning that we have rotational number 1 (which is always the identity rotation, point multiplied with the identity matrix) and a translation of zero in each direction.
To give an idea how this works, here's a piece of code copied from one of the unit tests in *libcifpp*. It takes the *struct_conn* records in a certain PDB file and checks wether the distances in each row correspond to what we can calculate.