.. _page-running: Generating a ChIMES model ======================================================== This section assumes the goal of obtaining an atomistic ChIMES model from Kohn-Sham density functional theory (DFT) molecular dynamics (MD) generated reference data. However, ChIMES models can be fit to a broad variety of reference calculations. For additional information see :ref:`Citing ChIMES `. At its simplest, ChIMES models are generated by per-atom force matching to DFT reference data, and the ChIMES Parameter Generator, ``chimes_lsq``, aims to minimize the following objective function: .. math:: :nowrap: \begin{equation} F_{\mathrm{obj}} = \frac{1}{n_\mathrm{f}3n_{\mathrm{a}}} \sum^{n_\mathrm{f}}_{i=1} \left[ \sum^{n_\mathrm{a}}_{j=1} \sum^{3}_{k=1} w^2_{\mathrm{F}_{ijk}}(\Delta F_{ijk})^2 \right], \end{equation} where :math:`\Delta F = F^{\mathrm{DFT}} - F^{\mathrm{ChIMES\{c\}}}`. :math:`F_{\mathrm{obj}}`, and :math:`\{c\}` are the weighted root-mean-squared error (RMSE) and model coefficients, respectively, and the number of frames and atoms are given by :math:`n_\mathrm{f}` and :math:`n_\mathrm{a}`, respectively. :math:`F_{ijk}` indicates the :math:`k^{\mathrm{th}}` Cartesian component of the force on atom :math:`j` in configuration :math:`i`. The superscripts "ChIMES" and "DFT" indicate forces/energies predicted from the force-matched model and the DFT-MD training trajectory, respectively. Because ChIMES is entirely linear it its fitted parameters, the above optimization problem can be recast as the following over-determined matrix equation: .. math:: :nowrap: \begin{equation} \mathrm{\mathbf{wMc}} = \mathrm{\mathbf{wb}_{DFT}}, \end{equation} where :math:`\mathrm{\mathbf{b}_{DFT}}` is the vector of DFT-generated per-atom force components, :math:`F^{\mathrm{DFT}}_{ijk}`, :math:`\mathrm{\mathbf{w}}` is a diagonal matrix of weights to be applied to the elements of :math:`\mathrm{\mathbf{b}_{DFT}}` and rows of :math:`\mathrm{\mathbf{M}}`, :math:`\mathrm{\mathbf{c}}` is the diagonal matrix of generated model parameters, and :math:`\mathrm{\mathbf{M}}` is the overdetermined ChIMES design matrix for which elements of :math:`\mathrm{\mathbf{M}}` are given by: .. math:: :nowrap: \begin{equation} M _{a,b} = \frac{\partial X _{a,\mathrm{ChIMES} \{c\}}}{\partial c_{b}} \end{equation} Please note that the matrix :math:`\mathrm{\mathbf{M}}` is referred to as matrix :math:`\mathrm{\mathbf{A}}` in the software and reported as ``A.txt``. As discussed in further detail below, we note that per-configuration energies and stress tensors can be considered in ChIMES fits, in additional to per-atom forces. Additional details are provided below in the following sections: * :ref:`Essential workflow ` * :ref:`Input files and options ` * :ref:`Running ` * :ref:`Solving for ChIMES parameters ` * :ref:`Caveats ` --------------- .. _sec-workflow: Essential workflow ---------------------- The basic ChIMES model development workflow comprises 5 main steps, briefly outlined below. Note: 1. Reference data preparation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * DFT-MD simulations are run for the target material over the target thermodynamic condition(s) * Per-atom forces and optionally the system energy and stress tensor are extracted from the trajectory and saved in a "training trajectory" file, in a ``chimes_lsq`` compatible format 2. Hyperparameter selection ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * ChIMES model hyperparameters (e.g. bodiedness, polynomial orders, cutoffs, transformation function parameters, etc.) are specified in the ``chimes_lsq`` input file (e.g., typically named ``fm_setup.in``), along with the training trajectory file 3. Design matrix generation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * ``chimes_lsq`` is executed to produce the design matrix (``A.txt``), the vector of forces (and optionally energies and stresses) from DFT (``b.txt``), and other metadata files (e.g. ``params.header``, ``ff_groups.map``) 4. Design matrix optimization ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * ChIMES paramteters are generated via the ``chimes_lsq.py`` utility, which also allows for consideration of an optional weight 5. Performing calculations with ChIMES ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * ChIMES calculations and simulations are performed through the `ChIMES Calculator `_, which can be linked to a variety of simulation software packages --------------- .. _sec-input_files: Input files and options -------------------------------------------- Note: Asterisks (*) indicate options described in greater detail below ================== ============================================= ==================================== Control variables Description Value/Options/Notes ================== ============================================= ==================================== ``TRJFILE`` * Training trajectory file(s) See below for details. ``WRAPTRJ`` ``true``/``false``: Coorindate wrapping Automatically disabled when `ghost atoms `_ (layers) are used. ``SPLITFI`` ``true``/``false``: {A,b}.txt file splitting Should not be used unless DLARS/DLASSO solvers are used. ``NFRAMES`` Number of training frames Any integer > 0. ``NLAYERS`` Number of supercell ghost layers A value of 0 yields the original box. A value of 1 yields a single shell of replicated boxes around the original box (i.e. 27 boxes). ``FITCOUL`` * ``true``/``false``: Fit/use charges See below for details. ``FITSTRS`` * Whether/how to include stresses See below for details. ``FITENER`` * Whether/how to include energies See below for details. ``PAIRTYP`` Chebyshev polynomial orders Expects `` -1 1``. Setting an order to zero is equivalent to excluding the corresponding interaction. For example, ``12 5 0 -1 1`` is equivalent to excluding 4-body interactions. ``CHBTYPE`` * Pair distance transformation type See below for details. ``USENEIG`` Neighbor list/distance convention Auto-select algorithm (``true``) or use small-cell friendly method (``true SMALL``) ``FITPOVR`` ReaxFF linear overcoordination parameters use Boolean to decide if using ReaxFF linear overcoordination parameters. If this is false and parameters are provided below, those parameters will be subtracted from forces. ``SKPFRMS`` Skip frames to train to Decide to process frames in round-robin order with skipping. Integer input. ================== ============================================= ==================================== Note: Asterisks (*) indicate options described in greater detail below ====================== =========================================== ==================================== Topology variables Description Value/Options/Notes ====================== =========================================== ==================================== ``NATMTYP`` Number of atom types Number of unique atom types in trajectory. ``TYPEIDX`` Atom type index Integers, ranging from 1 to ``NATMTYP``. Values form a column below ``# TYPEIDX #``. Make sure the element order in ``fm_setup.in`` matches with the order in the topological file. ``ATM_TYP`` Atom type chemical symbol Any string. Chemical symbol for each unique atom type. Values form a column below ``# ATM_TYP #``. ``ATMCHRG`` Atom type charge If ``FITCOUL`` is false: a partial atomic charge for each atom type. If ``FITCOUL`` is true: a positive or negative sign, to indicate how pair charge signs should be assigned. If ``ATMCHRG`` is set to zero, ChIMES is fit normally. ``ATMMASS`` Atom type mass Floats > 0. ``EXCLUDE`` * Interaction to exclude/ignore Sets of n-body ``ATM_TYP`` s. Rarely used. ``PAIRIDX`` Pair type index Ascending integers from 1 to the number of unique atom pair types. Values form a column below ``# PAIRIDX #``. ``ATM_TYX`` ``ATM_TYP`` of atom X in pair ``PAIRIDX`` Used to define interaction pair types. Order does not matter. ``S_MINIM`` Inner cutoff for pair ``PAIRIDX`` Generally taken as slightly less than smallest distance sampled in DFT-MD trajectory. ``S_MAXIM`` Outer cutoff for pair ``PAIRIDX`` Should be small enough to prevent self-interaction across periodic boundaries. ``S_DELTA`` Ancillary. Value required Must have floating point number in this location. ``MORSE_LAMBDA`` ``CHBTYPE`` variable Morse-type lambda for Morse ``CHBTYP``. Only used if ``# PAIRTYP #`` is ``CHEBYSHEV``. Generally set to location of first radial distribution peak for each pair type. ``USEOVRP`` ReaxFF overbonding term Boolean. If ``FITPOVR`` is true, will look for ``# PAIRTYP #`` with true. Asks user if they would like to use the ReaxFF overbonding term. Additional parameters required if used. ``NXYBINS`` Ancillary Ancillary ``CHGCONS`` * Charge fitting constraints See below for details. ``SPECIAL XB MINIM`` * Special manybody inner cutoffs See below for details. ``SPECIAL XB MAXIM`` * Special manybody outer cutoffs See below for details. ``FCUTTYP`` * Cutoff function style/parameters See below for details. ====================== =========================================== ==================================== Additional details on: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ``TRJFILE`` """""""""""" This keyword provides the name of the simulation trajectory file. Files use a .xyzf format, which is like the standard .xyz format, with two exceptions: (1) the line after that specifying number of atoms will contain information on box dimensions and optionally stress tensor and system energy, and (2) each coordinate line has x, y, and z forces on the corresponding atom appended. Options are: 1. ````: Tell the program to search for a single trajectory file name. 2. ``MULTI ``: Tell the program to expect multiple trajectory files. Here, is the name of a file, e.g. ``traj_list.dat`` structured like: .. code-block:: bash path/to/file-1.xyzf path/to/file-2.xyzf ... path/to/file-n.xyzf Note that, when using the ``MULTI`` option, ``# NFRAMES #`` must be equal to the sum of ```` for each of the ````. As is described in the Output files section below, ``traj_list.dat`` can contain optional decoration which is useful for improving output annotation and/or automating fitting work flows. For example: .. code-block:: bash path/to/file-1.xyzf path/to/file-2.xyzf ... path/to/file-n.xyzf where ````, ````, ````, and ```` represent an optional string ignored by the program, and strings to be prepended to labels in ``b-labeled.txt`` for force, stress, and energy entries, respectively. Note that none of these annotations can contain whitespace, and that annotation order cannot be randomized. Comment lines in trajectory file frames are formatted like: .. code-block:: bash or, if the first word on the line is "NON-ORTHO": .. code-block:: bash NON-ORTHO where :math:`s\_ab` are the :math:`ab` stress tensors and energy is the overall system energy, and :math:`\mathrm{latvec-}i\_a` is the :math:`a^{\rm{th}}` component of the :math:`i^{\rm{th}}` lattice vector. Note that stress tensors and energies are optional and thier inclusion is indicated by ``FITSTRS`` and ``FITENER`` in the ``fm_setup.in`` file. Additional details can be found in the corresponding sections below. ``FITSTRS`` """""""""""" This keyword controls if and how per-configuration stress tensors are included in the fit. Options are: * ``false``: * Do not include any stress tensor information in the fit * ``true``: * Fit to stress tensors for all training trajectory frames, but only expect/consider :math:`xx`, :math:`yy`, and :math:`zz` stress tensor components * ``ALL``: * Fit to stress tensors for all training trajectory frames, and expect/consider all stress tensor components * ``FIRST ``: * Fit to stress tensors for the overall first ```` training trajectory frames, but only expect/consider :math:`xx`, :math:`yy`, and :math:`zz` * ``FIRSTALL ``: * Fit to stress tensors for the overall first ```` training trajectory frames, and expect/consider all stress tensor components ``FITENER`` """""""""""" This keyword controls if and how per-configuration energies are included in the fit. Options are: * ``false``: * Do not include any energy information in the fit * ``true``: * Include/expect an energy for each configuration in the training trajectory(s) * ``FIRST ``: * Include/expect an energy for the overall first ```` configurations in the training trajectory(s) ``FITCOUL`` """""""""""" This keyword defines whether charges should be fit, or held fixed at user-defined values. Note that currently, functionality is only supported when ``FITCOUL`` is true, or when ``FITCOUL`` is false and all charges are zero. If ``FITCOUL`` is false, but charges are non-zero, program will attempt to subtract charge contributions from forces. ``CHBTYPE`` """""""""""" This keyword defined the pair distance transformation method (i.e. for compatibility with the [-1,1] domain over which Chebyshev polynomials are defined). Currently, the only options are ``MORSE`` and ``DEFAULT`` (i.e. "direct"). See reference "Water-1" in :ref:`Citing ChIMES ` for additional details. If ``MORSE`` is selected, meaningful values of ``MORSE_LAMBDA`` should be specified. ``EXCLUDE`` """""""""""" Interactions corresponding to specific atom triplets can be excluded from the fitting process by including the following lines above the ``NATMTYP`` entry in the input file, e.g. to exclude O-O-O (comprised of OO, OO, and OO atom pairs) and C-O-O 3-body interactions (comprised of CO, CO, and OO pairs) from the fit: .. code-block:: bash EXCLUDE 3B INTERACTION: 2 O O O C O O Note: under certain circumstances, users might want to exclude 1- and 2-body interactions, see more details on the `ChIMES Active Learning Driver documentation `_. ``CHGCONS`` """""""""""" If the user desires to fit charges during the force matching process, they can either do so with no constraints (the default option), or by specifying n_atom_pairs -1 constraints. These constraints are added to the end of the input file, before ``# ENDFILE #``. Take water, as an example; here we have 3 pair types, OO, OH, and HH. We want to enforce that the sum of all charges is zero, and that H has half the charge of O, and we can do so by adding the following lines: .. code-block:: bash CHARGE CONSTRAINTS: OO HH OH 1000.0 -4000.0 0.0 0.0 OO HH OH 1000.0 4000.0 4000.0 0.0 The first line can be re-written as the equation: :math:`1000q_\mathrm{O}q_\mathrm{O} - 4000q_\mathrm{H}q_\mathrm{H} = 0`, and enforces the relationship that :math:`|q_\mathrm{O}| = 2|q_\mathrm{H}|`. The second line can be re-written as the equation: :math:`1000q_\mathrm{O}q_\mathrm{O}+ 4000q_\mathrm{H}q_\mathrm{H} + 4000q_\mathrm{O}q_\mathrm{H} = 0`, and enforces that the sign of :math:`q_\mathrm{OH}` needs to be opposite of :math:`q_\mathrm{HH}` and :math:`q_\mathrm{OO}` (i.e. negative), and the relationship :math:`|q_\mathrm{OH}| = 2|q_\mathrm{HH}|`. Note that each line needs entries for all atom pairs. ``SPECIAL XB MINIM`` """""""""""""""""""""""" By default, many-body inner cutoffs are taken to be equivalent to the constituent 2-body inner cutoffs. One has the option of setting all inner cutoffs of a given bodiedness (e.g. 3-body) to an equivalent value by adding the following line to the end of the fm_setup.in file: .. code-block:: bash SPECIAL 3B S_MINIM: ALL 0.0 Otherwise, each cutoff can be specified separately through syntax similar to the following: .. code-block:: bash SPECIAL 3B S_MINIM: SPECIFIC 4 OOOOOO OO OO OO 2.00000 2.00000 2.00000 OOOHOH OO OH OH 2.00000 0.80000 0.80000 HHOHOH OH OH HH 0.80000 0.80000 1.00000 HHHHHH HH HH HH 1.00000 1.00000 1.00000 Where the “4” is the number of cutoffs to be listed. For example, for 3-body interactions, the first column specifies the syntax order of all three pairs. The next three columns specifies the syntax of all three pairs. The last three columns specifies the special cutoff values. Any many-body type for which a line is not provided will use the same ``S_MINIM`` as the 2-body interactions, where constituent pairs determine the cutoff. ``SPECIAL XB MAXIM`` """""""""""""""""""""""" By default, many-body outer cutoffs are taken to be equivalent to the constituent 2-body outer cutoffs. One has the option of setting all outer cutoffs of a given bodiedness (e.g. 3-body) to an equivalent value by adding the following line to the end of the fm_setup,in file: .. code-block:: bash SPECIAL 3B S_MAXIM: ALL 4.0 Otherwise, each cutoff can be specified separately through syntax similar to the following: .. code-block:: bash SPECIAL 3B S_MAXIM: SPECIFIC 4 CCCCCC CC CC CC 4.4 4.4 4.4 COCOCC CC CO CO 4.4 4.0 4.0 OOCOCO CO CO OO 4.0 4.0 6.5 OOOOOO OO OO OO 6.5 6.5 6.5 Where the “4” is the number of cutoffs to be listed. For example, for 3-body interactions, the first column specifies the syntax order of all three pairs. The next three columns specifies the syntax of all three pairs. The last three columns specifies the special cutoff values. Any many-body type for which a line is not provided will use the same ``S_MAXIM`` as the 2-body interactions, where constituent pairs determine the cutoff. Note: ChIMES does not currently support many-body cutoffs to be larger than 2-body cutoffs. In addition, this specification is not compatible with LAMMPS. ``FCUTTYP`` """""""""""" This keyword specifies Chebyshev potential cutoff function types and corresponding parameters. ``FCUTTYP`` should be specified on its own line, under ``# PAIRIDX #`` entries. Currently supported options include ``CUBIC`` and ``TERSOFF ``, where ```` gives the cutoff function kick-in distance as: :math:`r_\mathrm{cut} - <\mathrm{float}>r_\mathrm{cut}`, and should take on values between 0 and 1. For example, ``TERSOFF 0.3`` modified the last ``30%`` of the interaction. ----------------- .. _sec-running: Running -------------------------------------------- Generating the design matrix (``chimes_lsq``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To generate a ChIMES design matrix and related metadata files, an input file, ``fm_setup.in`` must be created, and a training trajectory file(s) must be present at the location(s) specified by the ``TRJFILE`` option in ``fm_setup.in``. With these files in place, the following command should be run: .. code-block:: bash /path/to/repo/build/chimes_lsq fm_setup.in > fm_setup.log which will produce several output files, listed below. Note that if ``SPLITFI`` is set true in ``fm_setup.in``, some files will be output as several ``file..txt`` rather than a single ``file.txt``. ``dim*txt`` files contain additional information on this splitting. **Note that, for historical reasons, energy entries are repeated three times in** ``A.txt``, ``b.txt``, ``b-labeled.txt``, **and** ``natoms.txt`` **files.** Output files """""""""""" ``A.txt`` (The ChIMES design matrix) * For force- or force- and stress-only fits, columns contain each :math:`\frac{\partial X _{a,\mathrm{ChIMES} \{c\}}}{\partial c_{b}}` value for a given coefficient :math:`c` with index :math:`b`. * If energies are included in the fit, the final columns are the number of atoms of each type in corresponding training trajectory frame * Each row corresponds to the :math:`i`, :math:`j`, or :math:`k^{\mathrm{th}}` force component of a given atom followed by the 0 (False), 3 (True), or 9 (ALL) stress components and the 0 (False) or 3 (True) energy values in the corresponding training trajectory frame. For easier mapping see the b-labeled.txt file * Thus, 2-body only, force-only fit with polynomial order 10 for a 25 frame trajectory containing 300 atoms, ``A.txt`` would contain :math:`25 \times 300 \times 3 = 22,500` rows and 10 columns. * If ``SPLITFI`` is true: * Rather than a single ``A.txt`` file, several ``A..txt`` files are produced, which contain a subset of chimes design matrix rows. See ``dim.txt`` below for additional details. ``b.txt`` (DFT forces, and optional stresses and energies) * This file contains the DFT per-atom forces and optionally configuration stresses and energies extracted from the training trajectory * Each row corresponds to the :math:`i`, :math:`j`, or :math:`k^{\mathrm{th}}` force component of a given atom followed by the 0 (False), 3 (True), or 9 (ALL) stress components and the 0 (False) or 3 (True) energy values in the corresponding training trajectory frame. * Thus, 2-body only, force-only fit for a 25 frame trajectory containing 300 atoms, ``b.txt`` would contain :math:`25 \times 300 \times 3 = 22,500` rows, irrespective of model complexity (i.e. bodiedness and polynomial order). * If ``SPLITFI`` is true: * Rather than a single ``b.txt`` file, several ``b..txt`` files are produced, which contain a subset of ``b.txt`` rows. See ``dim.txt`` below for additional details. ``b-labeled.txt`` (DFT forces, and optional stresses and energies, labeled) * This file is a replicate of b.txt which contains additional information on each entry. If the row corresponds to a force, the first and second column entries will provide atom chemical symbol and corresponding DFT force. For energies, rows are label with a "+1", and for stresses, ":math:`s\_ab`", where ab indicates the corresponding tensor component (e.g. xx, yy, zz, etc.). * Additional annotation in the `b-labeled.txt file` can be realized by decoration of the optional ``traj_list.dat`` file, when the ``fm_setup.in`` keyword ``TRJFILE`` uses the ``MULTI`` option. * If ``SPLITFI`` is true: * Rather than a single ``b-labeled.txt`` file, several ``b-labeled..txt`` files are produced, which contain a subset of ``b-labeled.txt`` rows. See ``dim.txt`` below for additional details. ``dim.txt`` (Output file dimensions) * If ``SPLITFI`` is false: * The first value in ``dim.txt`` is the number of columns in ``A.txt``, and the second is the number of rows in ``A.txt``, ``b.txt``, ``b-labeled.txt``, and ``natoms.txt``. * If ``SPLITFI`` is true: * ``A.txt``, ``b.txt``, ``b-labeled.txt``, and ``natoms.txt`` are split into sub-files containing approximately the same number of lines, named like ``file..txt``, e.g. ``A.0001.txt``. Corresponding ``dim..txt`` files are produced, where the first value is the number of columns in the ``A..txt`` (which is the same in all files), and the last value is the total number of rows that would be present in each of ``A.txt``, ``b.txt``, ``b-labeled.txt``, and ``natoms.txt`` if they were output in thier unsplit form. Finally, the second and third columns in ``file..txt`` indicate the first and last row of the unsplit file contained in ``file..txt``. ``natoms.txt`` (Number of atoms in training trajectory) * If ``SPLITFI`` is false: * For each row in ``A.txt``, ``b.txt``, and ``b-labeled.txt``, provides the number of atoms in the training trajectory frame from which information in the row originated. Useful for weighting or evaluating per-atom quantities. * If ``SPLITFI`` is true: * For each row in ``A..txt``, ``b..txt``, and ``b-labeled..txt``, provides the number of atoms in the training trajectory frame from which information in the row originated. Useful for weighting or evaluating per-atom quantities. ``params.header`` (metadata) * Contains metadata (e.g. hyperparameters, headers, etc.) from ``fm_setup.in``. ``ff_groups.map`` (metadata) * Contains metadata (e.g. mapping between many-body atom types and thier corresponding integer type). Tips and tricks """""""""""""""""" * Inner cutoffs are typically set to the lowest sampled distance for each given pair type in the training trajectory * For molecular systems: * 2-body outer cutoffs are usually set to encompass at least the second non-bonded solvation shell (determined by examination of the DFT-MD generated radial pair distribution functions, RDFs), or around 8 Å * 3-body outer cutoffs are usually set to encompass the first non-bonded solvation shell * 4-body outer cutoffs are usually set to between the first or second RDF minimum; choice depends on the nature of the system (i.e. do we need to describe extended 4-body interactions like molecular torsions, or more short-ranged interactions like those found in small molecules? * The "Morse parameter" is usually set to the distance corresponding to the the first (bonding) peak in the in the pair RDF * There is no hard-and-fast rule for setting the 2/3/4-body polynomial orders, but 12/7/3 is generally a good starting point * Weighting is often needed when energies and stresses tensors are included in the fit. Typical respective weighting factors are 0.1 – 5.0 for energy and 200 – 500 for stress * If a Tersoff-style cutoff function is used, a Tersoff variable of between 0.25 and 0.75 is reasonable, where larger values modify the interaction at shorter distances, but can make it easier to obtain a smooth interaction * Be mindful of outer cutoff distances with respect to box lengths. Training data are typically generated with small (i.e. DFT-friendly) cells * If you don’t want to determine all pair/triplet/quadruplet interaction types and inner cutoffs for ``fm_setup.in`` by hand, setup a dummy ``fm_setup.in`` file with: * ``# PAIRTYP#`` set to ``CHEBYSHEV 2 2 2 -1 1``, * 2-body inner cutoffs = 0.0 * 2-body outer cutoffs = something tiny (i.e. 0.5) * Do not explicitly set 3- or 4-body outer cutoffs (the code will use the 2-body value for them) * Run the ``chimes_lsq`` code using this file * Check the bottom section of the output, which lists all this information for you ----------------- .. _sec-solving: Solving for ChIMES parameters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Once a ChIMES design matrix has been generated, parameters can be obtained by running ``chimes_lsq.py``. Note that this script depends on native `numpy `_ and `scipy `_, installations for `python3.x `_, and has numerous features, detailed below. If ``chimes_lsq`` was run with ``SPLITFI`` set to ``false`` and no weighting is desired, one can solve for ChIMES parameters using principal component analysis based on the singular value decomposition of the design matrix ("SVD") with default regularization (1.0e-05) via: .. code-block:: bash python3.x /path/to/repo/build/chimes_lsq.py > params.txt however we note that ``chimes_lsq`` supports many other solvers, summarized below: Supported solvers """"""""""""""""""""""""""""""" Note: Asterisks (*) indicate options described in greater detail below ========== ============================================== ======================================================= Method Additional dependencies Related ``chimes_lsq.py`` flags ========== ============================================== ======================================================= SVD None ``--eps`` ``--weights`` Fast SVD None ``--eps`` ``--weights`` LASSO `sklearn `_ ``--alpha`` ``--normalize`` ``--weights`` LASSOLARS `sklearn `_ ``--alpha`` ``--normalize`` ``--weights`` DLARS* DLARS ``--alpha`` ``--normalize`` ``--weights`` (``--dlasso_dlars_path``) ``--nodes`` ``--cores`` ``--read_output`` ``--restart`` ``--split_files`` DLASSO* DLARS ``--alpha`` ``--normalize`` ``--weights`` (``--dlasso_dlars_path``) ``--nodes`` ``--cores`` ``--read_output`` ``--restart`` ``--split_files`` ========== ============================================== ======================================================= Using ``DLARS`` and ``DLASSO`` """"""""""""""""""""""""""""""" DLARS and DLASSO provided a capability for distributed design matrix solution. This feature is particularly useful for large training sets or high complexity fits, but is best reserved for runs on high-performance computers. Options and flags """"""""""""""""""""""""""""""" ========================== =========== =============== ===================== Flag Option type Default value Description ========================== =========== =============== ===================== ``--A`` str A.txt Design matrix ``--algorithm`` str svd Fitting algorithm (Supported options: svd, lasso, lassolars, dlars, and dlasso) ``--dlasso_dlars_path`` str N/A Path to DLARS/DLASSO solver ``--alpha`` float 1.0e-04 Lasso or ridge regularization ``--b`` str b.txt Reference force file ``--cores`` int 8 DLARS/DLASSO *total* number of cores (i.e., nodes * cores-per-node) ``--eps`` float 1.0e-05 SVD regularization ``--header`` str params.header Parameter file header ``--map`` str ff_groups.map Parameter file map ``--nodes`` int 1 DLARS/DLASSO number of nodes ``--normalize`` bool False Normalize DLARS/DLASSO calculation ``--read_output`` bool False Read output from previous DLARS run ``--restart_dlasso_dlars`` str N/A Determines whether dlasso or dlars job will be restarted. Argument is the restart file name ``--split_files`` bool False LSQ code has split A matrix output (DLARS/DLASSO) ``--test_suite`` bool False Output for test suite ``--weights`` str N/A Weight file ``--active`` bool False Is this a DLARS/DLASSO run from the active learning driver? ========================== =========== =============== ===================== Choosing solvers """"""""""""""""""""""""""""""" All solvers support regularization (i.e. via ``--eps`` or ``--alpha``), which can aid overfitting mitigation. For LASSO, LASSOLARS, DLARS, and DLASSO solvers, regularization has the added benefit of automatically setting minimally-informative parameters to zero, helping minimize model size and increase efficiency. Note that these zeroed parameters can be scrubbed from the parameter file with ``post_proc_chimes_lsq.py``, e.g.: .. code-block:: bash python3.x /path/to/repo/build/post_proc_chimes_lsq.py which produces a new parameter file named ``.reduced``. Doing so speeds up the sum over Chebyshev coefficients increasing computational efficiency. Weighting """"""""""""""""""""""""""""""" Weighting is often needed when energies and stresses tensors are included in the fit. Typical respective weighting factors are 0.1 – 5.0 for energy and 200 – 500 for stress. Weights can be rapidly generated through: .. code-block:: bash wF=1.0 wE=0.1 wS=100.0 paste b-labeled.txt | awk -v wF="$wF" -v wE="$wE" -v wS="$wS" '{if($1=="+1"){print wE}else if($1~"s_"){print wS}else{print wF}}' > weights.txt Anatomy of ChIMES parameter file """""""""""""""""""""""""""""""" ================================== ============================================================================== Variables Descriptions ================================== ============================================================================== ``Number of variables`` The maximum number of coefficients ``Number of equations`` The number of equations to solve ``eps`` The threshold for dropping small singular values in SVD ``SVD regularization factor`` The regularization parameter for SVD algorithm ``alpha`` The regularization parameter for LASSO regularization ``RMS Force error`` The global root mean square error between training data and ChIMES predictions ``max abs variable`` The absolute value of the largest coefficient ``number of fitting vars`` The number of variables after regularization ``Bayesian Information Criterion`` The model selection criterion ``Using weighting file`` The file containing weights for fitting ``USECOUL`` If the model accounts for system charges ``FITCOUL`` If the model is fit to system charges ``USE3BCH`` If the model is fit to 3-body interactions ``USE4BCH`` If the model is fit to 4-body interactions ``EXCLD1B`` If the model excludes 1-body energy ``EXCLD2B`` If the model excludes 2-body interactions ``PAIRTYP`` 2-, 3-, and 4-body Chebyshev polynomial orders ``ATOM TYPES`` The number of atom types ``TYPEIDX`` The index of atom type ``ATM_TYP`` The label of atom type ``ATMCHRG`` The charge of atom type ``ATMMASS`` The mass of atom type ``ATOM PAIRS`` The number of pair types ``PAIRIDX`` The index of the pair type ``ATM_TY1`` The label of atom 1 in the pair ``ATM_TY2`` The label of atom 2 in the pair ``S_MINIM`` The inner cutoff for the pair ``S_MAXIM`` The outer cutoff for the pair ``CHBDIST`` The transformation style for pair distance ``MORSE_LAMBDA`` The parameter used in Morse transformation ``FCUT TYPE`` The type of cutoff function used ``SPECIAL 3B S_MAXIM`` Special outer cutoff for 3-body interactions ``ATOM PAIR TRIPLETS`` The number of triplet types ``ATOM PAIR QUARDRUPLETS`` The number of quadruplet types ``PAIRTYPE PARAMS`` The pair type index and atom types ``TRIPLETTYPE PARAMS`` The triplet type index and atom types; see code block for details ``PAIRMAPS`` The permutations of pair types ``TRIPMAPS`` The permutations of triplet types ``NO ENERGY OFFSETS`` The number of atoms types ``ENERGY OFFSET X`` The energy offset of atom type X ================================== ============================================================================== Here we provide an example of the triplet parameters. .. code-block:: bash :linenos: TRIPLET CHEBYSHEV PARAMS TRIPLETTYPE PARAMS: INDEX: 0 ATOMS: Si Si Si PAIRS: SiSi SiSi SiSi UNIQUE: 665 TOTAL: 3332 index | powers | equiv index | param index | parameter ---------------------------------------------------------------------------- 0 0 1 1 0 0 2.1714984512050e+02 1 1 0 1 0 0 2.1714984512050e+02 2 1 1 0 0 0 2.1714984512050e+02 3 0 1 2 3 1 6.2564844297350e+01 4 1 0 2 3 1 6.2564844297350e+01 5 0 2 1 3 1 6.2564844297350e+01 6 2 0 1 3 1 6.2564844297350e+01 7 1 2 0 3 1 6.2564844297350e+01 8 2 1 0 3 1 6.2564844297350e+01 9 0 1 3 9 2 3.4634223883030e+01 10 1 0 3 9 2 3.4634223883030e+01 * Line 3 specifies the index of the triplet and the atoms in the triplet. Line 4 specifies all pairs in the triplet, lists the number of unique triplet power index and the total number of triple power index. * Line 4 specifies the pair combinations used in the triplet block * Line 5: - index: Specifies the index of the coefficient in the triplet sum - powers: The powers of the three Chebyshev polynomials - equiv index: The index of the first instance of the permutationally invariant coefficient - param index: The unique coefficient index based of equiv index - parameter: The value of the coefficients Tips and tricks """"""""""""""""""""""""""""""" * Weighting and regularization can all impact quality of resulting models, and their values should be considered model hyperparameters and carefully explored for each fitting problem. * For LARS or LASSO-based solvers, regularization of 1.0e-2 or 1.0e-5 are reasonable starting points for un-normalized and normalized fits, respectively * Running ``chimes_lsq.py`` can be memory intensive. Check the size of A.txt relative to the available memory (RAM) on the workhorse machine prior to running. Checking fit quality """"""""""""""""""""""""""""""" Though it is critical to evaluate model performance on the basis of physical predictions, it can be helpful to evaluate model performance through target force, energy, or stress recovery. This can be easily obtained through: .. code-block:: bash paste b-labeled.txt force.txt | awk '{if($1=="+1") {print ($2,$3)}}' > compare-energies.txt paste b-labeled.txt force.txt | awk '{if($1~ "s_") {print ($2,$3)}}' > compare-stresses.txt paste b-labeled.txt force.txt | awk '{if(($1!~"s_")&&($1!="+1")){print ($2,$3)}}' > compare-forces.txt xmgrace compare*txt If optional ````, ````, and ```` flags in ``traj_list.dat`` yield more information in ``b-labeled.txt`` from which selections can be parsed, as can consideration of ``natoms.txt``. ----------------- .. _sec-issues: Common Issues -------------------------------------------- **My model is requiring SO many iterations to become stable** * Have you set reasonable inner cutoffs (i.e. 0.02 to 0.002 less than the smallest observed distance, set individually for each pair type)? * Have you remembered to set the penalty parameters (i.e. :math:`A_{\mathrm{p}}` from 1.0e5 to 1.0e6, :math:`d_{\mathrm{p}}` from 0.01 to 0.05)? * Have you tried using a smaller timestep during the early training stages? (e.g., if DFT simulations for a C/H/O/N system used a 0.5 fs timestep)? * ChIMES training simulations should use something closer to 0.1 – 0.2 * Small timesteps allow the ChIMES simulations to remain stable for longer, which can help make iteratively-obtained training data more informative **My model is yielding very bad RDFs/is missing significant RDF features** * Have you accidentally grabbed only sequential configurations? * Have you watched your DFT trajectory to make sure the system exhibits significant evolution? * For example physical properties of a melted material will generally not be recovered from training to only solid phase data * Have you checked that your actual training trajectory contains those features? For example, you can quickly check by computing your RDFs directly from the training trajectory and checking you peaks in all the expected positions. **My model just isn’t as good as expected** * Are you using an appropriate bodiedness/polynomial order? * Holdout cross-validation on the original training set can give you an idea of what to use. * Are your cutoffs appropriate? * Cutoffs cannot be arbitrarily short (they should encapsulate the relevant physics) *or* long (they should preclude self-interaction across the training periodic boundary conditions) .. _sec-caveats: Caveats -------------------------------------------- * ChIMES models are not inherently transferable - training set generation should be carefully considered prior to model generation * High-complexity (e.g. bodiedness and polynomial order) models for complex molecular materials (e.g. C-containing) will not generally be stable for dynamics the first time through - in these cases, iterative model fitting is usually required. See `link `_ for additional details. * ChIMES will attempt to generate single-atom energies - these values *cannot* be accurately determined unless the training set contains frames with varied atomic composition, otherwise, ChIMES assigns one atom the sum of single-atom energies for each atom type.