Generating a ChIMES model

This section assumes the goal of obtaining an atomistic ChIMES model from Kohn-Sham density functional theory (DFT) molecular dynamics (MD) generated reference data. However, ChIMES models can be fit to a broad variety of reference calculations. For additional information see Citing ChIMES.

At its simplest, ChIMES models are generated by per-atom force matching to DFT reference data, and the ChIMES Parameter Generator, chimes_lsq, aims to minimize the following objective function:

\begin{equation} F_{\mathrm{obj}} = \frac{1}{n_\mathrm{f}3n_{\mathrm{a}}} \sum^{n_\mathrm{f}}_{i=1} \left[ \sum^{n_\mathrm{a}}_{j=1} \sum^{3}_{k=1} w^2_{\mathrm{F}_{ijk}}(\Delta F_{ijk})^2 \right], \end{equation}

where \(\Delta F = F^{\mathrm{DFT}} - F^{\mathrm{ChIMES\{c\}}}\). \(F_{\mathrm{obj}}\), and \(\{c\}\) are the weighted root-mean-squared error (RMSE) and model coefficients, respectively, and the number of frames and atoms are given by \(n_\mathrm{f}\) and \(n_\mathrm{a}\), respectively. \(F_{ijk}\) indicates the \(k^{\mathrm{th}}\) Cartesian component of the force on atom \(j\) in configuration \(i\). The superscripts “ChIMES” and “DFT” indicate forces/energies predicted from the force-matched model and the DFT-MD training trajectory, respectively.

Because ChIMES is entirely linear it its fitted parameters, the above optimization problem can be recast as the following over-determined matrix equation:

\begin{equation} \mathrm{\mathbf{wMc}} = \mathrm{\mathbf{wb}_{DFT}}, \end{equation}

where \(\mathrm{\mathbf{b}_{DFT}}\) is the vector of DFT-generated per-atom force components, \(F^{\mathrm{DFT}}_{ijk}\), \(\mathrm{\mathbf{w}}\) is a diagonal matrix of weights to be applied to the elements of \(\mathrm{\mathbf{b}_{DFT}}\) and rows of \(\mathrm{\mathbf{M}}\), \(\mathrm{\mathbf{c}}\) is the diagonal matrix of generated model parameters, and \(\mathrm{\mathbf{M}}\) is the overdetermined ChIMES design matrix for which elements of \(\mathrm{\mathbf{M}}\) are given by:

\begin{equation} M _{a,b} = \frac{\partial X _{a,\mathrm{ChIMES} \{c\}}}{\partial c_{b}} \end{equation}

Please note that the matrix \(\mathrm{\mathbf{M}}\) is referred to as matrix \(\mathrm{\mathbf{A}}\) in the software and reported as A.txt.

As discussed in further detail below, we note that per-configuration energies and stress tensors can be considered in ChIMES fits, in additional to per-atom forces.

Additional details are provided below in the following sections:


Essential workflow

The basic ChIMES model development workflow comprises 5 main steps, briefly outlined below. Note:

1. Reference data preparation

  • DFT-MD simulations are run for the target material over the target thermodynamic condition(s)

  • Per-atom forces and optionally the system energy and stress tensor are extracted from the trajectory and saved in a “training trajectory” file, in a chimes_lsq compatible format

2. Hyperparameter selection

  • ChIMES model hyperparameters (e.g. bodiedness, polynomial orders, cutoffs, transformation function parameters, etc.) are specified in the chimes_lsq input file (e.g., typically named fm_setup.in), along with the training trajectory file

3. Design matrix generation

  • chimes_lsq is executed to produce the design matrix (A.txt), the vector of forces (and optionally energies and stresses) from DFT (b.txt), and other metadata files (e.g. params.header, ff_groups.map)

4. Design matrix optimization

  • ChIMES paramteters are generated via the chimes_lsq.py utility, which also allows for consideration of an optional weight

5. Performing calculations with ChIMES

  • ChIMES calculations and simulations are performed through the ChIMES Calculator, which can be linked to a variety of simulation software packages


Input files and options

Note: Asterisks (*) indicate options described in greater detail below

Control variables

Description

Value/Options/Notes

TRJFILE *

Training trajectory file(s)

See below for details.

WRAPTRJ

true/false: Coorindate wrapping

Automatically disabled when ghost atoms (layers) are used.

SPLITFI

true/false: {A,b}.txt file splitting

Should not be used unless DLARS/DLASSO solvers are used.

NFRAMES

Number of training frames

Any integer > 0.

NLAYERS

Number of supercell ghost layers

A value of 0 yields the original box. A value of 1 yields a single shell of replicated boxes around the original box (i.e. 27 boxes).

FITCOUL *

true/false: Fit/use charges

See below for details.

FITSTRS *

Whether/how to include stresses

See below for details.

FITENER *

Whether/how to include energies

See below for details.

PAIRTYP

Chebyshev polynomial orders

Expects <O2B> <O3B+1> <O4B+1> -1 1. Setting an order to zero is equivalent to excluding the corresponding interaction. For example, 12 5 0 -1 1 is equivalent to excluding 4-body interactions.

CHBTYPE *

Pair distance transformation type

See below for details.

USENEIG

Neighbor list/distance convention

Auto-select algorithm (true) or use small-cell friendly method (true SMALL)

FITPOVR

ReaxFF linear overcoordination parameters use

Boolean to decide if using ReaxFF linear overcoordination parameters. If this is false and parameters are provided below, those parameters will be subtracted from forces.

SKPFRMS

Skip frames to train to

Decide to process frames in round-robin order with skipping. Integer input.

Note: Asterisks (*) indicate options described in greater detail below

Topology variables

Description

Value/Options/Notes

NATMTYP

Number of atom types

Number of unique atom types in trajectory.

TYPEIDX

Atom type index

Integers, ranging from 1 to NATMTYP. Values form a column below # TYPEIDX #. Make sure the element order in fm_setup.in matches with the order in the topological file.

ATM_TYP

Atom type chemical symbol

Any string. Chemical symbol for each unique atom type. Values form a column below # ATM_TYP #.

ATMCHRG

Atom type charge

If FITCOUL is false: a partial atomic charge for each atom type. If FITCOUL is true: a positive or negative sign, to indicate how pair charge signs should be assigned. If ATMCHRG is set to zero, ChIMES is fit normally.

ATMMASS

Atom type mass

Floats > 0.

EXCLUDE *

Interaction to exclude/ignore

Sets of n-body ATM_TYP s. Rarely used.

PAIRIDX

Pair type index

Ascending integers from 1 to the number of unique atom pair types. Values form a column below # PAIRIDX #.

ATM_TYX

ATM_TYP of atom X in pair PAIRIDX

Used to define interaction pair types. Order does not matter.

S_MINIM

Inner cutoff for pair PAIRIDX

Generally taken as slightly less than smallest distance sampled in DFT-MD trajectory.

S_MAXIM

Outer cutoff for pair PAIRIDX

Should be small enough to prevent self-interaction across periodic boundaries.

S_DELTA

Ancillary. Value required

Must have floating point number in this location.

MORSE_LAMBDA

CHBTYPE variable

Morse-type lambda for Morse CHBTYP. Only used if # PAIRTYP # is CHEBYSHEV. Generally set to location of first radial distribution peak for each pair type.

USEOVRP

ReaxFF overbonding term

Boolean. If FITPOVR is true, will look for # PAIRTYP # with true. Asks user if they would like to use the ReaxFF overbonding term. Additional parameters required if used.

NXYBINS

Ancillary

Ancillary

CHGCONS *

Charge fitting constraints

See below for details.

SPECIAL XB MINIM *

Special manybody inner cutoffs

See below for details.

SPECIAL XB MAXIM *

Special manybody outer cutoffs

See below for details.

FCUTTYP *

Cutoff function style/parameters

See below for details.

Additional details on:

TRJFILE

This keyword provides the name of the simulation trajectory file. Files use a .xyzf format, which is like the standard .xyz format, with two exceptions: (1) the line after that specifying number of atoms will contain information on box dimensions and optionally stress tensor and system energy, and (2) each coordinate line has x, y, and z forces on the corresponding atom appended. Options are: 1. <any string>: Tell the program to search for a single trajectory file name. 2. MULTI <any string>: Tell the program to expect multiple trajectory files. Here, <any string> is the name of a file, e.g. traj_list.dat structured like:

<nfiles>
<frames to read> path/to/file-1.xyzf
<frames to read> path/to/file-2.xyzf
...
<frames to read> path/to/file-n.xyzf

Note that, when using the MULTI option, # NFRAMES # must be equal to the sum of <frames to read> for each of the <nfiles>.

As is described in the Output files section below, traj_list.dat can contain optional decoration which is useful for improving output annotation and/or automating fitting work flows. For example:

<nfiles>
<frames to read> path/to/file-1.xyzf  <F_flag> <S_flag> <E_flag>
<frames to read> path/to/file-2.xyzf  <string> <F_flag> <S_flag> <E_flag>
...
<frames to read> path/to/file-n.xyzf  <string>

where <string>, <F_flag>, <S_flag>, and <E_flag> represent an optional string ignored by the program, and strings to be prepended to labels in b-labeled.txt for force, stress, and energy entries, respectively. Note that none of these annotations can contain whitespace, and that annotation order cannot be randomized.

Comment lines in trajectory file frames are formatted like:

<box x-len> <box y-len> <box z-len> <s_xx> <s_yy> <s_zz> <s_xy> <s_xz> <s_yz> <energy>

or, if the first word on the line is “NON-ORTHO”:

NON-ORTHO <latvec-1_x> <latvec-1_y> <latvec-1_z> <latvec-2_x> <latvec-2_y> <latvec-2_z> <latvec-3_x> <latvec-3_y> <latvec-3_z> <s_xx> <s_yy> <s_zz> <s_xy> <s_xz> <s_yz> <energy>

where \(s\_ab\) are the \(ab\) stress tensors and energy is the overall system energy, and \(\mathrm{latvec-}i\_a\) is the \(a^{\rm{th}}\) component of the \(i^{\rm{th}}\) lattice vector. Note that stress tensors and energies are optional and thier inclusion is indicated by FITSTRS and FITENER in the fm_setup.in file. Additional details can be found in the corresponding sections below.

FITSTRS

This keyword controls if and how per-configuration stress tensors are included in the fit. Options are:

  • false:
    • Do not include any stress tensor information in the fit

  • true:
    • Fit to stress tensors for all training trajectory frames, but only expect/consider \(xx\), \(yy\), and \(zz\) stress tensor components

  • ALL:
    • Fit to stress tensors for all training trajectory frames, and expect/consider all stress tensor components

  • FIRST <integer>:
    • Fit to stress tensors for the overall first <integer> training trajectory frames, but only expect/consider \(xx\), \(yy\), and \(zz\)

  • FIRSTALL <integer>:
    • Fit to stress tensors for the overall first <integer> training trajectory frames, and expect/consider all stress tensor components

FITENER

This keyword controls if and how per-configuration energies are included in the fit. Options are:

  • false:
    • Do not include any energy information in the fit

  • true:
    • Include/expect an energy for each configuration in the training trajectory(s)

  • FIRST <integer>:
    • Include/expect an energy for the overall first <integer> configurations in the training trajectory(s)

FITCOUL

This keyword defines whether charges should be fit, or held fixed at user-defined values. Note that currently, functionality is only supported when FITCOUL is true, or when FITCOUL is false and all charges are zero. If FITCOUL is false, but charges are non-zero, program will attempt to subtract charge contributions from forces.

CHBTYPE

This keyword defined the pair distance transformation method (i.e. for compatibility with the [-1,1] domain over which Chebyshev polynomials are defined). Currently, the only options are MORSE and DEFAULT (i.e. “direct”). See reference “Water-1” in Citing ChIMES for additional details. If MORSE is selected, meaningful values of MORSE_LAMBDA should be specified.

EXCLUDE

Interactions corresponding to specific atom triplets can be excluded from the fitting process by including the following lines above the NATMTYP entry in the input file, e.g. to exclude O-O-O (comprised of OO, OO, and OO atom pairs) and C-O-O 3-body interactions (comprised of CO, CO, and OO pairs) from the fit:

EXCLUDE 3B INTERACTION: 2
O O O
C O O

Note: under certain circumstances, users might want to exclude 1- and 2-body interactions, see more details on the ChIMES Active Learning Driver documentation.

CHGCONS

If the user desires to fit charges during the force matching process, they can either do so with no constraints (the default option), or by specifying n_atom_pairs -1 constraints. These constraints are added to the end of the input file, before # ENDFILE #. Take water, as an example; here we have 3 pair types, OO, OH, and HH. We want to enforce that the sum of all charges is zero, and that H has half the charge of O, and we can do so by adding the following lines:

CHARGE CONSTRAINTS:
OO  HH  OH  1000.0  -4000.0  0.0     0.0
OO  HH  OH  1000.0   4000.0  4000.0  0.0

The first line can be re-written as the equation: \(1000q_\mathrm{O}q_\mathrm{O} - 4000q_\mathrm{H}q_\mathrm{H} = 0\), and enforces the relationship that \(|q_\mathrm{O}| = 2|q_\mathrm{H}|\). The second line can be re-written as the equation: \(1000q_\mathrm{O}q_\mathrm{O}+ 4000q_\mathrm{H}q_\mathrm{H} + 4000q_\mathrm{O}q_\mathrm{H} = 0\), and enforces that the sign of \(q_\mathrm{OH}\) needs to be opposite of \(q_\mathrm{HH}\) and \(q_\mathrm{OO}\) (i.e. negative), and the relationship \(|q_\mathrm{OH}| = 2|q_\mathrm{HH}|\). Note that each line needs entries for all atom pairs.

SPECIAL XB MINIM

By default, many-body inner cutoffs are taken to be equivalent to the constituent 2-body inner cutoffs. One has the option of setting all inner cutoffs of a given bodiedness (e.g. 3-body) to an equivalent value by adding the following line to the end of the fm_setup.in file:

SPECIAL 3B S_MINIM: ALL 0.0

Otherwise, each cutoff can be specified separately through syntax similar to the following:

SPECIAL 3B S_MINIM: SPECIFIC 4
OOOOOO OO OO OO 2.00000 2.00000 2.00000
OOOHOH OO OH OH 2.00000 0.80000 0.80000
HHOHOH OH OH HH 0.80000 0.80000 1.00000
HHHHHH HH HH HH 1.00000 1.00000 1.00000

Where the “4” is the number of cutoffs to be listed. For example, for 3-body interactions, the first column specifies the syntax order of all three pairs. The next three columns specifies the syntax of all three pairs. The last three columns specifies the special cutoff values. Any many-body type for which a line is not provided will use the same S_MINIM as the 2-body interactions, where constituent pairs determine the cutoff.

SPECIAL XB MAXIM

By default, many-body outer cutoffs are taken to be equivalent to the constituent 2-body outer cutoffs. One has the option of setting all outer cutoffs of a given bodiedness (e.g. 3-body) to an equivalent value by adding the following line to the end of the fm_setup,in file:

SPECIAL 3B S_MAXIM: ALL 4.0

Otherwise, each cutoff can be specified separately through syntax similar to the following:

SPECIAL 3B S_MAXIM: SPECIFIC 4
CCCCCC CC CC CC 4.4 4.4 4.4
COCOCC CC CO CO 4.4 4.0 4.0
OOCOCO CO CO OO 4.0 4.0 6.5
OOOOOO OO OO OO 6.5 6.5 6.5

Where the “4” is the number of cutoffs to be listed. For example, for 3-body interactions, the first column specifies the syntax order of all three pairs. The next three columns specifies the syntax of all three pairs. The last three columns specifies the special cutoff values. Any many-body type for which a line is not provided will use the same S_MAXIM as the 2-body interactions, where constituent pairs determine the cutoff.

Note: ChIMES does not currently support many-body cutoffs to be larger than 2-body cutoffs. In addition, this specification is not compatible with LAMMPS.

FCUTTYP

This keyword specifies Chebyshev potential cutoff function types and corresponding parameters. FCUTTYP should be specified on its own line, under # PAIRIDX # entries. Currently supported options include CUBIC and TERSOFF <float>, where <float> gives the cutoff function kick-in distance as: \(r_\mathrm{cut} - <\mathrm{float}>r_\mathrm{cut}\), and should take on values between 0 and 1. For example, TERSOFF 0.3 modified the last 30% of the interaction.


Running

Generating the design matrix (chimes_lsq)

To generate a ChIMES design matrix and related metadata files, an input file, fm_setup.in must be created, and a training trajectory file(s) must be present at the location(s) specified by the TRJFILE option in fm_setup.in. With these files in place, the following command should be run:

/path/to/repo/build/chimes_lsq fm_setup.in > fm_setup.log

which will produce several output files, listed below. Note that if SPLITFI is set true in fm_setup.in, some files will be output as several file.<zero-padded-number>.txt rather than a single file.txt. dim*txt files contain additional information on this splitting.

Note that, for historical reasons, energy entries are repeated three times in A.txt, b.txt, b-labeled.txt, and natoms.txt files.

Output files

A.txt (The ChIMES design matrix)

  • For force- or force- and stress-only fits, columns contain each \(\frac{\partial X _{a,\mathrm{ChIMES} \{c\}}}{\partial c_{b}}\) value for a given coefficient \(c\) with index \(b\).

    • If energies are included in the fit, the final columns are the number of atoms of each type in corresponding training trajectory frame

  • Each row corresponds to the \(i\), \(j\), or \(k^{\mathrm{th}}\) force component of a given atom followed by the 0 (False), 3 (True), or 9 (ALL) stress components and the 0 (False) or 3 (True) energy values in the corresponding training trajectory frame. For easier mapping see the b-labeled.txt file

    • Thus, 2-body only, force-only fit with polynomial order 10 for a 25 frame trajectory containing 300 atoms, A.txt would contain \(25 \times 300 \times 3 = 22,500\) rows and 10 columns.

  • If SPLITFI is true:

    • Rather than a single A.txt file, several A.<zero-padded-number>.txt files are produced, which contain a subset of chimes design matrix rows. See dim.txt below for additional details.

b.txt (DFT forces, and optional stresses and energies)

  • This file contains the DFT per-atom forces and optionally configuration stresses and energies extracted from the training trajectory

  • Each row corresponds to the \(i\), \(j\), or \(k^{\mathrm{th}}\) force component of a given atom followed by the 0 (False), 3 (True), or 9 (ALL) stress components and the 0 (False) or 3 (True) energy values in the corresponding training trajectory frame.

    • Thus, 2-body only, force-only fit for a 25 frame trajectory containing 300 atoms, b.txt would contain \(25 \times 300 \times 3 = 22,500\) rows, irrespective of model complexity (i.e. bodiedness and polynomial order).

  • If SPLITFI is true:

    • Rather than a single b.txt file, several b.<zero-padded-number>.txt files are produced, which contain a subset of b.txt rows. See dim.txt below for additional details.

b-labeled.txt (DFT forces, and optional stresses and energies, labeled)

  • This file is a replicate of b.txt which contains additional information on each entry. If the row corresponds to a force, the first and second column entries will provide atom chemical symbol and corresponding DFT force. For energies, rows are label with a “+1”, and for stresses, “\(s\_ab\)”, where ab indicates the corresponding tensor component (e.g. xx, yy, zz, etc.).

  • Additional annotation in the b-labeled.txt file can be realized by decoration of the optional traj_list.dat file, when the fm_setup.in keyword TRJFILE uses the MULTI option.

  • If SPLITFI is true:

    • Rather than a single b-labeled.txt file, several b-labeled.<zero-padded-number>.txt files are produced, which contain a subset of b-labeled.txt rows. See dim.txt below for additional details.

dim.txt (Output file dimensions)

  • If SPLITFI is false:

    • The first value in dim.txt is the number of columns in A.txt, and the second is the number of rows in A.txt, b.txt, b-labeled.txt, and natoms.txt.

  • If SPLITFI is true:

    • A.txt, b.txt, b-labeled.txt, and natoms.txt are split into sub-files containing approximately the same number of lines, named like file.<zero-padded-number>.txt, e.g. A.0001.txt. Corresponding dim.<zero-padded-number>.txt files are produced, where the first value is the number of columns in the A.<zero-padded-number>.txt (which is the same in all files), and the last value is the total number of rows that would be present in each of A.txt, b.txt, b-labeled.txt, and natoms.txt if they were output in thier unsplit form. Finally, the second and third columns in file.<zero-padded-number>.txt indicate the first and last row of the unsplit file contained in file.<zero-padded-number>.txt.

natoms.txt (Number of atoms in training trajectory)

  • If SPLITFI is false:

    • For each row in A.txt, b.txt, and b-labeled.txt, provides the number of atoms in the training trajectory frame from which information in the row originated. Useful for weighting or evaluating per-atom quantities.

  • If SPLITFI is true:

    • For each row in A.<zero-padded-number>.txt, b.<zero-padded-number>.txt, and b-labeled.<zero-padded-number>.txt, provides the number of atoms in the training trajectory frame from which information in the row originated. Useful for weighting or evaluating per-atom quantities.

params.header (metadata)

  • Contains metadata (e.g. hyperparameters, headers, etc.) from fm_setup.in.

ff_groups.map (metadata)

  • Contains metadata (e.g. mapping between many-body atom types and thier corresponding integer type).

Tips and tricks

  • Inner cutoffs are typically set to the lowest sampled distance for each given pair type in the training trajectory

  • For molecular systems:

    • 2-body outer cutoffs are usually set to encompass at least the second non-bonded solvation shell (determined by examination of the DFT-MD generated radial pair distribution functions, RDFs), or around 8 Å

    • 3-body outer cutoffs are usually set to encompass the first non-bonded solvation shell

    • 4-body outer cutoffs are usually set to between the first or second RDF minimum; choice depends on the nature of the system (i.e. do we need to describe extended 4-body interactions like molecular torsions, or more short-ranged interactions like those found in small molecules?

  • The “Morse parameter” is usually set to the distance corresponding to the the first (bonding) peak in the in the pair RDF

  • There is no hard-and-fast rule for setting the 2/3/4-body polynomial orders, but 12/7/3 is generally a good starting point

  • Weighting is often needed when energies and stresses tensors are included in the fit. Typical respective weighting factors are 0.1 – 5.0 for energy and 200 – 500 for stress

  • If a Tersoff-style cutoff function is used, a Tersoff variable of between 0.25 and 0.75 is reasonable, where larger values modify the interaction at shorter distances, but can make it easier to obtain a smooth interaction

  • Be mindful of outer cutoff distances with respect to box lengths. Training data are typically generated with small (i.e. DFT-friendly) cells

  • If you don’t want to determine all pair/triplet/quadruplet interaction types and inner cutoffs for fm_setup.in by hand, setup a dummy fm_setup.in file with:

    • # PAIRTYP# set to CHEBYSHEV 2 2 2 -1 1,

    • 2-body inner cutoffs = 0.0

    • 2-body outer cutoffs = something tiny (i.e. 0.5)

    • Do not explicitly set 3- or 4-body outer cutoffs (the code will use the 2-body value for them)

    • Run the chimes_lsq code using this file

    • Check the bottom section of the output, which lists all this information for you


Solving for ChIMES parameters

Once a ChIMES design matrix has been generated, parameters can be obtained by running chimes_lsq.py. Note that this script depends on native numpy and scipy, installations for python3.x, and has numerous features, detailed below. If chimes_lsq was run with SPLITFI set to false and no weighting is desired, one can solve for ChIMES parameters using principal component analysis based on the singular value decomposition of the design matrix (“SVD”) with default regularization (1.0e-05) via:

python3.x /path/to/repo/build/chimes_lsq.py > params.txt

however we note that chimes_lsq supports many other solvers, summarized below:

Supported solvers

Note: Asterisks (*) indicate options described in greater detail below

Method

Additional dependencies

Related chimes_lsq.py flags

SVD

None

--eps --weights

Fast SVD

None

--eps --weights

LASSO

sklearn

--alpha --normalize --weights

LASSOLARS

sklearn

--alpha --normalize --weights

DLARS*

DLARS

--alpha --normalize --weights (--dlasso_dlars_path) --nodes --cores --read_output --restart --split_files

DLASSO*

DLARS

--alpha --normalize --weights (--dlasso_dlars_path) --nodes --cores --read_output --restart --split_files

Using DLARS and DLASSO

DLARS and DLASSO provided a capability for distributed design matrix solution. This feature is particularly useful for large training sets or high complexity fits, but is best reserved for runs on high-performance computers.

Options and flags

Flag

Option type

Default value

Description

--A

str

A.txt

Design matrix

--algorithm

str

svd

Fitting algorithm (Supported options: svd, lasso, lassolars, dlars, and dlasso)

--dlasso_dlars_path

str

N/A

Path to DLARS/DLASSO solver

--alpha

float

1.0e-04

Lasso or ridge regularization

--b

str

b.txt

Reference force file

--cores

int

8

DLARS/DLASSO total number of cores (i.e., nodes * cores-per-node)

--eps

float

1.0e-05

SVD regularization

--header

str

params.header

Parameter file header

--map

str

ff_groups.map

Parameter file map

--nodes

int

1

DLARS/DLASSO number of nodes

--normalize

bool

False

Normalize DLARS/DLASSO calculation

--read_output

bool

False

Read output from previous DLARS run

--restart_dlasso_dlars

str

N/A

Determines whether dlasso or dlars job will be restarted. Argument is the restart file name

--split_files

bool

False

LSQ code has split A matrix output (DLARS/DLASSO)

--test_suite

bool

False

Output for test suite

--weights

str

N/A

Weight file

--active

bool

False

Is this a DLARS/DLASSO run from the active learning driver?

Choosing solvers

All solvers support regularization (i.e. via --eps or --alpha), which can aid overfitting mitigation. For LASSO, LASSOLARS, DLARS, and DLASSO solvers, regularization has the added benefit of automatically setting minimally-informative parameters to zero, helping minimize model size and increase efficiency. Note that these zeroed parameters can be scrubbed from the parameter file with post_proc_chimes_lsq.py, e.g.:

python3.x /path/to/repo/build/post_proc_chimes_lsq.py <parameter_file>

which produces a new parameter file named <parameter_file>.reduced. Doing so speeds up the sum over Chebyshev coefficients increasing computational efficiency.

Weighting

Weighting is often needed when energies and stresses tensors are included in the fit. Typical respective weighting factors are 0.1 – 5.0 for energy and 200 – 500 for stress. Weights can be rapidly generated through:

wF=1.0
wE=0.1
wS=100.0

paste b-labeled.txt | awk -v wF="$wF" -v wE="$wE" -v wS="$wS" '{if($1=="+1"){print wE}else if($1~"s_"){print wS}else{print wF}}' > weights.txt

Anatomy of ChIMES parameter file

Variables

Descriptions

Number of variables

The maximum number of coefficients

Number of equations

The number of equations to solve

eps

The threshold for dropping small singular values in SVD

SVD regularization factor

The regularization parameter for SVD algorithm

alpha

The regularization parameter for LASSO regularization

RMS Force error

The global root mean square error between training data and ChIMES predictions

max abs variable

The absolute value of the largest coefficient

number of fitting vars

The number of variables after regularization

Bayesian Information Criterion

The model selection criterion

Using weighting file

The file containing weights for fitting

USECOUL

If the model accounts for system charges

FITCOUL

If the model is fit to system charges

USE3BCH

If the model is fit to 3-body interactions

USE4BCH

If the model is fit to 4-body interactions

EXCLD1B

If the model excludes 1-body energy

EXCLD2B

If the model excludes 2-body interactions

PAIRTYP

2-, 3-, and 4-body Chebyshev polynomial orders

ATOM TYPES

The number of atom types

TYPEIDX

The index of atom type

ATM_TYP

The label of atom type

ATMCHRG

The charge of atom type

ATMMASS

The mass of atom type

ATOM PAIRS

The number of pair types

PAIRIDX

The index of the pair type

ATM_TY1

The label of atom 1 in the pair

ATM_TY2

The label of atom 2 in the pair

S_MINIM

The inner cutoff for the pair

S_MAXIM

The outer cutoff for the pair

CHBDIST

The transformation style for pair distance

MORSE_LAMBDA

The parameter used in Morse transformation

FCUT TYPE

The type of cutoff function used

SPECIAL 3B S_MAXIM

Special outer cutoff for 3-body interactions

ATOM PAIR TRIPLETS

The number of triplet types

ATOM PAIR QUARDRUPLETS

The number of quadruplet types

PAIRTYPE PARAMS

The pair type index and atom types

TRIPLETTYPE PARAMS

The triplet type index and atom types; see code block for details

PAIRMAPS

The permutations of pair types

TRIPMAPS

The permutations of triplet types

NO ENERGY OFFSETS

The number of atoms types

ENERGY OFFSET X

The energy offset of atom type X

Here we provide an example of the triplet parameters.

 1TRIPLET CHEBYSHEV PARAMS
 2TRIPLETTYPE PARAMS:
 3INDEX: 0 ATOMS: Si Si Si
 4PAIRS: SiSi SiSi SiSi UNIQUE: 665 TOTAL: 3332
 5    index  |  powers  |  equiv index  |  param index  |       parameter
 6----------------------------------------------------------------------------
 7    0       0  1  1         0               0          2.1714984512050e+02
 8    1       1  0  1         0               0          2.1714984512050e+02
 9    2       1  1  0         0               0          2.1714984512050e+02
10    3       0  1  2         3               1          6.2564844297350e+01
11    4       1  0  2         3               1          6.2564844297350e+01
12    5       0  2  1         3               1          6.2564844297350e+01
13    6       2  0  1         3               1          6.2564844297350e+01
14    7       1  2  0         3               1          6.2564844297350e+01
15    8       2  1  0         3               1          6.2564844297350e+01
16    9       0  1  3         9               2          3.4634223883030e+01
17    10      1  0  3         9               2          3.4634223883030e+01
  • Line 3 specifies the index of the triplet and the atoms in the triplet. Line 4 specifies all pairs in the triplet, lists the number of unique triplet power index and the total number of triple power index.

  • Line 4 specifies the pair combinations used in the triplet block

  • Line 5:

    • index: Specifies the index of the coefficient in the triplet sum

    • powers: The powers of the three Chebyshev polynomials

    • equiv index: The index of the first instance of the permutationally invariant coefficient

    • param index: The unique coefficient index based of equiv index

    • parameter: The value of the coefficients

Tips and tricks

  • Weighting and regularization can all impact quality of resulting models, and their values should be considered model hyperparameters and carefully explored for each fitting problem.

  • For LARS or LASSO-based solvers, regularization of 1.0e-2 or 1.0e-5 are reasonable starting points for un-normalized and normalized fits, respectively

  • Running chimes_lsq.py can be memory intensive. Check the size of A.txt relative to the available memory (RAM) on the workhorse machine prior to running.

Checking fit quality

Though it is critical to evaluate model performance on the basis of physical predictions, it can be helpful to evaluate model performance through target force, energy, or stress recovery. This can be easily obtained through:

paste b-labeled.txt force.txt | awk '{if($1=="+1")              {print ($2,$3)}}' > compare-energies.txt
paste b-labeled.txt force.txt | awk '{if($1~ "s_")              {print ($2,$3)}}' > compare-stresses.txt
paste b-labeled.txt force.txt | awk '{if(($1!~"s_")&&($1!="+1")){print ($2,$3)}}' > compare-forces.txt

xmgrace compare*txt

If optional <F_flag>, <S_flag>, and <E_flag> flags in traj_list.dat yield more information in b-labeled.txt from which selections can be parsed, as can consideration of natoms.txt.


Common Issues

My model is requiring SO many iterations to become stable

  • Have you set reasonable inner cutoffs (i.e. 0.02 to 0.002 less than the smallest observed distance, set individually for each pair type)?

  • Have you remembered to set the penalty parameters (i.e. \(A_{\mathrm{p}}\) from 1.0e5 to 1.0e6, \(d_{\mathrm{p}}\) from 0.01 to 0.05)?

  • Have you tried using a smaller timestep during the early training stages? (e.g., if DFT simulations for a C/H/O/N system used a 0.5 fs timestep)?

    • ChIMES training simulations should use something closer to 0.1 – 0.2

    • Small timesteps allow the ChIMES simulations to remain stable for longer, which can help make iteratively-obtained training data more informative

My model is yielding very bad RDFs/is missing significant RDF features

  • Have you accidentally grabbed only sequential configurations?

  • Have you watched your DFT trajectory to make sure the system exhibits significant evolution?

    • For example physical properties of a melted material will generally not be recovered from training to only solid phase data

  • Have you checked that your actual training trajectory contains those features? For example, you can quickly check by computing your RDFs directly from the training trajectory and checking you peaks in all the expected positions.

My model just isn’t as good as expected

  • Are you using an appropriate bodiedness/polynomial order?

    • Holdout cross-validation on the original training set can give you an idea of what to use.

  • Are your cutoffs appropriate?

    • Cutoffs cannot be arbitrarily short (they should encapsulate the relevant physics) or long (they should preclude self-interaction across the training periodic boundary conditions)

Caveats

  • ChIMES models are not inherently transferable - training set generation should be carefully considered prior to model generation

  • High-complexity (e.g. bodiedness and polynomial order) models for complex molecular materials (e.g. C-containing) will not generally be stable for dynamics the first time through - in these cases, iterative model fitting is usually required. See link for additional details.

  • ChIMES will attempt to generate single-atom energies - these values cannot be accurately determined unless the training set contains frames with varied atomic composition, otherwise, ChIMES assigns one atom the sum of single-atom energies for each atom type.