Generating a ChIMES model

This section assumes the goal of obtaining an atomistic ChIMES model from Kohn-Sham density functional theory (DFT) molecular dynamics (MD) generated reference data. However, ChIMES models can be fit to a broad variety of reference calculations. For additional information see Citing ChIMES.

At its simplest, ChIMES models are generated by per-atom force matching to DFT reference data, and the ChIMES Parameter Generator, chimes_lsq, aims to minimize the following objective function:

\begin{equation} F_{\mathrm{obj}} = \frac{1}{n_\mathrm{f}3n_{\mathrm{a}}} \sum^{n_\mathrm{f}}_{i=1} \left[ \sum^{n_\mathrm{a}}_{j=1} \sum^{3}_{k=1} w^2_{\mathrm{F}_{ijk}}(\Delta F_{ijk})^2 \right], \end{equation}

where \(\Delta F = F^{\mathrm{DFT}} - F^{\mathrm{ChIMES\{c\}}}\). \(F_{\mathrm{obj}}\), and \(\{c\}\) are the weighted root-mean-squared error (RMSE) and model coefficients, respectively, and the number of frames and atoms are given by \(n_\mathrm{f}\) and \(n_\mathrm{a}\), respectively. \(F_{ijk}\) indicates the \(k^{\mathrm{th}}\) Cartesian component of the force on atom \(j\) in configuration \(i\). The superscripts “ChIMES” and “DFT” indicate forces/energies predicted from the force-matched model and the DFT-MD training trajectory, respectively.

Because ChIMES is entirely linear it its fitted parameters, the above optimization problem can be recast as the following over-determined matrix equation:

\begin{equation} \mathrm{\mathbf{wMc}} = \mathrm{\mathbf{wb}_{DFT}}, \end{equation}

where \(\mathrm{\mathbf{b}_{DFT}}\) is the vector of DFT-generated per-atom force components, \(F^{\mathrm{DFT}}_{ijk}\), \(\mathrm{\mathbf{w}}\) is a diagonal matrix of weights to be applied to the elements of \(\mathrm{\mathbf{b}_{DFT}}\) and rows of \(\mathrm{\mathbf{M}}\), \(\mathrm{\mathbf{c}}\) is the diagonal matrix of generated model parameters, and \(\mathrm{\mathbf{M}}\) is the overdetermined ChIMES design matrix for which elements of \(\mathrm{\mathbf{M}}\) are given by:

\begin{equation} M _{a,b} = \frac{\partial X _{a,\mathrm{ChIMES} \{c\}}}{\partial c_{b}} \end{equation}

Please note that the matrix \(\mathrm{\mathbf{M}}\) is referred to as matrix \(\mathrm{\mathbf{A}}\) in the software and reported as A.txt.

As discussed in further detail below, we note that per-configuration energies and stress tensors can be considered in ChIMES fits, in additional to per-atom forces.

Additional details are provided below in the following sections:

Essential workflow
Input files and options
Running
Solving for ChIMES parameters
Caveats

Essential workflow

The basic ChIMES model development workflow comprises 5 main steps, briefly outlined below. Note:

1. Reference data preparation

DFT-MD simulations are run for the target material over the target thermodynamic condition(s)
Per-atom forces and optionally the system energy and stress tensor are extracted from the trajectory and saved in a “training trajectory” file, in a chimes_lsq compatible format

2. Hyperparameter selection

ChIMES model hyperparameters (e.g. bodiedness, polynomial orders, cutoffs, transformation function parameters, etc.) are specified in the chimes_lsq input file (e.g., typically named fm_setup.in), along with the training trajectory file

3. Design matrix generation

chimes_lsq is executed to produce the design matrix (A.txt), the vector of forces (and optionally energies and stresses) from DFT (b.txt), and other metadata files (e.g. params.header, ff_groups.map)

4. Design matrix optimization

ChIMES paramteters are generated via the chimes_lsq.py utility, which also allows for consideration of an optional weight

5. Performing calculations with ChIMES

ChIMES calculations and simulations are performed through the ChIMES Calculator, which can be linked to a variety of simulation software packages

Input files and options

Note: Asterisks (*) indicate options described in greater detail below

Control variables	Description	Value/Options/Notes
`TRJFILE` *	Training trajectory file(s)	See below for details.
`WRAPTRJ`	`true`/`false`: Coorindate wrapping	Automatically disabled when ghost atoms (layers) are used.
`SPLITFI`	`true`/`false`: {A,b}.txt file splitting	Should not be used unless DLARS/DLASSO solvers are used.
`NFRAMES`	Number of training frames	Any integer > 0.
`NLAYERS`	Number of supercell ghost layers	A value of 0 yields the original box. A value of 1 yields a single shell of replicated boxes around the original box (i.e. 27 boxes).
`FITCOUL` *	`true`/`false`: Fit/use charges	See below for details.
`FITSTRS` *	Whether/how to include stresses	See below for details.
`FITENER` *	Whether/how to include energies	See below for details.
`PAIRTYP`	Chebyshev polynomial orders	Expects `<O2B> <O3B+1> <O4B+1> -1 1`. Setting an order to zero is equivalent to excluding the corresponding interaction. For example, `12 5 0 -1 1` is equivalent to excluding 4-body interactions.
`CHBTYPE` *	Pair distance transformation type	See below for details.
`USENEIG`	Neighbor list/distance convention	Auto-select algorithm (`true`) or use small-cell friendly method (`true SMALL`)
`FITPOVR`	ReaxFF linear overcoordination parameters use	Boolean to decide if using ReaxFF linear overcoordination parameters. If this is false and parameters are provided below, those parameters will be subtracted from forces.
`SKPFRMS`	Skip frames to train to	Decide to process frames in round-robin order with skipping. Integer input.

Note: Asterisks (*) indicate options described in greater detail below

Topology variables	Description	Value/Options/Notes
`NATMTYP`	Number of atom types	Number of unique atom types in trajectory.
`TYPEIDX`	Atom type index	Integers, ranging from 1 to `NATMTYP`. Values form a column below `# TYPEIDX #`. Make sure the element order in `fm_setup.in` matches with the order in the topological file.
`ATM_TYP`	Atom type chemical symbol	Any string. Chemical symbol for each unique atom type. Values form a column below `# ATM_TYP #`.
`ATMCHRG`	Atom type charge	If `FITCOUL` is false: a partial atomic charge for each atom type. If `FITCOUL` is true: a positive or negative sign, to indicate how pair charge signs should be assigned. If `ATMCHRG` is set to zero, ChIMES is fit normally.
`ATMMASS`	Atom type mass	Floats > 0.
`EXCLUDE` *	Interaction to exclude/ignore	Sets of n-body `ATM_TYP` s. Rarely used.
`PAIRIDX`	Pair type index	Ascending integers from 1 to the number of unique atom pair types. Values form a column below `# PAIRIDX #`.
`ATM_TYX`	`ATM_TYP` of atom X in pair `PAIRIDX`	Used to define interaction pair types. Order does not matter.
`S_MINIM`	Inner cutoff for pair `PAIRIDX`	Generally taken as slightly less than smallest distance sampled in DFT-MD trajectory.
`S_MAXIM`	Outer cutoff for pair `PAIRIDX`	Should be small enough to prevent self-interaction across periodic boundaries.
`S_DELTA`	Ancillary. Value required	Must have floating point number in this location.
`MORSE_LAMBDA`	`CHBTYPE` variable	Morse-type lambda for Morse `CHBTYP`. Only used if `# PAIRTYP #` is `CHEBYSHEV`. Generally set to location of first radial distribution peak for each pair type.
`USEOVRP`	ReaxFF overbonding term	Boolean. If `FITPOVR` is true, will look for `# PAIRTYP #` with true. Asks user if they would like to use the ReaxFF overbonding term. Additional parameters required if used.
`NXYBINS`	Ancillary	Ancillary
`CHGCONS` *	Charge fitting constraints	See below for details.
`SPECIAL XB MINIM` *	Special manybody inner cutoffs	See below for details.
`SPECIAL XB MAXIM` *	Special manybody outer cutoffs	See below for details.
`FCUTTYP` *	Cutoff function style/parameters	See below for details.

Additional details on:

`TRJFILE`

This keyword provides the name of the simulation trajectory file. Files use a .xyzf format, which is like the standard .xyz format, with two exceptions: (1) the line after that specifying number of atoms will contain information on box dimensions and optionally stress tensor and system energy, and (2) each coordinate line has x, y, and z forces on the corresponding atom appended. Options are: 1. <any string>: Tell the program to search for a single trajectory file name. 2. MULTI <any string>: Tell the program to expect multiple trajectory files. Here, <any string> is the name of a file, e.g. traj_list.dat structured like:

<nfiles>
<frames to read> path/to/file-1.xyzf
<frames to read> path/to/file-2.xyzf
...
<frames to read> path/to/file-n.xyzf

Note that, when using the MULTI option, # NFRAMES # must be equal to the sum of <frames to read> for each of the <nfiles>.

As is described in the Output files section below, traj_list.dat can contain optional decoration which is useful for improving output annotation and/or automating fitting work flows. For example:

<nfiles>
<frames to read> path/to/file-1.xyzf  <F_flag> <S_flag> <E_flag>
<frames to read> path/to/file-2.xyzf  <string> <F_flag> <S_flag> <E_flag>
...
<frames to read> path/to/file-n.xyzf  <string>

where <string>, <F_flag>, <S_flag>, and <E_flag> represent an optional string ignored by the program, and strings to be prepended to labels in b-labeled.txt for force, stress, and energy entries, respectively. Note that none of these annotations can contain whitespace, and that annotation order cannot be randomized.

Comment lines in trajectory file frames are formatted like:

<box x-len> <box y-len> <box z-len> <s_xx> <s_yy> <s_zz> <s_xy> <s_xz> <s_yz> <energy>

or, if the first word on the line is “NON-ORTHO”:

NON-ORTHO <latvec-1_x> <latvec-1_y> <latvec-1_z> <latvec-2_x> <latvec-2_y> <latvec-2_z> <latvec-3_x> <latvec-3_y> <latvec-3_z> <s_xx> <s_yy> <s_zz> <s_xy> <s_xz> <s_yz> <energy>

where \(s\_ab\) are the \(ab\) stress tensors and energy is the overall system energy, and \(\mathrm{latvec-}i\_a\) is the \(a^{\rm{th}}\) component of the \(i^{\rm{th}}\) lattice vector. Note that stress tensors and energies are optional and thier inclusion is indicated by FITSTRS and FITENER in the fm_setup.in file. Additional details can be found in the corresponding sections below.

`FITSTRS`

This keyword controls if and how per-configuration stress tensors are included in the fit. Options are:

false:
- Do not include any stress tensor information in the fit
true:
- Fit to stress tensors for all training trajectory frames, but only expect/consider \(xx\), \(yy\), and \(zz\) stress tensor components
ALL:
- Fit to stress tensors for all training trajectory frames, and expect/consider all stress tensor components
FIRST <integer>:
- Fit to stress tensors for the overall first <integer> training trajectory frames, but only expect/consider \(xx\), \(yy\), and \(zz\)
FIRSTALL <integer>:
- Fit to stress tensors for the overall first <integer> training trajectory frames, and expect/consider all stress tensor components

`FITENER`

This keyword controls if and how per-configuration energies are included in the fit. Options are:

false:
- Do not include any energy information in the fit
true:
- Include/expect an energy for each configuration in the training trajectory(s)
FIRST <integer>:
- Include/expect an energy for the overall first <integer> configurations in the training trajectory(s)

`FITCOUL`

This keyword defines whether charges should be fit, or held fixed at user-defined values. Note that currently, functionality is only supported when FITCOUL is true, or when FITCOUL is false and all charges are zero. If FITCOUL is false, but charges are non-zero, program will attempt to subtract charge contributions from forces.

`CHBTYPE`

This keyword defined the pair distance transformation method (i.e. for compatibility with the [-1,1] domain over which Chebyshev polynomials are defined). Currently, the only options are MORSE and DEFAULT (i.e. “direct”). See reference “Water-1” in Citing ChIMES for additional details. If MORSE is selected, meaningful values of MORSE_LAMBDA should be specified.

`EXCLUDE`

Interactions corresponding to specific atom triplets can be excluded from the fitting process by including the following lines above the NATMTYP entry in the input file, e.g. to exclude O-O-O (comprised of OO, OO, and OO atom pairs) and C-O-O 3-body interactions (comprised of CO, CO, and OO pairs) from the fit:

EXCLUDE 3B INTERACTION: 2
O O O
C O O

Note: under certain circumstances, users might want to exclude 1- and 2-body interactions, see more details on the ChIMES Active Learning Driver documentation.

`CHGCONS`

If the user desires to fit charges during the force matching process, they can either do so with no constraints (the default option), or by specifying n_atom_pairs -1 constraints. These constraints are added to the end of the input file, before # ENDFILE #. Take water, as an example; here we have 3 pair types, OO, OH, and HH. We want to enforce that the sum of all charges is zero, and that H has half the charge of O, and we can do so by adding the following lines:

CHARGE CONSTRAINTS:
OO  HH  OH  1000.0  -4000.0  0.0     0.0
OO  HH  OH  1000.0   4000.0  4000.0  0.0

The first line can be re-written as the equation: \(1000q_\mathrm{O}q_\mathrm{O} - 4000q_\mathrm{H}q_\mathrm{H} = 0\), and enforces the relationship that \(|q_\mathrm{O}| = 2|q_\mathrm{H}|\). The second line can be re-written as the equation: \(1000q_\mathrm{O}q_\mathrm{O}+ 4000q_\mathrm{H}q_\mathrm{H} + 4000q_\mathrm{O}q_\mathrm{H} = 0\), and enforces that the sign of \(q_\mathrm{OH}\) needs to be opposite of \(q_\mathrm{HH}\) and \(q_\mathrm{OO}\) (i.e. negative), and the relationship \(|q_\mathrm{OH}| = 2|q_\mathrm{HH}|\). Note that each line needs entries for all atom pairs.

`SPECIAL XB MINIM`

By default, many-body inner cutoffs are taken to be equivalent to the constituent 2-body inner cutoffs. One has the option of setting all inner cutoffs of a given bodiedness (e.g. 3-body) to an equivalent value by adding the following line to the end of the fm_setup.in file:

SPECIAL 3B S_MINIM: ALL 0.0

Otherwise, each cutoff can be specified separately through syntax similar to the following:

SPECIAL 3B S_MINIM: SPECIFIC 4
OOOOOO OO OO OO 2.00000 2.00000 2.00000
OOOHOH OO OH OH 2.00000 0.80000 0.80000
HHOHOH OH OH HH 0.80000 0.80000 1.00000
HHHHHH HH HH HH 1.00000 1.00000 1.00000

Where the “4” is the number of cutoffs to be listed. For example, for 3-body interactions, the first column specifies the syntax order of all three pairs. The next three columns specifies the syntax of all three pairs. The last three columns specifies the special cutoff values. Any many-body type for which a line is not provided will use the same S_MINIM as the 2-body interactions, where constituent pairs determine the cutoff.

`SPECIAL XB MAXIM`

By default, many-body outer cutoffs are taken to be equivalent to the constituent 2-body outer cutoffs. One has the option of setting all outer cutoffs of a given bodiedness (e.g. 3-body) to an equivalent value by adding the following line to the end of the fm_setup,in file:

SPECIAL 3B S_MAXIM: ALL 4.0

Otherwise, each cutoff can be specified separately through syntax similar to the following:

SPECIAL 3B S_MAXIM: SPECIFIC 4
CCCCCC CC CC CC 4.4 4.4 4.4
COCOCC CC CO CO 4.4 4.0 4.0
OOCOCO CO CO OO 4.0 4.0 6.5
OOOOOO OO OO OO 6.5 6.5 6.5

Where the “4” is the number of cutoffs to be listed. For example, for 3-body interactions, the first column specifies the syntax order of all three pairs. The next three columns specifies the syntax of all three pairs. The last three columns specifies the special cutoff values. Any many-body type for which a line is not provided will use the same S_MAXIM as the 2-body interactions, where constituent pairs determine the cutoff.

Note: ChIMES does not currently support many-body cutoffs to be larger than 2-body cutoffs. In addition, this specification is not compatible with LAMMPS.

`FCUTTYP`

This keyword specifies Chebyshev potential cutoff function types and corresponding parameters. FCUTTYP should be specified on its own line, under # PAIRIDX # entries. Currently supported options include CUBIC and TERSOFF <float>, where <float> gives the cutoff function kick-in distance as: \(r_\mathrm{cut} - <\mathrm{float}>r_\mathrm{cut}\), and should take on values between 0 and 1. For example, TERSOFF 0.3 modified the last 30% of the interaction.

Running

Generating the design matrix (`chimes_lsq`)

To generate a ChIMES design matrix and related metadata files, an input file, fm_setup.in must be created, and a training trajectory file(s) must be present at the location(s) specified by the TRJFILE option in fm_setup.in. With these files in place, the following command should be run:

/path/to/repo/build/chimes_lsq fm_setup.in > fm_setup.log

which will produce several output files, listed below. Note that if SPLITFI is set true in fm_setup.in, some files will be output as several file.<zero-padded-number>.txt rather than a single file.txt. dim*txt files contain additional information on this splitting.

Note that, for historical reasons, energy entries are repeated three times in A.txt, b.txt, b-labeled.txt, and natoms.txt files.

Output files

A.txt (The ChIMES design matrix)

For force- or force- and stress-only fits, columns contain each \(\frac{\partial X _{a,\mathrm{ChIMES} \{c\}}}{\partial c_{b}}\) value for a given coefficient \(c\) with index \(b\).
- If energies are included in the fit, the final columns are the number of atoms of each type in corresponding training trajectory frame
Each row corresponds to the \(i\), \(j\), or \(k^{\mathrm{th}}\) force component of a given atom followed by the 0 (False), 3 (True), or 9 (ALL) stress components and the 0 (False) or 3 (True) energy values in the corresponding training trajectory frame. For easier mapping see the b-labeled.txt file
- Thus, 2-body only, force-only fit with polynomial order 10 for a 25 frame trajectory containing 300 atoms, A.txt would contain \(25 \times 300 \times 3 = 22,500\) rows and 10 columns.
If SPLITFI is true:
- Rather than a single A.txt file, several A.<zero-padded-number>.txt files are produced, which contain a subset of chimes design matrix rows. See dim.txt below for additional details.

b.txt (DFT forces, and optional stresses and energies)

This file contains the DFT per-atom forces and optionally configuration stresses and energies extracted from the training trajectory
Each row corresponds to the \(i\), \(j\), or \(k^{\mathrm{th}}\) force component of a given atom followed by the 0 (False), 3 (True), or 9 (ALL) stress components and the 0 (False) or 3 (True) energy values in the corresponding training trajectory frame.
- Thus, 2-body only, force-only fit for a 25 frame trajectory containing 300 atoms, b.txt would contain \(25 \times 300 \times 3 = 22,500\) rows, irrespective of model complexity (i.e. bodiedness and polynomial order).
If SPLITFI is true:
- Rather than a single b.txt file, several b.<zero-padded-number>.txt files are produced, which contain a subset of b.txt rows. See dim.txt below for additional details.

b-labeled.txt (DFT forces, and optional stresses and energies, labeled)

This file is a replicate of b.txt which contains additional information on each entry. If the row corresponds to a force, the first and second column entries will provide atom chemical symbol and corresponding DFT force. For energies, rows are label with a “+1”, and for stresses, “\(s\_ab\)”, where ab indicates the corresponding tensor component (e.g. xx, yy, zz, etc.).
Additional annotation in the b-labeled.txt file can be realized by decoration of the optional traj_list.dat file, when the fm_setup.in keyword TRJFILE uses the MULTI option.
If SPLITFI is true:
- Rather than a single b-labeled.txt file, several b-labeled.<zero-padded-number>.txt files are produced, which contain a subset of b-labeled.txt rows. See dim.txt below for additional details.

dim.txt (Output file dimensions)

If SPLITFI is false:
- The first value in dim.txt is the number of columns in A.txt, and the second is the number of rows in A.txt, b.txt, b-labeled.txt, and natoms.txt.
If SPLITFI is true:
- A.txt, b.txt, b-labeled.txt, and natoms.txt are split into sub-files containing approximately the same number of lines, named like file.<zero-padded-number>.txt, e.g. A.0001.txt. Corresponding dim.<zero-padded-number>.txt files are produced, where the first value is the number of columns in the A.<zero-padded-number>.txt (which is the same in all files), and the last value is the total number of rows that would be present in each of A.txt, b.txt, b-labeled.txt, and natoms.txt if they were output in thier unsplit form. Finally, the second and third columns in file.<zero-padded-number>.txt indicate the first and last row of the unsplit file contained in file.<zero-padded-number>.txt.

natoms.txt (Number of atoms in training trajectory)

If SPLITFI is false:
- For each row in A.txt, b.txt, and b-labeled.txt, provides the number of atoms in the training trajectory frame from which information in the row originated. Useful for weighting or evaluating per-atom quantities.
If SPLITFI is true:
- For each row in A.<zero-padded-number>.txt, b.<zero-padded-number>.txt, and b-labeled.<zero-padded-number>.txt, provides the number of atoms in the training trajectory frame from which information in the row originated. Useful for weighting or evaluating per-atom quantities.

params.header (metadata)

Contains metadata (e.g. hyperparameters, headers, etc.) from fm_setup.in.

ff_groups.map (metadata)

Contains metadata (e.g. mapping between many-body atom types and thier corresponding integer type).

Tips and tricks

Inner cutoffs are typically set to the lowest sampled distance for each given pair type in the training trajectory
For molecular systems:
- 2-body outer cutoffs are usually set to encompass at least the second non-bonded solvation shell (determined by examination of the DFT-MD generated radial pair distribution functions, RDFs), or around 8 Å
- 3-body outer cutoffs are usually set to encompass the first non-bonded solvation shell
- 4-body outer cutoffs are usually set to between the first or second RDF minimum; choice depends on the nature of the system (i.e. do we need to describe extended 4-body interactions like molecular torsions, or more short-ranged interactions like those found in small molecules?
The “Morse parameter” is usually set to the distance corresponding to the the first (bonding) peak in the in the pair RDF
There is no hard-and-fast rule for setting the 2/3/4-body polynomial orders, but 12/7/3 is generally a good starting point
Weighting is often needed when energies and stresses tensors are included in the fit. Typical respective weighting factors are 0.1 – 5.0 for energy and 200 – 500 for stress
If a Tersoff-style cutoff function is used, a Tersoff variable of between 0.25 and 0.75 is reasonable, where larger values modify the interaction at shorter distances, but can make it easier to obtain a smooth interaction
Be mindful of outer cutoff distances with respect to box lengths. Training data are typically generated with small (i.e. DFT-friendly) cells
If you don’t want to determine all pair/triplet/quadruplet interaction types and inner cutoffs for fm_setup.in by hand, setup a dummy fm_setup.in file with:
- # PAIRTYP# set to CHEBYSHEV 2 2 2 -1 1,
- 2-body inner cutoffs = 0.0
- 2-body outer cutoffs = something tiny (i.e. 0.5)
- Do not explicitly set 3- or 4-body outer cutoffs (the code will use the 2-body value for them)
- Run the chimes_lsq code using this file
- Check the bottom section of the output, which lists all this information for you

Solving for ChIMES parameters

Once a ChIMES design matrix has been generated, parameters can be obtained by running chimes_lsq.py. Note that this script depends on native numpy and scipy, installations for python3.x, and has numerous features, detailed below. If chimes_lsq was run with SPLITFI set to false and no weighting is desired, one can solve for ChIMES parameters using principal component analysis based on the singular value decomposition of the design matrix (“SVD”) with default regularization (1.0e-05) via:

python3.x /path/to/repo/build/chimes_lsq.py > params.txt

however we note that chimes_lsq supports many other solvers, summarized below:

Supported solvers

Note: Asterisks (*) indicate options described in greater detail below

Method	Additional dependencies	Related `chimes_lsq.py` flags
SVD	None	`--eps` `--weights`
Fast SVD	None	`--eps` `--weights`
LASSO	sklearn	`--alpha` `--normalize` `--weights`
LASSOLARS	sklearn	`--alpha` `--normalize` `--weights`
DLARS*	DLARS	`--alpha` `--normalize` `--weights` (`--dlasso_dlars_path`) `--nodes` `--cores` `--read_output` `--restart` `--split_files`
DLASSO*	DLARS	`--alpha` `--normalize` `--weights` (`--dlasso_dlars_path`) `--nodes` `--cores` `--read_output` `--restart` `--split_files`

Using `DLARS` and `DLASSO`

DLARS and DLASSO provided a capability for distributed design matrix solution. This feature is particularly useful for large training sets or high complexity fits, but is best reserved for runs on high-performance computers.

Options and flags

Flag	Option type	Default value	Description
`--A`	str	A.txt	Design matrix
`--algorithm`	str	svd	Fitting algorithm (Supported options: svd, lasso, lassolars, dlars, and dlasso)
`--dlasso_dlars_path`	str	N/A	Path to DLARS/DLASSO solver
`--alpha`	float	1.0e-04	Lasso or ridge regularization
`--b`	str	b.txt	Reference force file
`--cores`	int	8	DLARS/DLASSO total number of cores (i.e., nodes * cores-per-node)
`--eps`	float	1.0e-05	SVD regularization
`--header`	str	params.header	Parameter file header
`--map`	str	ff_groups.map	Parameter file map
`--nodes`	int	1	DLARS/DLASSO number of nodes
`--normalize`	bool	False	Normalize DLARS/DLASSO calculation
`--read_output`	bool	False	Read output from previous DLARS run
`--restart_dlasso_dlars`	str	N/A	Determines whether dlasso or dlars job will be restarted. Argument is the restart file name
`--split_files`	bool	False	LSQ code has split A matrix output (DLARS/DLASSO)
`--test_suite`	bool	False	Output for test suite
`--weights`	str	N/A	Weight file
`--active`	bool	False	Is this a DLARS/DLASSO run from the active learning driver?

Choosing solvers

All solvers support regularization (i.e. via --eps or --alpha), which can aid overfitting mitigation. For LASSO, LASSOLARS, DLARS, and DLASSO solvers, regularization has the added benefit of automatically setting minimally-informative parameters to zero, helping minimize model size and increase efficiency. Note that these zeroed parameters can be scrubbed from the parameter file with post_proc_chimes_lsq.py, e.g.:

python3.x /path/to/repo/build/post_proc_chimes_lsq.py <parameter_file>

which produces a new parameter file named <parameter_file>.reduced. Doing so speeds up the sum over Chebyshev coefficients increasing computational efficiency.

Weighting

Weighting is often needed when energies and stresses tensors are included in the fit. Typical respective weighting factors are 0.1 – 5.0 for energy and 200 – 500 for stress. Weights can be rapidly generated through:

wF=1.0
wE=0.1
wS=100.0

paste b-labeled.txt | awk -v wF="$wF" -v wE="$wE" -v wS="$wS" '{if($1=="+1"){print wE}else if($1~"s_"){print wS}else{print wF}}' > weights.txt

Anatomy of ChIMES parameter file

Variables	Descriptions
`Number of variables`	The maximum number of coefficients
`Number of equations`	The number of equations to solve
`eps`	The threshold for dropping small singular values in SVD
`SVD regularization factor`	The regularization parameter for SVD algorithm
`alpha`	The regularization parameter for LASSO regularization
`RMS Force error`	The global root mean square error between training data and ChIMES predictions
`max abs variable`	The absolute value of the largest coefficient
`number of fitting vars`	The number of variables after regularization
`Bayesian Information Criterion`	The model selection criterion
`Using weighting file`	The file containing weights for fitting
`USECOUL`	If the model accounts for system charges
`FITCOUL`	If the model is fit to system charges
`USE3BCH`	If the model is fit to 3-body interactions
`USE4BCH`	If the model is fit to 4-body interactions
`EXCLD1B`	If the model excludes 1-body energy
`EXCLD2B`	If the model excludes 2-body interactions
`PAIRTYP`	2-, 3-, and 4-body Chebyshev polynomial orders
`ATOM TYPES`	The number of atom types
`TYPEIDX`	The index of atom type
`ATM_TYP`	The label of atom type
`ATMCHRG`	The charge of atom type
`ATMMASS`	The mass of atom type
`ATOM PAIRS`	The number of pair types
`PAIRIDX`	The index of the pair type
`ATM_TY1`	The label of atom 1 in the pair
`ATM_TY2`	The label of atom 2 in the pair
`S_MINIM`	The inner cutoff for the pair
`S_MAXIM`	The outer cutoff for the pair
`CHBDIST`	The transformation style for pair distance
`MORSE_LAMBDA`	The parameter used in Morse transformation
`FCUT TYPE`	The type of cutoff function used
`SPECIAL 3B S_MAXIM`	Special outer cutoff for 3-body interactions
`ATOM PAIR TRIPLETS`	The number of triplet types
`ATOM PAIR QUARDRUPLETS`	The number of quadruplet types
`PAIRTYPE PARAMS`	The pair type index and atom types
`TRIPLETTYPE PARAMS`	The triplet type index and atom types; see code block for details
`PAIRMAPS`	The permutations of pair types
`TRIPMAPS`	The permutations of triplet types
`NO ENERGY OFFSETS`	The number of atoms types
`ENERGY OFFSET X`	The energy offset of atom type X

Here we provide an example of the triplet parameters.

TRIPLET CHEBYSHEV PARAMS
TRIPLETTYPE PARAMS:
INDEX: 0 ATOMS: Si Si Si
PAIRS: SiSi SiSi SiSi UNIQUE: 665 TOTAL: 3332
    index  |  powers  |  equiv index  |  param index  |       parameter
----------------------------------------------------------------------------
    0       0  1  1         0               0          2.1714984512050e+02
    1       1  0  1         0               0          2.1714984512050e+02
    2       1  1  0         0               0          2.1714984512050e+02
    3       0  1  2         3               1          6.2564844297350e+01
    4       1  0  2         3               1          6.2564844297350e+01
    5       0  2  1         3               1          6.2564844297350e+01
    6       2  0  1         3               1          6.2564844297350e+01
    7       1  2  0         3               1          6.2564844297350e+01
    8       2  1  0         3               1          6.2564844297350e+01
    9       0  1  3         9               2          3.4634223883030e+01
    10      1  0  3         9               2          3.4634223883030e+01

Line 3 specifies the index of the triplet and the atoms in the triplet. Line 4 specifies all pairs in the triplet, lists the number of unique triplet power index and the total number of triple power index.
Line 4 specifies the pair combinations used in the triplet block
Line 5:
- index: Specifies the index of the coefficient in the triplet sum
- powers: The powers of the three Chebyshev polynomials
- equiv index: The index of the first instance of the permutationally invariant coefficient
- param index: The unique coefficient index based of equiv index
- parameter: The value of the coefficients

Tips and tricks

Weighting and regularization can all impact quality of resulting models, and their values should be considered model hyperparameters and carefully explored for each fitting problem.
For LARS or LASSO-based solvers, regularization of 1.0e-2 or 1.0e-5 are reasonable starting points for un-normalized and normalized fits, respectively
Running chimes_lsq.py can be memory intensive. Check the size of A.txt relative to the available memory (RAM) on the workhorse machine prior to running.

Checking fit quality

Though it is critical to evaluate model performance on the basis of physical predictions, it can be helpful to evaluate model performance through target force, energy, or stress recovery. This can be easily obtained through:

paste b-labeled.txt force.txt | awk '{if($1=="+1")              {print ($2,$3)}}' > compare-energies.txt
paste b-labeled.txt force.txt | awk '{if($1~ "s_")              {print ($2,$3)}}' > compare-stresses.txt
paste b-labeled.txt force.txt | awk '{if(($1!~"s_")&&($1!="+1")){print ($2,$3)}}' > compare-forces.txt

xmgrace compare*txt

If optional <F_flag>, <S_flag>, and <E_flag> flags in traj_list.dat yield more information in b-labeled.txt from which selections can be parsed, as can consideration of natoms.txt.

Common Issues

My model is requiring SO many iterations to become stable

Have you set reasonable inner cutoffs (i.e. 0.02 to 0.002 less than the smallest observed distance, set individually for each pair type)?
Have you remembered to set the penalty parameters (i.e. \(A_{\mathrm{p}}\) from 1.0e5 to 1.0e6, \(d_{\mathrm{p}}\) from 0.01 to 0.05)?
Have you tried using a smaller timestep during the early training stages? (e.g., if DFT simulations for a C/H/O/N system used a 0.5 fs timestep)?
- ChIMES training simulations should use something closer to 0.1 – 0.2
- Small timesteps allow the ChIMES simulations to remain stable for longer, which can help make iteratively-obtained training data more informative

My model is yielding very bad RDFs/is missing significant RDF features

Have you accidentally grabbed only sequential configurations?
Have you watched your DFT trajectory to make sure the system exhibits significant evolution?
- For example physical properties of a melted material will generally not be recovered from training to only solid phase data
Have you checked that your actual training trajectory contains those features? For example, you can quickly check by computing your RDFs directly from the training trajectory and checking you peaks in all the expected positions.

My model just isn’t as good as expected

Are you using an appropriate bodiedness/polynomial order?
- Holdout cross-validation on the original training set can give you an idea of what to use.
Are your cutoffs appropriate?
- Cutoffs cannot be arbitrarily short (they should encapsulate the relevant physics) or long (they should preclude self-interaction across the training periodic boundary conditions)

Caveats

ChIMES models are not inherently transferable - training set generation should be carefully considered prior to model generation
High-complexity (e.g. bodiedness and polynomial order) models for complex molecular materials (e.g. C-containing) will not generally be stable for dynamics the first time through - in these cases, iterative model fitting is usually required. See link for additional details.
ChIMES will attempt to generate single-atom energies - these values cannot be accurately determined unless the training set contains frames with varied atomic composition, otherwise, ChIMES assigns one atom the sum of single-atom energies for each atom type.

Generating a ChIMES model

Essential workflow

1. Reference data preparation

2. Hyperparameter selection

3. Design matrix generation

4. Design matrix optimization

5. Performing calculations with ChIMES

Input files and options

Additional details on:

TRJFILE

FITSTRS

FITENER

FITCOUL

CHBTYPE

EXCLUDE

CHGCONS

SPECIAL XB MINIM

SPECIAL XB MAXIM

FCUTTYP

Running

Generating the design matrix (chimes_lsq)

Output files

Tips and tricks

Solving for ChIMES parameters

Supported solvers

Using DLARS and DLASSO

Options and flags

Choosing solvers

Weighting

Anatomy of ChIMES parameter file

Tips and tricks

Checking fit quality

Common Issues

Caveats

`TRJFILE`

`FITSTRS`

`FITENER`

`FITCOUL`

`CHBTYPE`

`EXCLUDE`

`CHGCONS`

`SPECIAL XB MINIM`

`SPECIAL XB MAXIM`

`FCUTTYP`

Generating the design matrix (`chimes_lsq`)

Using `DLARS` and `DLASSO`