Generating a ChIMES model
This section assumes the goal of obtaining an atomistic ChIMES model from Kohn-Sham density functional theory (DFT) molecular dynamics (MD) generated reference data. However, ChIMES models can be fit to a broad variety of reference calculations. For additional information see Citing ChIMES.
At its simplest, ChIMES models are generated by per-atom force matching to DFT reference data, and the ChIMES Parameter Generator, chimes_lsq, aims to minimize the following objective function:
where \(\Delta F = F^{\mathrm{DFT}} - F^{\mathrm{ChIMES\{c\}}}\). \(F_{\mathrm{obj}}\), and \(\{c\}\) are the weighted root-mean-squared error (RMSE) and model coefficients, respectively, and the number of frames and atoms are given by \(n_\mathrm{f}\) and \(n_\mathrm{a}\), respectively. \(F_{ijk}\) indicates the \(k^{\mathrm{th}}\) Cartesian component of the force on atom \(j\) in configuration \(i\). The superscripts “ChIMES” and “DFT” indicate forces/energies predicted from the force-matched model and the DFT-MD training trajectory, respectively.
Because ChIMES is entirely linear it its fitted parameters, the above optimization problem can be recast as the following over-determined matrix equation:
where \(\mathrm{\mathbf{b}_{DFT}}\) is the vector of DFT-generated per-atom force components, \(F^{\mathrm{DFT}}_{ijk}\), \(\mathrm{\mathbf{w}}\) is a diagonal matrix of weights to be applied to the elements of \(\mathrm{\mathbf{b}_{DFT}}\) and rows of \(\mathrm{\mathbf{M}}\), \(\mathrm{\mathbf{c}}\) is the diagonal matrix of generated model parameters, and \(\mathrm{\mathbf{M}}\) is the overdetermined ChIMES design matrix for which elements of \(\mathrm{\mathbf{M}}\) are given by:
Please note that the matrix \(\mathrm{\mathbf{M}}\) is referred to as matrix \(\mathrm{\mathbf{A}}\) in the software and reported as A.txt.
As discussed in further detail below, we note that per-configuration energies and stress tensors can be considered in ChIMES fits, in additional to per-atom forces.
Additional details are provided below in the following sections:
Essential workflow
The basic ChIMES model development workflow comprises 5 main steps, briefly outlined below. Note:
1. Reference data preparation
DFT-MD simulations are run for the target material over the target thermodynamic condition(s)
Per-atom forces and optionally the system energy and stress tensor are extracted from the trajectory and saved in a “training trajectory” file, in a
chimes_lsqcompatible format
2. Hyperparameter selection
ChIMES model hyperparameters (e.g. bodiedness, polynomial orders, cutoffs, transformation function parameters, etc.) are specified in the
chimes_lsqinput file (e.g., typically namedfm_setup.in), along with the training trajectory file
3. Design matrix generation
chimes_lsqis executed to produce the design matrix (A.txt), the vector of forces (and optionally energies and stresses) from DFT (b.txt), and other metadata files (e.g.params.header,ff_groups.map)
4. Design matrix optimization
ChIMES paramteters are generated via the
chimes_lsq.pyutility, which also allows for consideration of an optional weight
5. Performing calculations with ChIMES
ChIMES calculations and simulations are performed through the ChIMES Calculator, which can be linked to a variety of simulation software packages
Input files and options
Note: Asterisks (*) indicate options described in greater detail below
Control variables |
Description |
Value/Options/Notes |
|---|---|---|
|
Training trajectory file(s) |
See below for details. |
|
|
Automatically disabled when ghost atoms (layers) are used. |
|
|
Should not be used unless DLARS/DLASSO solvers are used. |
|
Number of training frames |
Any integer > 0. |
|
Number of supercell ghost layers |
A value of 0 yields the original box. A value of 1 yields a single shell of replicated boxes around the original box (i.e. 27 boxes). |
|
|
See below for details. |
|
Whether/how to include stresses |
See below for details. |
|
Whether/how to include energies |
See below for details. |
|
Chebyshev polynomial orders |
Expects |
|
Pair distance transformation type |
See below for details. |
|
Neighbor list/distance convention |
Auto-select algorithm ( |
|
ReaxFF linear overcoordination parameters use |
Boolean to decide if using ReaxFF linear overcoordination parameters. If this is false and parameters are provided below, those parameters will be subtracted from forces. |
|
Skip frames to train to |
Decide to process frames in round-robin order with skipping. Integer input. |
Note: Asterisks (*) indicate options described in greater detail below
Topology variables |
Description |
Value/Options/Notes |
|---|---|---|
|
Number of atom types |
Number of unique atom types in trajectory. |
|
Atom type index |
Integers, ranging from 1 to |
|
Atom type chemical symbol |
Any string. Chemical symbol for each unique atom type. Values form a column below |
|
Atom type charge |
If |
|
Atom type mass |
Floats > 0. |
|
Interaction to exclude/ignore |
Sets of n-body |
|
Pair type index |
Ascending integers from 1 to the number of unique atom pair types. Values form a column below |
|
|
Used to define interaction pair types. Order does not matter. |
|
Inner cutoff for pair |
Generally taken as slightly less than smallest distance sampled in DFT-MD trajectory. |
|
Outer cutoff for pair |
Should be small enough to prevent self-interaction across periodic boundaries. |
|
Ancillary. Value required |
Must have floating point number in this location. |
|
|
Morse-type lambda for Morse |
|
ReaxFF overbonding term |
Boolean. If |
|
Ancillary |
Ancillary |
|
Charge fitting constraints |
See below for details. |
|
Special manybody inner cutoffs |
See below for details. |
|
Special manybody outer cutoffs |
See below for details. |
|
Cutoff function style/parameters |
See below for details. |
Additional details on:
TRJFILE
This keyword provides the name of the simulation trajectory file. Files use a .xyzf format, which is like the standard .xyz format, with two exceptions: (1) the line after that specifying number of atoms will contain information on box dimensions and optionally stress tensor and system energy, and (2) each coordinate line has x, y, and z forces on the corresponding atom appended. Options are:
1. <any string>: Tell the program to search for a single trajectory file name.
2. MULTI <any string>: Tell the program to expect multiple trajectory files. Here, <any string> is the name of a file, e.g. traj_list.dat structured like:
<nfiles>
<frames to read> path/to/file-1.xyzf
<frames to read> path/to/file-2.xyzf
...
<frames to read> path/to/file-n.xyzf
Note that, when using the MULTI option, # NFRAMES # must be equal to the sum of <frames to read> for each of the <nfiles>.
As is described in the Output files section below, traj_list.dat can contain optional decoration which is useful for improving output annotation and/or automating fitting work flows. For example:
<nfiles>
<frames to read> path/to/file-1.xyzf <F_flag> <S_flag> <E_flag>
<frames to read> path/to/file-2.xyzf <string> <F_flag> <S_flag> <E_flag>
...
<frames to read> path/to/file-n.xyzf <string>
where <string>, <F_flag>, <S_flag>, and <E_flag> represent an optional string ignored by the program, and strings to be prepended to labels in b-labeled.txt for force, stress, and energy entries, respectively. Note that none of these annotations can contain whitespace, and that annotation order cannot be randomized.
Comment lines in trajectory file frames are formatted like:
<box x-len> <box y-len> <box z-len> <s_xx> <s_yy> <s_zz> <s_xy> <s_xz> <s_yz> <energy>
or, if the first word on the line is “NON-ORTHO”:
NON-ORTHO <latvec-1_x> <latvec-1_y> <latvec-1_z> <latvec-2_x> <latvec-2_y> <latvec-2_z> <latvec-3_x> <latvec-3_y> <latvec-3_z> <s_xx> <s_yy> <s_zz> <s_xy> <s_xz> <s_yz> <energy>
where \(s\_ab\) are the \(ab\) stress tensors and energy is the overall system energy, and \(\mathrm{latvec-}i\_a\) is the \(a^{\rm{th}}\) component of the \(i^{\rm{th}}\) lattice vector.
Note that stress tensors and energies are optional and thier inclusion is indicated by FITSTRS and FITENER in the fm_setup.in file. Additional details can be found in the corresponding sections below.
FITSTRS
This keyword controls if and how per-configuration stress tensors are included in the fit. Options are:
false:Do not include any stress tensor information in the fit
true:Fit to stress tensors for all training trajectory frames, but only expect/consider \(xx\), \(yy\), and \(zz\) stress tensor components
ALL:Fit to stress tensors for all training trajectory frames, and expect/consider all stress tensor components
FIRST <integer>:Fit to stress tensors for the overall first
<integer>training trajectory frames, but only expect/consider \(xx\), \(yy\), and \(zz\)
FIRSTALL <integer>:Fit to stress tensors for the overall first
<integer>training trajectory frames, and expect/consider all stress tensor components
FITENER
This keyword controls if and how per-configuration energies are included in the fit. Options are:
false:Do not include any energy information in the fit
true:Include/expect an energy for each configuration in the training trajectory(s)
FIRST <integer>:Include/expect an energy for the overall first
<integer>configurations in the training trajectory(s)
FITCOUL
This keyword defines whether charges should be fit, or held fixed at user-defined values. Note that currently, functionality is only supported when FITCOUL is true, or when FITCOUL is false and all charges are zero. If FITCOUL is false, but charges are non-zero, program will attempt to subtract charge contributions from forces.
CHBTYPE
This keyword defined the pair distance transformation method (i.e. for compatibility with the [-1,1] domain over which Chebyshev polynomials are defined). Currently, the only options are MORSE and DEFAULT (i.e. “direct”). See reference “Water-1” in Citing ChIMES for additional details. If MORSE is selected, meaningful values of MORSE_LAMBDA should be specified.
EXCLUDE
Interactions corresponding to specific atom triplets can be excluded from the fitting process by including the following lines above the NATMTYP entry in the input file, e.g. to exclude O-O-O (comprised of OO, OO, and OO atom pairs) and C-O-O 3-body interactions (comprised of CO, CO, and OO pairs) from the fit:
EXCLUDE 3B INTERACTION: 2
O O O
C O O
Note: under certain circumstances, users might want to exclude 1- and 2-body interactions, see more details on the ChIMES Active Learning Driver documentation.
CHGCONS
If the user desires to fit charges during the force matching process, they can either do so with no constraints (the default option), or by specifying n_atom_pairs -1 constraints. These constraints are added to the end of the input file, before # ENDFILE #. Take water, as an example; here we have 3 pair types, OO, OH, and HH. We want to enforce that the sum of all charges is zero, and that H has half the charge of O, and we can do so by adding the following lines:
CHARGE CONSTRAINTS:
OO HH OH 1000.0 -4000.0 0.0 0.0
OO HH OH 1000.0 4000.0 4000.0 0.0
The first line can be re-written as the equation: \(1000q_\mathrm{O}q_\mathrm{O} - 4000q_\mathrm{H}q_\mathrm{H} = 0\), and enforces the relationship that \(|q_\mathrm{O}| = 2|q_\mathrm{H}|\). The second line can be re-written as the equation: \(1000q_\mathrm{O}q_\mathrm{O}+ 4000q_\mathrm{H}q_\mathrm{H} + 4000q_\mathrm{O}q_\mathrm{H} = 0\), and enforces that the sign of \(q_\mathrm{OH}\) needs to be opposite of \(q_\mathrm{HH}\) and \(q_\mathrm{OO}\) (i.e. negative), and the relationship \(|q_\mathrm{OH}| = 2|q_\mathrm{HH}|\). Note that each line needs entries for all atom pairs.
SPECIAL XB MINIM
By default, many-body inner cutoffs are taken to be equivalent to the constituent 2-body inner cutoffs. One has the option of setting all inner cutoffs of a given bodiedness (e.g. 3-body) to an equivalent value by adding the following line to the end of the fm_setup.in file:
SPECIAL 3B S_MINIM: ALL 0.0
Otherwise, each cutoff can be specified separately through syntax similar to the following:
SPECIAL 3B S_MINIM: SPECIFIC 4
OOOOOO OO OO OO 2.00000 2.00000 2.00000
OOOHOH OO OH OH 2.00000 0.80000 0.80000
HHOHOH OH OH HH 0.80000 0.80000 1.00000
HHHHHH HH HH HH 1.00000 1.00000 1.00000
Where the “4” is the number of cutoffs to be listed. For example, for 3-body interactions, the first column specifies the syntax order of all three pairs. The next three columns specifies the syntax of all three pairs. The last three columns specifies the special cutoff values. Any many-body type for which a line is not provided will use the same S_MINIM as the 2-body interactions, where constituent pairs determine the cutoff.
SPECIAL XB MAXIM
By default, many-body outer cutoffs are taken to be equivalent to the constituent 2-body outer cutoffs. One has the option of setting all outer cutoffs of a given bodiedness (e.g. 3-body) to an equivalent value by adding the following line to the end of the fm_setup,in file:
SPECIAL 3B S_MAXIM: ALL 4.0
Otherwise, each cutoff can be specified separately through syntax similar to the following:
SPECIAL 3B S_MAXIM: SPECIFIC 4
CCCCCC CC CC CC 4.4 4.4 4.4
COCOCC CC CO CO 4.4 4.0 4.0
OOCOCO CO CO OO 4.0 4.0 6.5
OOOOOO OO OO OO 6.5 6.5 6.5
Where the “4” is the number of cutoffs to be listed. For example, for 3-body interactions, the first column specifies the syntax order of all three pairs. The next three columns specifies the syntax of all three pairs. The last three columns specifies the special cutoff values. Any many-body type for which a line is not provided will use the same S_MAXIM as the 2-body interactions, where constituent pairs determine the cutoff.
Note: ChIMES does not currently support many-body cutoffs to be larger than 2-body cutoffs. In addition, this specification is not compatible with LAMMPS.
FCUTTYP
This keyword specifies Chebyshev potential cutoff function types and corresponding parameters. FCUTTYP should be specified on its own line, under # PAIRIDX # entries. Currently supported options include CUBIC and TERSOFF <float>, where <float> gives the cutoff function kick-in distance as: \(r_\mathrm{cut} - <\mathrm{float}>r_\mathrm{cut}\), and should take on values between 0 and 1. For example, TERSOFF 0.3 modified the last 30% of the interaction.
Running
Generating the design matrix (chimes_lsq)
To generate a ChIMES design matrix and related metadata files, an input file, fm_setup.in must be created, and a training trajectory file(s) must be present at the location(s) specified by the TRJFILE option in fm_setup.in. With these files in place, the following command should be run:
/path/to/repo/build/chimes_lsq fm_setup.in > fm_setup.log
which will produce several output files, listed below. Note that if SPLITFI is set true in fm_setup.in, some files will be output as several file.<zero-padded-number>.txt rather than a single file.txt. dim*txt files contain additional information on this splitting.
Note that, for historical reasons, energy entries are repeated three times in A.txt, b.txt, b-labeled.txt, and natoms.txt files.
Output files
A.txt (The ChIMES design matrix)
For force- or force- and stress-only fits, columns contain each \(\frac{\partial X _{a,\mathrm{ChIMES} \{c\}}}{\partial c_{b}}\) value for a given coefficient \(c\) with index \(b\).
If energies are included in the fit, the final columns are the number of atoms of each type in corresponding training trajectory frame
Each row corresponds to the \(i\), \(j\), or \(k^{\mathrm{th}}\) force component of a given atom followed by the 0 (False), 3 (True), or 9 (ALL) stress components and the 0 (False) or 3 (True) energy values in the corresponding training trajectory frame. For easier mapping see the b-labeled.txt file
Thus, 2-body only, force-only fit with polynomial order 10 for a 25 frame trajectory containing 300 atoms,
A.txtwould contain \(25 \times 300 \times 3 = 22,500\) rows and 10 columns.
If
SPLITFIis true:Rather than a single
A.txtfile, severalA.<zero-padded-number>.txtfiles are produced, which contain a subset of chimes design matrix rows. Seedim.txtbelow for additional details.
b.txt (DFT forces, and optional stresses and energies)
This file contains the DFT per-atom forces and optionally configuration stresses and energies extracted from the training trajectory
Each row corresponds to the \(i\), \(j\), or \(k^{\mathrm{th}}\) force component of a given atom followed by the 0 (False), 3 (True), or 9 (ALL) stress components and the 0 (False) or 3 (True) energy values in the corresponding training trajectory frame.
Thus, 2-body only, force-only fit for a 25 frame trajectory containing 300 atoms,
b.txtwould contain \(25 \times 300 \times 3 = 22,500\) rows, irrespective of model complexity (i.e. bodiedness and polynomial order).
If
SPLITFIis true:Rather than a single
b.txtfile, severalb.<zero-padded-number>.txtfiles are produced, which contain a subset ofb.txtrows. Seedim.txtbelow for additional details.
b-labeled.txt (DFT forces, and optional stresses and energies, labeled)
This file is a replicate of b.txt which contains additional information on each entry. If the row corresponds to a force, the first and second column entries will provide atom chemical symbol and corresponding DFT force. For energies, rows are label with a “+1”, and for stresses, “\(s\_ab\)”, where ab indicates the corresponding tensor component (e.g. xx, yy, zz, etc.).
Additional annotation in the b-labeled.txt file can be realized by decoration of the optional
traj_list.datfile, when thefm_setup.inkeywordTRJFILEuses theMULTIoption.If
SPLITFIis true:Rather than a single
b-labeled.txtfile, severalb-labeled.<zero-padded-number>.txtfiles are produced, which contain a subset ofb-labeled.txtrows. Seedim.txtbelow for additional details.
dim.txt (Output file dimensions)
If
SPLITFIis false:The first value in
dim.txtis the number of columns inA.txt, and the second is the number of rows inA.txt,b.txt,b-labeled.txt, andnatoms.txt.
If
SPLITFIis true:A.txt,b.txt,b-labeled.txt, andnatoms.txtare split into sub-files containing approximately the same number of lines, named likefile.<zero-padded-number>.txt, e.g.A.0001.txt. Correspondingdim.<zero-padded-number>.txtfiles are produced, where the first value is the number of columns in theA.<zero-padded-number>.txt(which is the same in all files), and the last value is the total number of rows that would be present in each ofA.txt,b.txt,b-labeled.txt, andnatoms.txtif they were output in thier unsplit form. Finally, the second and third columns infile.<zero-padded-number>.txtindicate the first and last row of the unsplit file contained infile.<zero-padded-number>.txt.
natoms.txt (Number of atoms in training trajectory)
If
SPLITFIis false:For each row in
A.txt,b.txt, andb-labeled.txt, provides the number of atoms in the training trajectory frame from which information in the row originated. Useful for weighting or evaluating per-atom quantities.
If
SPLITFIis true:For each row in
A.<zero-padded-number>.txt,b.<zero-padded-number>.txt, andb-labeled.<zero-padded-number>.txt, provides the number of atoms in the training trajectory frame from which information in the row originated. Useful for weighting or evaluating per-atom quantities.
params.header (metadata)
Contains metadata (e.g. hyperparameters, headers, etc.) from
fm_setup.in.
ff_groups.map (metadata)
Contains metadata (e.g. mapping between many-body atom types and thier corresponding integer type).
Tips and tricks
Inner cutoffs are typically set to the lowest sampled distance for each given pair type in the training trajectory
For molecular systems:
2-body outer cutoffs are usually set to encompass at least the second non-bonded solvation shell (determined by examination of the DFT-MD generated radial pair distribution functions, RDFs), or around 8 Å
3-body outer cutoffs are usually set to encompass the first non-bonded solvation shell
4-body outer cutoffs are usually set to between the first or second RDF minimum; choice depends on the nature of the system (i.e. do we need to describe extended 4-body interactions like molecular torsions, or more short-ranged interactions like those found in small molecules?
The “Morse parameter” is usually set to the distance corresponding to the the first (bonding) peak in the in the pair RDF
There is no hard-and-fast rule for setting the 2/3/4-body polynomial orders, but 12/7/3 is generally a good starting point
Weighting is often needed when energies and stresses tensors are included in the fit. Typical respective weighting factors are 0.1 – 5.0 for energy and 200 – 500 for stress
If a Tersoff-style cutoff function is used, a Tersoff variable of between 0.25 and 0.75 is reasonable, where larger values modify the interaction at shorter distances, but can make it easier to obtain a smooth interaction
Be mindful of outer cutoff distances with respect to box lengths. Training data are typically generated with small (i.e. DFT-friendly) cells
If you don’t want to determine all pair/triplet/quadruplet interaction types and inner cutoffs for
fm_setup.inby hand, setup a dummyfm_setup.infile with:# PAIRTYP#set toCHEBYSHEV 2 2 2 -1 1,2-body inner cutoffs = 0.0
2-body outer cutoffs = something tiny (i.e. 0.5)
Do not explicitly set 3- or 4-body outer cutoffs (the code will use the 2-body value for them)
Run the
chimes_lsqcode using this fileCheck the bottom section of the output, which lists all this information for you
Solving for ChIMES parameters
Once a ChIMES design matrix has been generated, parameters can be obtained by running chimes_lsq.py. Note that this script depends on native numpy and scipy, installations for python3.x, and has numerous features, detailed below. If chimes_lsq was run with SPLITFI set to false and no weighting is desired, one can solve for ChIMES parameters using principal component analysis based on the singular value decomposition of the design matrix (“SVD”) with default regularization (1.0e-05) via:
python3.x /path/to/repo/build/chimes_lsq.py > params.txt
however we note that chimes_lsq supports many other solvers, summarized below:
Supported solvers
Note: Asterisks (*) indicate options described in greater detail below
Method |
Additional dependencies |
Related |
|---|---|---|
SVD |
None |
|
Fast SVD |
None |
|
LASSO |
|
|
LASSOLARS |
|
|
DLARS* |
DLARS |
|
DLASSO* |
DLARS |
|
Using DLARS and DLASSO
DLARS and DLASSO provided a capability for distributed design matrix solution. This feature is particularly useful for large training sets or high complexity fits, but is best reserved for runs on high-performance computers.
Options and flags
Flag |
Option type |
Default value |
Description |
|---|---|---|---|
|
str |
A.txt |
Design matrix |
|
str |
svd |
Fitting algorithm (Supported options: svd, lasso, lassolars, dlars, and dlasso) |
|
str |
N/A |
Path to DLARS/DLASSO solver |
|
float |
1.0e-04 |
Lasso or ridge regularization |
|
str |
b.txt |
Reference force file |
|
int |
8 |
DLARS/DLASSO total number of cores (i.e., nodes * cores-per-node) |
|
float |
1.0e-05 |
SVD regularization |
|
str |
params.header |
Parameter file header |
|
str |
ff_groups.map |
Parameter file map |
|
int |
1 |
DLARS/DLASSO number of nodes |
|
bool |
False |
Normalize DLARS/DLASSO calculation |
|
bool |
False |
Read output from previous DLARS run |
|
str |
N/A |
Determines whether dlasso or dlars job will be restarted. Argument is the restart file name |
|
bool |
False |
LSQ code has split A matrix output (DLARS/DLASSO) |
|
bool |
False |
Output for test suite |
|
str |
N/A |
Weight file |
|
bool |
False |
Is this a DLARS/DLASSO run from the active learning driver? |
Choosing solvers
All solvers support regularization (i.e. via --eps or --alpha), which can aid overfitting mitigation. For LASSO, LASSOLARS, DLARS, and DLASSO solvers, regularization has the added benefit of automatically setting minimally-informative parameters to zero, helping minimize model size and increase efficiency. Note that these zeroed parameters can be scrubbed from the parameter file with post_proc_chimes_lsq.py, e.g.:
python3.x /path/to/repo/build/post_proc_chimes_lsq.py <parameter_file>
which produces a new parameter file named <parameter_file>.reduced. Doing so speeds up the sum over Chebyshev coefficients increasing computational efficiency.
Weighting
Weighting is often needed when energies and stresses tensors are included in the fit. Typical respective weighting factors are 0.1 – 5.0 for energy and 200 – 500 for stress. Weights can be rapidly generated through:
wF=1.0
wE=0.1
wS=100.0
paste b-labeled.txt | awk -v wF="$wF" -v wE="$wE" -v wS="$wS" '{if($1=="+1"){print wE}else if($1~"s_"){print wS}else{print wF}}' > weights.txt
Anatomy of ChIMES parameter file
Variables |
Descriptions |
|---|---|
|
The maximum number of coefficients |
|
The number of equations to solve |
|
The threshold for dropping small singular values in SVD |
|
The regularization parameter for SVD algorithm |
|
The regularization parameter for LASSO regularization |
|
The global root mean square error between training data and ChIMES predictions |
|
The absolute value of the largest coefficient |
|
The number of variables after regularization |
|
The model selection criterion |
|
The file containing weights for fitting |
|
If the model accounts for system charges |
|
If the model is fit to system charges |
|
If the model is fit to 3-body interactions |
|
If the model is fit to 4-body interactions |
|
If the model excludes 1-body energy |
|
If the model excludes 2-body interactions |
|
2-, 3-, and 4-body Chebyshev polynomial orders |
|
The number of atom types |
|
The index of atom type |
|
The label of atom type |
|
The charge of atom type |
|
The mass of atom type |
|
The number of pair types |
|
The index of the pair type |
|
The label of atom 1 in the pair |
|
The label of atom 2 in the pair |
|
The inner cutoff for the pair |
|
The outer cutoff for the pair |
|
The transformation style for pair distance |
|
The parameter used in Morse transformation |
|
The type of cutoff function used |
|
Special outer cutoff for 3-body interactions |
|
The number of triplet types |
|
The number of quadruplet types |
|
The pair type index and atom types |
|
The triplet type index and atom types; see code block for details |
|
The permutations of pair types |
|
The permutations of triplet types |
|
The number of atoms types |
|
The energy offset of atom type X |
Here we provide an example of the triplet parameters.
1TRIPLET CHEBYSHEV PARAMS
2TRIPLETTYPE PARAMS:
3INDEX: 0 ATOMS: Si Si Si
4PAIRS: SiSi SiSi SiSi UNIQUE: 665 TOTAL: 3332
5 index | powers | equiv index | param index | parameter
6----------------------------------------------------------------------------
7 0 0 1 1 0 0 2.1714984512050e+02
8 1 1 0 1 0 0 2.1714984512050e+02
9 2 1 1 0 0 0 2.1714984512050e+02
10 3 0 1 2 3 1 6.2564844297350e+01
11 4 1 0 2 3 1 6.2564844297350e+01
12 5 0 2 1 3 1 6.2564844297350e+01
13 6 2 0 1 3 1 6.2564844297350e+01
14 7 1 2 0 3 1 6.2564844297350e+01
15 8 2 1 0 3 1 6.2564844297350e+01
16 9 0 1 3 9 2 3.4634223883030e+01
17 10 1 0 3 9 2 3.4634223883030e+01
Line 3 specifies the index of the triplet and the atoms in the triplet. Line 4 specifies all pairs in the triplet, lists the number of unique triplet power index and the total number of triple power index.
Line 4 specifies the pair combinations used in the triplet block
Line 5:
index: Specifies the index of the coefficient in the triplet sum
powers: The powers of the three Chebyshev polynomials
equiv index: The index of the first instance of the permutationally invariant coefficient
param index: The unique coefficient index based of equiv index
parameter: The value of the coefficients
Tips and tricks
Weighting and regularization can all impact quality of resulting models, and their values should be considered model hyperparameters and carefully explored for each fitting problem.
For LARS or LASSO-based solvers, regularization of 1.0e-2 or 1.0e-5 are reasonable starting points for un-normalized and normalized fits, respectively
Running
chimes_lsq.pycan be memory intensive. Check the size of A.txt relative to the available memory (RAM) on the workhorse machine prior to running.
Checking fit quality
Though it is critical to evaluate model performance on the basis of physical predictions, it can be helpful to evaluate model performance through target force, energy, or stress recovery. This can be easily obtained through:
paste b-labeled.txt force.txt | awk '{if($1=="+1") {print ($2,$3)}}' > compare-energies.txt
paste b-labeled.txt force.txt | awk '{if($1~ "s_") {print ($2,$3)}}' > compare-stresses.txt
paste b-labeled.txt force.txt | awk '{if(($1!~"s_")&&($1!="+1")){print ($2,$3)}}' > compare-forces.txt
xmgrace compare*txt
If optional <F_flag>, <S_flag>, and <E_flag> flags in traj_list.dat yield more information in b-labeled.txt from which selections can be parsed, as can consideration of natoms.txt.
Common Issues
My model is requiring SO many iterations to become stable
Have you set reasonable inner cutoffs (i.e. 0.02 to 0.002 less than the smallest observed distance, set individually for each pair type)?
Have you remembered to set the penalty parameters (i.e. \(A_{\mathrm{p}}\) from 1.0e5 to 1.0e6, \(d_{\mathrm{p}}\) from 0.01 to 0.05)?
Have you tried using a smaller timestep during the early training stages? (e.g., if DFT simulations for a C/H/O/N system used a 0.5 fs timestep)?
ChIMES training simulations should use something closer to 0.1 – 0.2
Small timesteps allow the ChIMES simulations to remain stable for longer, which can help make iteratively-obtained training data more informative
My model is yielding very bad RDFs/is missing significant RDF features
Have you accidentally grabbed only sequential configurations?
Have you watched your DFT trajectory to make sure the system exhibits significant evolution?
For example physical properties of a melted material will generally not be recovered from training to only solid phase data
Have you checked that your actual training trajectory contains those features? For example, you can quickly check by computing your RDFs directly from the training trajectory and checking you peaks in all the expected positions.
My model just isn’t as good as expected
Are you using an appropriate bodiedness/polynomial order?
Holdout cross-validation on the original training set can give you an idea of what to use.
Are your cutoffs appropriate?
Cutoffs cannot be arbitrarily short (they should encapsulate the relevant physics) or long (they should preclude self-interaction across the training periodic boundary conditions)
Caveats
ChIMES models are not inherently transferable - training set generation should be carefully considered prior to model generation
High-complexity (e.g. bodiedness and polynomial order) models for complex molecular materials (e.g. C-containing) will not generally be stable for dynamics the first time through - in these cases, iterative model fitting is usually required. See link for additional details.
ChIMES will attempt to generate single-atom energies - these values cannot be accurately determined unless the training set contains frames with varied atomic composition, otherwise, ChIMES assigns one atom the sum of single-atom energies for each atom type.