AnalysisPackage
This is a package of numerous analysis methods.
Initialization
The purpose of this initialization is to set up some commonly used input/output directories and constants.
Methods:
- diss2csv()
  generates a summary spreadsheet (csv format) for the dissociation identification array.
- etot2csv()
  generates a summary spreadsheet (csv format) for the total energy array.
- calc_bond()
  calculates bond lengths for the atom pairs defined by the select array, a python list of atomic number pairs.
- angle_between()
  calculates the angle between vector v1 and vector v2.
- angle_signed()
  calculates the signed angle from vector v1 to vector v2 referenced to normal vector n.
- calc_angle()
  calculates bond angles for the atom sets defined by the select array, a python list of three-atom tuples representing three points in space, A, B, C; the angle is defined as the angle from vector BA to vector BC.
- rotation_matrix_given_axis()
  generates a 3-dimensional rotation matrix for a given rotation axis.
- rotation_matrix_target_vector()
  uses the above method to generate a rotation matrix that makes one vector coincide with another.
- cwt()
  calculates the continuous wavelet transform (CWT) of trajectory data based on the dissociation identification array.
Subclass: class LinearRegression
This class is a simple implementation (gradient descent) of two basic machine learning algorithms, linear regression and logistic regression. Due to the fundamental similarity between these two algorithms, logistic regression is treated as a variation of linear regression.
Methods:
- init_training_set()
  initializes the training set.
- normlx()
  does "feature scaling".
- train()
  trains the parameters, theta, on a training set using the gradient descent algorithm.
- predict()
  uses the optimized theta array to predict y values by applying the hypothesis function, h.
- Initialization: AnalysisPackage(dir_in=None, dir_out=None, ref_energy=None, energy_unit="Hartree")
The purpose of this initialization is to set up some commonly used input/output directories and constants.
- Parameters:
  - dir_in: string
    optional, default=None. Input directory from which the input arrays will be read. If nothing is passed in, the current directory is used.
  - dir_out: string
    optional, default=None. Output directory to which the output arrays will be written. If nothing is passed in, the current directory is used.
  - ref_energy: float
    optional, default=None. Reference energy value for some of the methods in this package.
  - energy_unit: string
    optional, default="Hartree". String label for the energy unit used.
- AnalysisPackage.diss2csv(diss=None, name=None, diss_time=True, time_step=0.25)
This method generates a summary spreadsheet (csv format) for the dissociation identification array.
- Parameters:
  - diss: numpy array
    optional, default=None. Dissociation identification array. If not passed in, the method assumes the input data will be read from disk by specifying the name parameter.
  - name: string
    optional, default=None. Dissociation identification array name.
  - diss_time: bool
    optional, default=True. Turn on generating the dissociation time stamp summary.
  - time_step: float
    optional, default=0.25. Time step size in whatever unit the original data was using.
- AnalysisPackage.etot2csv(etot=None, name=None, pinpoint=None, avg=False, ref_energy=None, unit_out="kcal/mol")
This method generates a summary spreadsheet (csv format) for the total energy array.
- Parameters:
  - etot: numpy array, python list, numpy npz package
    optional, default=None. Total energy array. If not passed in, the method assumes the input data will be read from disk by specifying the name parameter.
  - name: string
    optional, default=None. Total energy array name.
  - pinpoint: numpy array
    optional, default=None. Pinpoint array that contains the indices of representative points along each trajectory in each dissociation channel. This is usually generated by the method DissociationDetect.pinpoint_gen() in the DissociationDetect class.
  - avg: bool
    optional, default=False. Take the average along a trajectory instead of using the pinpoint array.
  - ref_energy: float
    optional, default=None. Reference energy value so that the output energies will be relative energies with respect to this value. If not passed in, the reference energy value specified during the initialization of AnalysisPackage will be used. If neither was given, 0.0 will be taken.
  - unit_out: string
    optional, default="kcal/mol". Output unit.
- AnalysisPackage.calc_bond(select, xyz_raw=None, xyz_name=None, xyz_unit="Bohr", bond_unit="Angstrom", save=False)
This method calculates bond lengths for the atom pairs defined by the select array, which is a python list of atomic number pairs.
- Parameters:
  - select: numpy array, python list of 2-element tuples/lists
    first argument. This is a python list of atomic number pairs that defines the set of bond lengths to be calculated.
  - xyz_raw: numpy npz package
    optional, default=None. Pass in the numpy npz package of Cartesian coordinates directly. It is assumed that this array has the following dimensionality: xyz_raw[..., atom, xyz].
  - xyz_name: string
    optional, default=None. Pass in the numpy npz package of Cartesian coordinates indirectly. The package will be read from dir_in+xyz_name. If both attempts at obtaining the Cartesian coordinates array fail, an error is raised. It is also assumed that this array has the dimensionality xyz_raw[..., atom, xyz].
  - xyz_unit: string
    optional, default="Bohr". Input Cartesian coordinates unit.
  - bond_unit: string
    optional, default="Angstrom". Output bond length unit. A conversion factor will be generated based on the choice of these two optional parameters.
  - save: bool
    optional, default=False. Save the results into a numpy npz package.
- Returns: a list of numpy arrays, each of which contains the bond lengths of the selected atomic pairs for one trajectory.
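The core of such a bond-length calculation can be sketched in plain numpy (a minimal sketch, not the package's actual implementation; the function name `bond_lengths` is illustrative), assuming the xyz_raw[..., atom, xyz] layout described above:

```python
import numpy as np

BOHR2ANG = 0.529177210903  # Bohr -> Angstrom conversion factor

def bond_lengths(xyz, select, factor=BOHR2ANG):
    """Bond lengths for atom-index pairs in `select`.
    xyz has shape [..., atom, 3]; returns shape [..., n_pairs]."""
    i, j = np.asarray(select).T             # unpack the pair indices
    diff = xyz[..., i, :] - xyz[..., j, :]  # displacement vectors per pair
    return np.linalg.norm(diff, axis=-1) * factor

# Two atoms 2.0 Bohr apart along x:
xyz = np.array([[0.0, 0.0, 0.0],
                [2.0, 0.0, 0.0]])
print(bond_lengths(xyz, [(0, 1)]))  # ~1.0584 Angstrom
```

Because the atom axis is indexed with fancy indexing, any number of leading trajectory/time dimensions broadcasts through unchanged.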
- AnalysisPackage.angle_between(v1, v2)
This method calculates the angle between vector v1 and vector v2. This is a vectorized implementation; therefore v1, v2, and the result can be multi-dimensional arrays.
- Parameters:
  - v1: numpy array
    first argument. The first vector.
  - v2: numpy array
    second argument. The second vector.
- Returns: numpy array that contains the angle(s) between v1 and v2.
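The unsigned angle amounts to an arccos of the normalized dot product; a minimal vectorized sketch (illustrative, not the package's code):

```python
import numpy as np

def angle_between(v1, v2):
    """Unsigned angle between vectors along the last axis, in radians."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    cos = np.sum(v1 * v2, axis=-1) / (
        np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against round-off

print(np.degrees(angle_between([1, 0, 0], [0, 1, 0])))  # 90.0
```

The clip matters in practice: for nearly parallel vectors, floating-point round-off can push the cosine slightly past 1 and make a bare arccos return NaN.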
- AnalysisPackage.angle_signed(v1, v2, n)
This method calculates the signed angle from vector v1 to vector v2 referenced to normal vector n. The sign of the angle is determined by comparing the cross product (2-D or 3-D cross product) of v1 and v2 with the normal vector n: if the cross product is co-linear with n, the angle is positive; if anti-linear, negative. This is also a vectorized implementation; therefore v1, v2, n, and the result can be multi-dimensional arrays.
- Parameters:
  - v1: numpy array
    first argument. The first vector.
  - v2: numpy array
    second argument. The second vector.
  - n: python list, numpy array
    third argument. The normal vector to which the sign of the angle is referenced.
- Returns: numpy array that contains the signed angle(s) from v1 to v2 with respect to normal vector n.
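The sign rule described above can be sketched for 3-D vectors as follows (a simplified stand-in, not the package's implementation):

```python
import numpy as np

def angle_signed(v1, v2, n):
    """Signed angle from v1 to v2; positive if cross(v1, v2) points along n."""
    v1, v2, n = (np.asarray(a, dtype=float) for a in (v1, v2, n))
    cos = np.sum(v1 * v2, axis=-1) / (
        np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1))
    ang = np.arccos(np.clip(cos, -1.0, 1.0))
    # Project the cross product onto n to decide the sign.
    sign = np.sign(np.sum(np.cross(v1, v2) * n, axis=-1))
    return np.where(sign < 0, -ang, ang)

# Rotating x toward y is +90 deg about z; y toward x is -90 deg:
print(np.degrees(angle_signed([1, 0, 0], [0, 1, 0], [0, 0, 1])))  # 90.0
print(np.degrees(angle_signed([0, 1, 0], [1, 0, 0], [0, 0, 1])))  # -90.0
```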
- AnalysisPackage.calc_angle(select, xyz_raw=None, xyz_name=None, v_normal=None, signed=True, unit="degree", save=False)
This method calculates bond angles for the atom sets defined by the select array, which is a python list of atomic number tuples containing three atom numbers, representing three points in space, A, B, C. The angle is defined as the angle from vector BA to vector BC.
- Parameters:
  - select: numpy array, python list of 3-element tuples/lists
    first argument. This is a python list of atomic number sets that defines the set of bond angles to be calculated.
  - xyz_raw: numpy npz package
    optional, default=None. Pass in the numpy npz package of Cartesian coordinates directly. It is assumed that this array has the following dimensionality: xyz_raw[..., atom, xyz].
  - xyz_name: string
    optional, default=None. Pass in the numpy npz package of Cartesian coordinates indirectly. The package will be read from dir_in+xyz_name. If both attempts at obtaining the Cartesian coordinates array fail, an error is raised. It is also assumed that this array has the dimensionality xyz_raw[..., atom, xyz].
  - v_normal: python list, numpy array
    optional, default=None. The normal vector with respect to which the signed angle is calculated. If not passed in, the Z-axis, i.e. the vector [0, 0, 1], will be taken. This is only meaningful if signed rather than unsigned angles are to be calculated.
  - signed: bool
    optional, default=True. Turn on signed angle calculation.
  - unit: string
    optional, default="degree". The unit of the output angles. The other option is "radian". A conversion factor will be generated based on this option.
  - save: bool
    optional, default=False. Save the results into a numpy npz package.
- Returns: a list of numpy arrays, each of which contains the bond angles of the selected atomic sets for one trajectory.
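The BA/BC construction above, i.e. the unsigned angle at the middle atom B, can be sketched as (hypothetical helper name, same xyz[..., atom, 3] layout):

```python
import numpy as np

def bond_angle(xyz, select, degrees=True):
    """Unsigned angle at atom B for each (A, B, C) index triple in `select`.
    xyz has shape [..., atom, 3]."""
    a, b, c = np.asarray(select).T
    ba = xyz[..., a, :] - xyz[..., b, :]  # vector B -> A
    bc = xyz[..., c, :] - xyz[..., b, :]  # vector B -> C
    cos = np.sum(ba * bc, axis=-1) / (
        np.linalg.norm(ba, axis=-1) * np.linalg.norm(bc, axis=-1))
    ang = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.degrees(ang) if degrees else ang

# Right angle at the origin atom:
xyz = np.array([[1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
print(bond_angle(xyz, [(0, 1, 2)]))  # [90.]
```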
- AnalysisPackage.rotation_matrix_given_axis(axis_vector, angle2rotate)
The formula for this method is taken from Wikipedia at:
https://en.wikipedia.org/wiki/Rotation_matrix#cite_note-5
which was originally referenced at https://dspace.lboro.ac.uk/dspace-jspui/handle/2134/18050
This method generates a 3-dimensional rotation matrix for a given rotation axis. The input axis_vector can be multi-dimensional, but the last axis needs to be (x,y,z). The number of dimensions of angle2rotate has to be one less than that of axis_vector when not considering broadcasting. In other words, a multi-dimensional list of axis vectors should come with a same-dimensional list of rotation angle values; since a vector inherently has one more dimension than a scalar, hence the dimensionality requirement. The method automatically takes care of two broadcasting cases: 1. when axis_vector has the lowest possible dimension, i.e. is 1-dimensional; and 2. when angle2rotate has the lowest possible dimension, i.e. is 0-dimensional, a single scalar. Only for these two cases is the broadcasting unambiguous without prior knowledge of the dimensionality of the two input arrays; in other cases, custom broadcasting can be done by feeding this method arrays that have singleton dimensions suitable for broadcasting. The implementation is vectorized.
- Parameters:
  - axis_vector: python list, numpy array
    first argument. The vector(s) representing the rotation axis (axes).
  - angle2rotate: float, numpy array
    second argument. The angle(s) of the rotation(s).
- Returns: numpy array. The 3 by 3 rotation matrices are appended to the shared dimensions between the two input arrays. That is, given axis_vector[a,b,...,n,3] and angle2rotate[a,b,...,n], the output array is rot_mat[a,b,...,n,3,3].
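The axis-angle formula referenced above can be written out for a single axis vector as follows (a minimal, non-vectorized sketch; the package's version is vectorized):

```python
import numpy as np

def rotation_matrix(axis, angle):
    """3x3 rotation matrix about `axis` by `angle` (radians),
    via the axis-angle (Rodrigues) formula."""
    x, y, z = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    c, s = np.cos(angle), np.sin(angle)
    C = 1.0 - c
    return np.array([
        [c + x * x * C,     x * y * C - z * s, x * z * C + y * s],
        [y * x * C + z * s, c + y * y * C,     y * z * C - x * s],
        [z * x * C + y * s, z * y * C - x * s, c + z * z * C],
    ])

# Rotating x by 90 degrees about z gives y:
R = rotation_matrix([0, 0, 1], np.pi / 2)
print(np.round(R @ [1, 0, 0], 12))
```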
- AnalysisPackage.rotation_matrix_target_vector(vector1, vector2)
This method uses the above method to generate a rotation matrix. The axis of rotation is calculated as vector1 cross vector2, and the angle to rotate is the signed angle from vector1 to vector2. The result of such a rotation should make vector1 coincide with vector2.
- Parameters:
  - vector1: numpy array
    first argument. First vector.
  - vector2: numpy array
    second argument. Second vector.
- Returns: numpy array. The 3 by 3 rotation matrices are appended to the shared dimensions between the two input arrays.
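The idea, axis from the cross product and angle between the two vectors, can be sketched for single vectors using the cross-product matrix form of the same formula (hypothetical helper; raising on the (anti-)parallel edge case is an assumption about behavior, since the axis is then ambiguous):

```python
import numpy as np

def rotation_onto(v1, v2):
    """Rotation matrix R such that R @ v1_hat = v2_hat."""
    a = np.asarray(v1, dtype=float) / np.linalg.norm(v1)
    b = np.asarray(v2, dtype=float) / np.linalg.norm(v2)
    v = np.cross(a, b)            # rotation axis, length sin(angle)
    s = np.linalg.norm(v)
    c = a @ b                     # cos(angle)
    if s < 1e-12:
        raise ValueError("vectors are (anti-)parallel; rotation axis is ambiguous")
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])  # cross-product matrix of the axis
    # Rodrigues formula: R = I + K + K^2 * (1 - cos) / sin^2
    return np.eye(3) + K + K @ K * ((1.0 - c) / s ** 2)

a, b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0])
R = rotation_onto(a, b)
print(np.allclose(R @ a, b / np.linalg.norm(b)))  # True
```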
- AnalysisPackage.cwt(data=None, data_name=None, diss=None, diss_name="diss.npy", axis_time=0, data_reg=False, threshold_reg=1000, msg_print=True, save=False)
This method calculates the continuous wavelet transform (CWT) of trajectory data based on the dissociation identification array. For general usage of CWT, simply use the wavelets module directly.
The wavelets module is credited to https://github.com/aaren/wavelets by Aaron O'Leary.
- Parameters:
  - data: python list, numpy npz package
    optional, default=None. Data array to have the CWT performed on. If not passed in, the method assumes the input data will be read from disk by specifying the data_name parameter.
  - data_name: string
    optional, default=None. Data array name for reading directly from disk.
  - diss: python list, numpy npz package
    optional, default=None. Dissociation identification array. If not passed in, the method assumes the input data will be read from disk by specifying the diss_name parameter.
  - diss_name: string
    optional, default="diss.npy". Dissociation identification array name for reading directly from disk.
  - axis_time: integer
    optional, default=0. The index of the dimension along which time steps are located. If the first dimension of the input data is not time steps, the specified dimension will be rolled to the first dimension.
  - data_reg: bool
    optional, default=False. Turn on data regulation. In some cases, the resulting CWT power spectra can contain a few extremely large values that cause significant distortion. This is especially common when averaging over many spectra for a set of trajectories. Discarding the problematic spectra is appropriate since such an occurrence is usually the result of an ill-behaved trajectory.
  - threshold_reg: float, integer, positive
    optional, default=1000. The threshold value for data regulation.
  - msg_print: bool
    optional, default=True. Print messages during CWT calculations. This is useful when calculating a large number of separate CWTs.
  - save: bool
    optional, default=False. Set to True if the various results need to be saved directly to disk.
- Returns:
  - scales: numpy array
    first return. Wavelet transform scales.
  - power_perch: python list
    second return. Power spectra for each trajectory in each dissociation channel.
  - power_sum: numpy array
    third return. Average power spectra over all trajectories in each dissociation channel.
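For readers who want the CWT mechanics without the package, a Morlet CWT can be sketched in plain numpy (a simplified stand-in, not the wavelets module credited above; the truncation length and normalization are assumptions):

```python
import numpy as np

def morlet_cwt(signal, scales, dt=1.0, w0=6.0):
    """Continuous wavelet transform with a Morlet mother wavelet,
    computed by direct convolution at each scale."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    out = np.empty((len(scales), n), dtype=complex)
    for k, s in enumerate(scales):
        # Truncate the wavelet a few Gaussian widths out, and keep it
        # shorter than the signal so mode="same" preserves the length.
        m = int(min(5 * s / dt, (n - 1) // 2))
        tau = np.arange(-m, m + 1) * dt
        psi = (np.pi ** -0.25) * np.exp(1j * w0 * tau / s - 0.5 * (tau / s) ** 2)
        psi *= np.sqrt(dt / s)  # keep amplitudes comparable across scales
        out[k] = np.convolve(signal, np.conj(psi)[::-1], mode="same")
    return out

# A pure oscillation with period 20 steps peaks near scale w0/(2*pi*f) ~ 19:
t = np.arange(400)
sig = np.sin(2 * np.pi * t / 20.0)
scales = np.array([5, 10, 15, 20, 25, 30])
power = (np.abs(morlet_cwt(sig, scales)) ** 2).sum(axis=1)
print(scales[np.argmax(power)])  # 20
```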
class AnalysisPackage.LinearRegression(object):
This class is a simple implementation (gradient descent) of two basic machine learning algorithms, linear regression and logistic regression. Due to the fundamental similarity between these two algorithms, logistic regression is treated as a variation of linear regression.
The idea of machine learning, more specifically supervised machine learning, is to "train" a set of parameters to obtain a mapping from a set of features to a certain answer. The simplest algorithm for obtaining such parameters is gradient descent, applied to a training set with pre-determined answers. The trained parameters can then be used to predict answers for different sets of features.
The parameters are referred to as "theta", the features as "x", the answers as "y", and the hypothesis as "h". Then we have: h(theta * x) = y. (Equation 1) For linear regression, h(theta * x) is simply the matrix multiplication theta * x; for logistic regression, it has one more layer: with z = theta * x, h(theta * x) = 1/(1+exp(-z)).
If x and y are from the training set, theta is optimized to reproduce Equation 1; if x is from an actual problem with no answers, i.e. no y available, the optimized theta is used to predict y via Equation 1.
- Initialization: AnalysisPackage.LinearRegression(x=None, y=None, random_init=True, feature_norm=True, cost_f=None, hyp_f=None, grad_f=None)
- Parameters:
  - x: numpy array
    optional, default=None. Unprocessed "raw" training set feature data (feature scaling or other processing may still be needed).
  - y: numpy array
    optional, default=None. Training set answer data.
  - random_init: bool
    optional, default=True. Turn on random initialization of parameters.
  - feature_norm: bool
    optional, default=True. Turn on feature normalization. This is recommended.
  - cost_f: callable, string
    optional, default=None. The cost function. This can be an external function or one of the cost functions predefined for the two types of regression algorithms mentioned. In the latter case, a string can be passed in to signify the choice: "linear" for linear regression; "logistic" for logistic regression.
  - hyp_f: callable, string
    optional, default=None. The hypothesis function. This can be an external function or one of the hypothesis functions predefined for the two types of regression algorithms mentioned. In the latter case, a string can be passed in to signify the choice: "linear" for linear regression; "logistic" for logistic regression.
  - grad_f: callable
    optional, default=None. The gradient function. This can be an external function or the gradient function predefined for the two types of regression algorithms mentioned. In the latter case, nothing needs to be specified since these two types of regression share the same symbolic gradient function.
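The two predefined hypothesis choices, "linear" and "logistic" (Equation 1 above), amount to the following (a minimal sketch; function names are illustrative, not the package's internals):

```python
import numpy as np

def hyp_linear(theta, x):
    """Linear-regression hypothesis: plain matrix product theta * x."""
    return theta @ x

def hyp_logistic(theta, x):
    """Logistic-regression hypothesis: sigmoid of the linear combination."""
    z = theta @ x
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -0.25])
x = np.array([2.0, 4.0])
print(hyp_linear(theta, x))    # 0.0
print(hyp_logistic(theta, x))  # 0.5
```

The sigmoid squashes the linear output into (0, 1), which is why the logistic variant can be read as a class probability.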
Attributes:
- AnalysisPackage.LinearRegression.flag_ts_init = False
  The flag signifying whether the training set has been initialized.
- AnalysisPackage.LinearRegression.feature_norm = True
  The flag signifying whether feature normalization was done during initialization.
- AnalysisPackage.LinearRegression.x = None
  Processed feature array.
- AnalysisPackage.LinearRegression.y = None
  Answer array.
- AnalysisPackage.LinearRegression.theta = None
  Parameter array.
- AnalysisPackage.LinearRegression.x_mean = None
  Feature mean array for feature scaling.
- AnalysisPackage.LinearRegression.x_std = None
  Feature standard deviation array for feature scaling.
- AnalysisPackage.LinearRegression.cost_f = None
  Cost function.
- AnalysisPackage.LinearRegression.hyp_f = None
  Hypothesis function.
- AnalysisPackage.LinearRegression.grad_f = None
  Gradient function.
- AnalysisPackage.LinearRegression.init_training_set(x=None, y=None, random_init=True)
This method is used to initialize the training set. If the feature and answer arrays were passed in during the initialization of the LinearRegression class itself, this method is not needed.
- Parameters:
  - x: numpy array
    optional, default=None. Unprocessed "raw" training set feature data (feature scaling or other processing may still be needed). If not passed in, whatever is stored in AnalysisPackage.LinearRegression.x will be taken.
  - y: numpy array
    optional, default=None. Training set answer data. If not passed in, whatever is stored in AnalysisPackage.LinearRegression.y will be taken.
  - random_init: bool
    optional, default=True. Turn on random initialization of parameters.
- AnalysisPackage.LinearRegression.normlx(x, array_mean=None, xstd=None, gen_new=False)
This method does "feature scaling", which is advised in most cases. The feature array x undergoes the following process: x_std = standard_deviation(x); norm = (x - mean(x)) / x_std. The purpose of this process is to avoid certain features being numerically much larger or smaller than other features, and it therefore greatly speeds up gradient descent in most cases.
Note: it is assumed that x takes the following dimensionality: x[redundant, examples, features]. For simple problems, x is treated as a 2-D matrix of a collection of features for individual examples. There can be additional, i.e. redundant, dimensions on top of it, such as different classifiers in multi-class classification problems.
The calculated mean and standard deviation are needed to repeat the process on data sets other than the training set. More specifically, the mean and standard deviation arrays produced when normalizing the training set features are needed to perform the exact same normalization on a feature set without answers during prediction with trained parameters.
- Parameters:
  - x: numpy array
    first argument. Feature array that needs to be normalized.
  - array_mean: numpy array
    optional, default=None. Mean of the array. If not passed in, and the attribute AnalysisPackage.LinearRegression.x_mean is None, i.e. never generated, the method generates array_mean; otherwise, either array_mean or the attribute AnalysisPackage.LinearRegression.x_mean will be taken (array_mean has higher priority).
  - xstd: numpy array
    optional, default=None. Standard deviation of the array. If not passed in, and the attribute AnalysisPackage.LinearRegression.x_std is None, i.e. never generated, the method generates xstd; otherwise, either xstd or the attribute AnalysisPackage.LinearRegression.x_std will be taken (xstd has higher priority).
  - gen_new: bool
    optional, default=False. Setting this to True forces the method to generate a new set of array mean and standard deviation.
- Returns:
  - norm: numpy array
    first return. The normalized, i.e. feature scaled, feature array.
  - array_mean: numpy array
    second return. Array mean. This will also be stored in the attribute AnalysisPackage.LinearRegression.x_mean.
  - xstd: numpy array
    third return. Array standard deviation. This will also be stored in the attribute AnalysisPackage.LinearRegression.x_std.
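The scaling process described above can be sketched as follows (a simplified sketch, not the package's implementation; normalizing over the examples axis of the x[..., examples, features] convention, and reusing a stored mean/std when supplied so prediction data gets the exact same transform):

```python
import numpy as np

def feature_scale(x, x_mean=None, x_std=None):
    """Scale each feature to zero mean and unit standard deviation
    over the examples axis (axis=-2)."""
    if x_mean is None:
        x_mean = x.mean(axis=-2, keepdims=True)
    if x_std is None:
        x_std = x.std(axis=-2, keepdims=True)
    return (x - x_mean) / x_std, x_mean, x_std

x = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])
norm, m, s = feature_scale(x)
print(norm.mean(axis=0), norm.std(axis=0))  # ~[0. 0.] and [1. 1.]

# Apply the *training* mean/std to new data:
new_norm, _, _ = feature_scale(np.array([[2.0, 20.0]]), x_mean=m, x_std=s)
```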
- AnalysisPackage.LinearRegression.train(max_cycle=int(1e5), alpha=None, convergence=1e-5)
This method trains the parameters, theta, on a training set using the gradient descent algorithm.
- Parameters:
  - max_cycle: integer, float
    optional, default=int(1e5). Maximum number of iterations before the method aborts. This is simply a safety measure due to the iterative nature of this method.
  - alpha: float
    optional, default=None. Learning rate, a factor determining how large a step each iteration takes in the direction of the descending gradient. If not passed in, a learning rate is estimated. Setting this parameter manually is highly recommended.
  - convergence: float
    optional, default=1e-5. Convergence criterion. If the change of the cost function between the current and previous iteration is smaller than this value, the parameter set is deemed "converged".
- Returns: numpy array. Parameter (theta) array.
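Batch gradient descent with the same knobs (max_cycle, alpha, convergence) can be sketched for the linear case as follows (a hypothetical helper, not the package's train()):

```python
import numpy as np

def train_linear(x, y, alpha=0.1, max_cycle=int(1e5), convergence=1e-5):
    """Batch gradient descent for linear regression.
    x: [examples, features] with a bias column included; y: [examples]."""
    theta = np.zeros(x.shape[1])
    prev_cost = np.inf
    for _ in range(int(max_cycle)):
        h = x @ theta                      # hypothesis on the whole set
        grad = x.T @ (h - y) / len(y)      # gradient of the mean-squared cost
        theta -= alpha * grad
        cost = np.mean((h - y) ** 2) / 2.0
        if abs(prev_cost - cost) < convergence:
            break                          # cost change below the criterion
        prev_cost = cost
    return theta

# Recover y = 1 + 2*x from noiseless data:
xs = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(xs), xs])
theta = train_linear(X, 1.0 + 2.0 * xs, alpha=0.5, convergence=1e-12)
print(np.round(theta, 3))  # approximately [1. 2.]
```

Too large an alpha makes the iteration diverge, too small makes it crawl, which is why the documentation recommends setting it manually.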
- AnalysisPackage.LinearRegression.predict(hyp_f=None, x=None, theta=None, feature_norm=None, array_mean=None, xstd=None)
This method uses the optimized theta array to predict y values by applying the hypothesis function, h.
- Parameters:
  - hyp_f: callable
    optional, default=None. Hypothesis function. If not passed in, whatever function was initialized previously will be used.
  - x: numpy array
    optional, default=None. Feature array. Unless this method is used to check whether the training set can be reproduced by the optimized parameter array, the feature array for the actual problem should be passed in.
  - theta: numpy array
    optional, default=None. Parameter array. If not passed in, whatever was obtained during the previous run of the AnalysisPackage.LinearRegression.train() method will be used.
  - feature_norm: bool
    optional, default=None. Turn on feature normalization. If the feature array passed in is already normalized, or feature normalization was not done when training the parameters, this needs to be turned off. When this parameter is not specified, the method assumes the same procedure should be done as was done during initialization.
- Returns: numpy array. Answer array, y. The answers predicted by the optimized parameter array based on the input feature array.
- Example
Here is a "real life" example where we attempt to decide which one of a few pre-determined structures a certain molecular structure is "most similar" to. The model molecule in this case is methane. There are five pre-determined structures with different symmetry: Td, D2d, C2v, C3v, and Cs.
Step 1: recognize the type of machine learning problem. Since we have pre-determined structures, these can serve as the training set, i.e. the data set for which we do have correct answers. This is called "supervised machine learning". Since we would like to classify a certain structure into 5 different groups, this is a multi-class classification problem. The easiest approach is the "one-vs-all" algorithm, in which, for each class, we use 1, i.e. True, to represent that class and 0, i.e. False, to represent all other classes, resulting in 5 different classifiers for our 5-class classification problem. The prediction is a value between 0 and 1, i.e. the probability of one class versus all other classes, for each class. The class with the highest probability is then taken as the final classification.
Step 2: decide the features. Intuitively, we noticed that there are patterns unique to each symmetry among the 6 H-C-H bond angles and 4 C-H bond lengths. These are the initial features. We have an additional function, AnalysisPackage.LinearRegression.gen_feature_TdD2dC2vC3vCs() to map these 10 features to 25 features in order to represent those recognized patterns. These are basically standard deviations and/or average values among different subgroups of the 6 H-C-H bond angles and 4 C-H bond lengths.
Step 3: decide the algorithm to optimize the parameters. Here we simply use the easiest to implement: gradient descent.
Step 4: train the parameters, i.e. obtain 5 different sets of parameters corresponding to 5 classifiers.
Step 5: predict the answers, i.e. predict the probabilities of the structure being one class vs. all other classes; then choose the class that has the highest probability.
The following code implements these steps:
```python
import trjproc
import numpy as np

# Training set unprocessed features:
geom_Td = [109.47]*6 + [1.09188]*4
geom_D2d = [96.4]*4 + [141.0]*2 + [1.1139]*4
geom_C2v = [54.6] + [114.2]*4 + [125.0] + [1.0782]*2 + [1.1745]*2
geom_C3v = [95.4]*3 + [119.1]*3 + [1.0829]*3 + [1.3758]
geom_Cs = [91.6]*2 + [101.3]*2 + [141.0]*2 + [1.1012, 1.1123, 1.1123, 1.1250]
data_raw = np.stack([geom_Td, geom_D2d, geom_C2v, geom_C3v, geom_Cs])  # Stack them to vectorize the run

# The following is just a 5 by 5 identity matrix, corresponding to five classifiers and five classes.
y = np.eye(5, dtype=float)

# Generate the features from bond angles and bond lengths.
x_rawfeature = trjproc.AnalysisPackage.LinearRegression.gen_feature_TdD2dC2vC3vCs(data_raw)
x = np.array([x_rawfeature, ]*5)  # Replicate x_rawfeature five times for the five classifiers

# Initialize LinearRegression instance. Feature scaling and random parameter initialization are done by
# default and the results are automatically stored in respective attributes.
lr = trjproc.AnalysisPackage.LinearRegression(x=x, y=y, cost_f="logistic", hyp_f="logistic")

# Now train the parameters. The choice of alpha (learning rate) may need some tweaking before a satisfying
# convergence is achieved.
theta = lr.train(alpha=0.01)

# Load some test data (this example is the bond angles and bond lengths along a BOMD simulation trajectory).
test_data = np.loadtxt("example_angle_bonds.txt", skiprows=1)

# Generate the features from bond angles and bond lengths similarly.
test_rawfeature = lr.gen_feature_TdD2dC2vC3vCs(test_data)

# Predict the answers.
prediction_test = lr.predict(x=np.array([test_rawfeature, ] * 5), theta=theta)

# Take the class with the highest probability among the five classifiers as the correct answer: 1 for Td,
# 2 for D2d, 3 for C2v, 4 for C3v, 5 for Cs, in the same order as was set up initially.
answers = np.argmax(prediction_test, axis=0) + 1
```