ulamdyn.unsup_models package
Submodules
ulamdyn.unsup_models.dist_metrics module
This module provides a set of methods to calculate distances between pairs of molecular geometries.
- ulamdyn.unsup_models.dist_metrics.calc_rmsd(x1: ndarray, x2: ndarray) float
Calculate the root-mean-square deviation (RMSD) between two aligned geometries.
- Parameters:
x1 (numpy.ndarray) – Matrix of Cartesian coordinates for geometry 1.
x2 (numpy.ndarray) – Matrix of Cartesian coordinates for geometry 2.
- Returns:
RMSD value computed for the two input geometries.
- Return type:
float
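As an illustration of the quantity described above, the following is a minimal NumPy sketch of an RMSD between two pre-aligned geometries. The function name `rmsd` and the normalization (mean over atoms of the squared displacement) are assumptions made for this example; the exact convention used by calc_rmsd is not stated here.

```python
import numpy as np


def rmsd(x1: np.ndarray, x2: np.ndarray) -> float:
    """RMSD between two pre-aligned (n_atoms, 3) coordinate matrices.

    Hypothetical helper for illustration, not the ulamdyn implementation.
    """
    if x1.shape != x2.shape:
        raise ValueError("geometries must have the same shape")
    diff = x1 - x2
    # Squared displacement per atom, averaged over atoms, then square root.
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


geom_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
geom_b = geom_a + np.array([1.0, 0.0, 0.0])  # rigid shift of 1 along x
value = rmsd(geom_a, geom_b)
```

The sketch assumes the two geometries are already superimposed; no rotation or translation fit is performed.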
ulamdyn.unsup_models.geom_space module
Module to perform dimension reduction and clustering on geometry space.
- class ulamdyn.unsup_models.geom_space.ClusterGeoms(data, dt=None, indices=None, n_samples=None, scaler=None, random_state=51, n_cpus=-1, verbosity=0)
Bases:
Utils
Class to find groups of similar geometries in MD trajectory data.
- gaussian_mixture(n_clusters=5, covariance='full', tol=0.0001, n_init=10, max_iter=500, init='k-means++', save_model=True)
Perform probabilistic clustering in geometry space with a Gaussian mixture model (GMM).
- Parameters:
n_clusters (int, optional) – The number of clusters to find, which corresponds to the number of Gaussian components in the mixture. Defaults to 5.
covariance (str, optional) – String describing the type of covariance parameters to use, default is “full”. Acceptable values are “full”, “tied”, “diag”, “spherical” (equivalent to K-Means).
tol (float, optional) – Convergence criteria of the lower bound average gain, below which the EM iterations stop, defaults to 1e-4.
n_init (int, optional) – The number of initializations to perform, where best results are kept. The default is 10.
max_iter (int, optional) – The number of EM iterations to perform. Defaults to 500.
init (str, optional) – The method used to initialize the weights, the means and the precisions, default is “k-means++”. Acceptable strings are “k-means”, “k-means++”, “random”, or “random_from_data”.
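To make the parameters above concrete, here is a minimal sketch that calls scikit-learn's GaussianMixture directly on synthetic data. It assumes, without confirmation from this page, that gaussian_mixture forwards its arguments to that estimator in a similar way; the data and variable names are invented for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for geometry descriptors: two well-separated groups.
rng = np.random.default_rng(51)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(40, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(40, 2)),
])

gmm = GaussianMixture(
    n_components=2,          # plays the role of n_clusters
    covariance_type="full",  # plays the role of covariance
    tol=1e-4,
    n_init=10,
    max_iter=500,
    init_params="kmeans",    # scikit-learn's spelling of k-means initialization
    random_state=51,
)
gmm.fit(X)

labels = gmm.predict(X)       # hard assignment: most probable component
probs = gmm.predict_proba(X)  # soft assignment: one probability per component
```

Unlike K-Means, the mixture model yields a membership probability for every component, which is what makes the clustering probabilistic.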
- hierarchical(n_clusters=5, affinity='euclidean', connectivity=None, linkage='complete', distance_threshold=None, save_model=True)
Perform an agglomerative hierarchical cluster analysis in geometry space.
- Parameters:
n_clusters (int, optional) – The number of clusters to find. Must be None if distance_threshold is not None. Defaults to 5.
affinity (str, optional) – Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method, defaults to “euclidean”.
connectivity (array-like or callable, optional) – Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as derived from kneighbors_graph. Default is None, meaning that hierarchical clustering algorithm is unstructured.
linkage (str, optional) – Linkage criterion used to build the tree. It determines which distance to use between sets of observations. The algorithm merges the pairs of clusters that minimize this criterion. The options are:
- ‘ward’ -> minimizes the variance of the clusters being merged.
- ‘average’ -> uses the average of the distances of each observation of the two sets.
- ‘complete’ or ‘maximum’ -> uses the maximum distances between all observations of the two sets.
- ‘single’ -> uses the minimum of the distances between all observations of the two sets.
The default is “complete”.
distance_threshold (float, optional) – The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True. Defaults to None.
save_model (bool, optional) – Store the trained parameters of the model in a binary file. Defaults to True.
- Returns:
Dataframe of shape (n_samples,) with cluster labels for each data point.
- Return type:
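The two ways of cutting the tree described above (a fixed number of clusters versus a distance threshold) can be sketched with SciPy's hierarchy routines. This is an illustration on synthetic data using SciPy rather than the hierarchical method itself; the threshold of 3.0 is an arbitrary value chosen for this toy dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for geometry descriptors: two compact, distant groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.2, size=(20, 3)),
    rng.normal(4.0, 0.2, size=(20, 3)),
])

# Build the merge tree with complete linkage on Euclidean distances.
Z = linkage(X, method="complete", metric="euclidean")

# Option 1: request a fixed number of clusters (analogous to n_clusters).
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

# Option 2: stop merging above a linkage distance (analogous to
# distance_threshold, with n_clusters left unset).
labels_by_distance = fcluster(Z, t=3.0, criterion="distance")
```

Only one of the two criteria is used per cut, which mirrors the rule that n_clusters must be None when distance_threshold is given.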
- kmeans(n_clusters=5, init='k-means++', n_init=100, max_iter=1000, convergence=1e-06, save_model=True)
Perform K-Means clustering in geometry space.
- Parameters:
n_clusters (int, list or str, optional) – Specifies the number of clusters (and centroids) to form. Defaults to 5. If a list is provided, K-Means will run for each value in the list. If set to ‘best’, the algorithm will perform multiple runs with n_clusters ranging from 2 to 15. The best result is selected based on silhouette and Calinski-Harabasz scores.
init (str or array, optional) – Method for initialization:
- ‘k-means++’ -> selects initial cluster centers for K-means clustering in a smart way to speed up convergence.
- ‘random’ -> choose n_clusters observations (rows) at random from data for the initial centroids.
- If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
Defaults to “k-means++”.
n_init (int, optional) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of the loss function. Defaults to 100.
max_iter (int, optional) – Maximum number of iterations of the k-means algorithm for a single run, defaults to 1000.
convergence (float, optional) – Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence. Defaults to 1e-06.
save_model (bool, optional) – Store the trained parameters of the model in a binary file, defaults to True.
- Returns:
Dataframe of shape (n_samples,) with cluster labels for each data point.
- Return type:
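The ‘best’ option for n_clusters can be pictured as a scan over candidate values scored with the silhouette and Calinski-Harabasz indices. The sketch below uses scikit-learn directly on synthetic data. How the two scores are combined is an assumption of this example (silhouette first, Calinski-Harabasz as tie-breaker), and the scan is shortened to 2 through 6 for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for geometry descriptors: three separated blobs.
rng = np.random.default_rng(51)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(
        n_clusters=k,
        init="k-means++",
        n_init=10,
        max_iter=1000,
        tol=1e-6,  # plays the role of the convergence argument
        random_state=51,
    )
    labels = model.fit_predict(X)
    # Higher is better for both scores.
    scores[k] = (
        silhouette_score(X, labels),
        calinski_harabasz_score(X, labels),
    )

# Assumed selection rule: compare (silhouette, Calinski-Harabasz) tuples.
best_k = max(scores, key=lambda k: scores[k])
```

For real trajectory data the two scores may disagree, so inspecting the full `scores` table is usually more informative than the single selected value.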
- spectral(n_clusters=5, n_components=10, n_init=100, affinity='rbf', gamma=0.01, n_neighbors=10, degree=3, coef0=1, kernel_params=None, save_model=True)
Apply clustering to a projection of the normalized Laplacian.
Note that Spectral Clustering is a highly expensive method due to the computation of the affinity matrix. Hence, this method is recommended only for small to medium size datasets (n_samples < 10000).
Note
This method is equivalent to kernel K-means (DOI: 10.1145/1014052.1014118). Spectral clustering is recommended for non-linearly separable datasets, where the individual clusters have a highly non-convex shape.
- Parameters:
n_clusters (int, list or str, optional) – The number of clusters to form, which in this case corresponds to the dimension of the projection subspace. Defaults to 5. If a list is passed, spectral clustering will be run for all n_clusters in the list, whereas if the argument is equal to ‘best’, consecutive runs will be performed with n_clusters varying in the range [2, 15]. In both cases, the final result will be the output labels with the best clustering performance according to the silhouette and Calinski-Harabasz scores.
n_components (int, optional) – Number of eigenvectors to use for the spectral embedding. Defaults to 10.
n_init (int, optional) – Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia. Only used if assign_labels=’kmeans’. Defaults to 100.
affinity (str or callable, optional) – Method used to construct the affinity matrix. The available options are:
- ‘nearest_neighbors’: construct the affinity matrix by computing a graph of nearest neighbors.
- ‘rbf’: construct the affinity matrix using a radial basis function (RBF) kernel.
- ‘precomputed_nearest_neighbors’: interpret X as a sparse graph of precomputed distances, and construct a binary affinity matrix from the n_neighbors nearest neighbors of each instance.
- one of the kernels supported by pairwise_kernels.
The default method is “rbf”.
gamma (float, optional) – Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity=’nearest_neighbors’. Defaults to 0.01.
n_neighbors (int, optional) – Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity=’rbf’. Defaults to 10.
degree (int, optional) – Degree of the polynomial kernel. Ignored by other kernels. Defaults to 3.
coef0 (int, optional) – Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels. Defaults to 1.
kernel_params (dict or str, optional) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. Defaults to None.
save_model (bool, optional) – Store the trained parameters of the model in a binary file. Defaults to True.
- Returns:
Dataframe of shape (n_samples,) with cluster labels for each data point.
- Return type:
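The phrase "projection of the normalized Laplacian" can be unpacked with a small NumPy sketch: build an RBF affinity matrix, form the symmetric normalized Laplacian, and keep its lowest eigenvectors as the embedding on which K-Means would then be run. The function name `spectral_embedding` is hypothetical, and library implementations differ in details such as eigenvector scaling.

```python
import numpy as np


def spectral_embedding(X, n_components, gamma=0.01):
    """Lowest eigenpairs of the normalized Laplacian of an RBF affinity.

    Hypothetical helper for illustration only.
    """
    # Pairwise squared Euclidean distances, shape (n_samples, n_samples).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # RBF affinity: exp(-gamma * ||x_i - x_j||^2).
    affinity = np.exp(-gamma * sq_dists)
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^(-1/2) A D^(-1/2).
    laplacian = (
        np.eye(len(X)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    )
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return eigvals[:n_components], eigvecs[:, :n_components]


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
eigvals, embedding = spectral_embedding(X, n_components=3, gamma=0.01)
```

The dense (n_samples, n_samples) affinity matrix built here is the reason the method scales poorly with the number of samples.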
- class ulamdyn.unsup_models.geom_space.DimensionReduction(data, dt=None, n_samples=None, scaler=None, random_state=42, n_cpus=-1)
Bases:
Utils
Find low dimensional representation of MD trajectories data.
- isomap(n_components=2, n_neighbors=30, neighbors_algorithm='auto', metric=None, p=2, metric_params=None, calc_error=False)
Perform a nonlinear dimensionality reduction with Isometric Mapping (Isomap).
- Parameters:
n_components (int, optional) – Number of coordinates (features) for the low-dimensional manifold, defaults to 2.
n_neighbors (int, optional) – Number of neighbors to consider around each point, defaults to 30.
neighbors_algorithm (str, optional) – Method used for nearest neighbors search, defaults to “auto”.
metric (str or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. Defaults to “cosine”.
p (int, optional) – Power parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. p = 1 is equivalent to the Manhattan distance (l1) and p = 2 to the Euclidean distance (l2). For arbitrary p, the Minkowski distance (l_p) is used. Defaults to 2.
metric_params (dict, optional) – Additional keyword arguments for the metric function. Defaults to None.
calc_error (bool, optional) – If True, the reconstruction error between the original and the projected data will be calculated. Defaults to False.
- Returns:
a new dataset with the transformed values where the coordinates of the low-dimensional manifold are stored in columns.
- Return type:
- kpca(n_components=2, kernel='rbf', gamma=None, degree=4, coef0=1, kernel_params=None, alpha=1.0, fit_inverse_transform=False)
Perform a nonlinear dimensionality reduction using kernel PCA.
- Parameters:
n_components (int, optional) – Number of components (features) to keep after KPCA transformation, defaults to 2.
kernel (str, optional) – Kernel function used in the transformation. The possible values are ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’ or ‘precomputed’, defaults to “rbf”.
gamma (float, optional) – Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. If gamma is None, then it is set to 1/n_features. Defaults to None.
degree (int, optional) – Degree of polynomial kernel. Ignored by other kernels. Defaults to 4.
coef0 (int, optional) – Independent term in poly and sigmoid kernels. Ignored by other kernels. Defaults to 1.
kernel_params (dict, optional) – Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. Defaults to None.
alpha (float, optional) – Hyperparameter of the ridge regression that learns the inverse transform (when fit_inverse_transform=True), defaults to 1.0.
fit_inverse_transform (bool, optional) – If True, learn the inverse transform that maps points from the reduced space back to the original feature space. Defaults to False.
- Returns:
a new dataset with the transformed values where the selected components are stored in columns.
- Return type:
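The interplay between alpha and fit_inverse_transform is easiest to see with scikit-learn's KernelPCA used directly. This sketch assumes, without confirmation from this page, that kpca passes these arguments through to that estimator; the data are random and serve only to show the shapes involved.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Synthetic stand-in for geometry descriptors.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 6))

kpca = KernelPCA(
    n_components=2,
    kernel="rbf",
    gamma=None,                  # None lets scikit-learn use 1 / n_features
    alpha=1.0,                   # ridge penalty, used only for the inverse map
    fit_inverse_transform=True,  # learn the map back to the input space
)
X_low = kpca.fit_transform(X)           # shape (n_samples, n_components)
X_back = kpca.inverse_transform(X_low)  # approximate pre-images
```

Without fit_inverse_transform=True the inverse map is not learned, so the call to inverse_transform would not be available.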
- pca(n_components=2, calc_error=False, save_errors=False)
Perform linear dimension reduction with principal component analysis (PCA).
Note
By default the percentage of variance explained by each of the selected components will be printed after the PCA analysis.
- Parameters:
n_components (int, optional) – Number of principal components to keep, defaults to 2.
calc_error (bool, optional) – If True, the reconstruction error between the original and the projected data will be calculated. Defaults to False.
save_errors (bool, optional) – If True, save the reconstruction error calculated for each sample to a csv file, defaults to False.
- Returns:
a new dataset with the transformed values where the selected components are stored in columns.
- Return type:
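The explained variance and the per-sample reconstruction error mentioned above can be illustrated with a short NumPy sketch based on the singular value decomposition. The helper `pca_project` is hypothetical, and the error definition used here (root-mean-square difference per sample) is an assumption; this page does not state which definition pca uses.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project X on its leading principal components (illustrative helper)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_ratio = (S ** 2) / (S ** 2).sum()
    scores = Xc @ Vt[:n_components].T
    # Map the scores back to the original space and compare with the input.
    reconstructed = scores @ Vt[:n_components] + mean
    errors = np.sqrt(((X - reconstructed) ** 2).mean(axis=1))
    return scores, explained_ratio[:n_components], errors


rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

scores, ratio, errors = pca_project(X, n_components=2)
_, _, full_errors = pca_project(X, n_components=5)  # nothing discarded
```

Keeping all components reproduces the input exactly, so the reconstruction error measures only what the discarded components carried.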
- tsne(n_components=2, perplexity=40.0, learning_rate=180.0, n_iter=2000, n_iter_without_progress=400, metric='euclidean', init='pca', verbose=1, method='barnes_hut')
Perform the t-distributed Stochastic Neighbor Embedding analysis.
- Parameters:
n_components (int, optional) – Number of coordinates (features) for the low-dimensional embedding, defaults to 2.
perplexity (float, optional) – This hyperparameter balances attention between local and global aspects of the data and can be read as a guess of the number of close neighbors each point has. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can lead to significantly different results. Defaults to 40.
learning_rate (float, optional) – The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum, increasing the learning rate may help. Defaults to 180.0.
n_iter (int, optional) – Maximum number of iterations for the optimization. Should be at least 250. Defaults to 2000.
n_iter_without_progress (int, optional) – Maximum number of iterations without progress before we abort the optimization, used after 250 initial iterations with early exaggeration. Note that progress is only checked every 50 iterations so this value is rounded to the next multiple of 50. Defaults to 400.
metric (str or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them. Defaults to “euclidean”, which is interpreted as squared Euclidean distance.
init (str, optional) – Initialization of embedding. Possible options are ‘random’, ‘pca’, and a numpy array of shape (n_samples, n_components). PCA initialization cannot be used with precomputed distances and is usually more globally stable than random initialization. Defaults to “pca”.
verbose (int, optional) – Verbosity level. Defaults to 1.
method (str, optional) – By default the gradient calculation algorithm uses the Barnes-Hut approximation, running in O(N log N) time. method=’exact’ runs the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples. Defaults to “barnes_hut”.
- Returns:
a new dataset with the transformed values where the coordinates of the low-dimensional manifold are stored in columns.
- Return type:
ulamdyn.unsup_models.traj_space module
Module used to perform clustering analysis on trajectory space.
- class ulamdyn.unsup_models.traj_space.ClusterTrajs(data, dt=None, scaler=None, random_state=42, n_cpus=-1, verbosity=0)
Bases:
Utils
Class used to find groups of similar trajectories in NAMD data.
- kmeans(n_clusters=3, metric='dtw', metric_params=None, n_init=5, max_iter=100, convergence=1e-06, save_model=True)
K-means clustering to group similar MD trajectories.
- Parameters:
n_clusters (int, optional) – Number of clusters to form, defaults to 3.
metric (str, optional) – Metric to be used for both cluster assignment and barycenter computation. Options: “euclidean”, “dtw”, “softdtw”. Defaults to “dtw”.
metric_params (dict or None, optional) – Parameter values for the chosen metric. Defaults to None.
n_init (int, optional) – Number of times the K-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia. Defaults to 5.
max_iter (int, optional) – Maximum number of iterations of the k-means algorithm for a single run. Defaults to 100.
convergence (float, optional) – Inertia variation threshold. If at some point, inertia varies less than this threshold between two consecutive iterations, the model is considered to have converged and the algorithm stops. Defaults to 1e-6.
save_model (bool, optional) – Store the trained parameters of the model in a binary file. Defaults to True.
- Returns:
Dataframe of shape (n_trajs,) with cluster labels assigned to each trajectory.
- Return type:
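The default metric, “dtw” (dynamic time warping), compares trajectories that evolve at different speeds by allowing time steps to be stretched before distances are summed. The following is a minimal NumPy sketch of the classic dynamic-programming recurrence; the function name `dtw_distance` is hypothetical and the code is not the tslearn implementation.

```python
import numpy as np


def dtw_distance(a, b):
    """DTW distance between two series of shape (n_steps,) or (n_steps, d).

    Illustrative helper: square root of the minimal summed squared
    distance over all monotonic alignments of the two series.
    """
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = ((a[i - 1] - b[j - 1]) ** 2).sum()
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = d + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return float(np.sqrt(cost[n, m]))


# Same shape, but the second series lingers one extra step at the start.
same_shape = dtw_distance([0.0, 1.0, 2.0], [0.0, 0.0, 1.0, 2.0])
# Constant offset of 1 at both time steps.
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

The first pair differs only in timing, so warping removes the difference entirely, whereas a plain Euclidean comparison would not even be defined for series of unequal length.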
- transform()
Convert input data to time series format (tslearn) and apply scaler methods.
- Returns:
Three-dimensional numpy array with shape (n_ts, n_steps, n_features).
- Return type:
numpy.ndarray
ulamdyn.unsup_models.utilities module
Auxiliary module to handle data preprocessing and printing.