API

The main BUML (Bottom Up Manifold Learning) class is as follows.

class pyLDLE2.buml_.BUML(d=2, local_opts={}, intermed_opts={}, global_opts={}, vis_opts={}, n_proc=1, exit_at=None, verbose=False, debug=False)[source]

Bottom-up manifold learning.

Parameters:
  • d (int) – Intrinsic dimension of the manifold.

  • local_opts (dict) – Options for local views construction. The key-value pairs provided override the ones in default_local_opts.

  • intermed_opts (dict) – Options for intermediate views construction. The key-value pairs provided override the ones in default_intermed_opts.

  • global_opts (dict) – Options for global views construction. The key-value pairs provided override the ones in default_global_opts.

  • vis_opts (dict) – Options for visualization. The key-value pairs provided override the ones in default_vis_opts.

  • n_proc (int) – The number of processors to use. Defaults to approximately 3/4th of the available processors.

  • verbose (bool) – print logs if True.

  • debug (bool) – saves intermediary objects/data for debugging.

fit(X=None, d_e=None, ddX=None)[source]

Run the algorithm. Either X or d_e must be supplied.

Parameters:
  • X (array shape (n_samples, n_features)) – A 2d array containing data representing a manifold.

  • d_e (array shape (n_samples, n_samples)) – A square numpy matrix representing the geodesic distance between each pair of points on the manifold.

  • ddX (array shape (n_samples)) – Optional. A 1d array representing the distance of the points from the boundary OR a 1d array such that ddX[i] = 0 if the point i i.e. X[i,:] is on the boundary.

Returns:

y – The embedding of the data in lower dimension.

Return type:

array shape (n_samples, d)

Hyperparameters: A description of all the available hyperparameters and their default values are provided below.

class pyLDLE2.buml_.get_default_local_opts(algo='LPCA', metric0='euclidean', k=28, k_nn=49, k_tune=7, metric='euclidean', update_metric=True, radius=0.5, U_method='k_nn', gl_type='unnorm', tuning='self', N=100, scale_by='gamma', Atilde_method='LDLE_1', alpha=1, max_iter=300, reg=0.0, p=0.99, tau=50, delta=0.9, lkpca_kernel='linear', to_postprocess=True, pp_n_thresh=32, lambda1_init=8, lambda1_decay=0.75, lambda1_min=0.001, power=5, max_sparsity=0.9, doubly_stochastic_max_iter=0)[source]

Sets and returns a dictionary of default_local_opts, (experimental) options are work in progress.

Parameters:
  • algo (str) – The algorithm to use for the construction of local views. Options are ‘LDLE’, ‘LPCA’ and ‘LKPCA’. Experimental options are ‘L1PCA’, ‘SparsePCA’, ‘LISOMAP’ and ‘RPCA-GODEC’.

  • metric0 (str) – The metric to be used for finding nearest neighbors. (experimental) If the data is a distance matrix then use ‘precomputed’.

  • k (int) – The size of the local view per point.

  • k_nn (int) – This is used for nearest neighbor graph construction. The neighborhood size is kept bigger than the size of the local views. This is because, during clustering where we construct intermediate views, we need access to distances between points in a bigger neighborhood around each point. The value of k_nn must be >= k. Then the size of neighborhood is set to max(k_nn, eta_max * k) where eta_max is in intermed_opts. This value is used to construct the nearest neighbor graph using metric0.

  • k_tune (int) – Distance to k_tune-th nearest neighbor is used as the local scaling factor. This must be less than k_nn. This is used to construct self-tuning kernel in LDLE.

  • metric (str) – The local metric on the data. (experimental) If the data is a distance matrix then use ‘precomputed’.

  • update_metric (bool) – If True, recomputes neighbor distances using metric. If False, the neighbor distances computed using metric0 are retained. Note that neighbor indices are not changed.

  • radius (float) – Radius of the balls to be used to compute nearest neighbors. (experimental)

  • U_method (str) – Method to use to construct local views. Options are ‘k_nn’ and ‘radius’.

  • gl_type (str) – The type of graph Laplacian to construct in LDLE. Options are ‘unnorm’ for unnormalized, ‘symnorm’ for symmetric normalized, ‘diffusion’ for density-corrected normalized.

  • tuning (str) – The tuning method for the construction of kernel in LDLE. Options are: ‘self’, ‘solo’, ‘median’ or None.

  • N (int) – Number of smallest non-trivial eigenvectors of the graph Laplacian to be used for the construction of local views in LDLE.

  • scale_by (str) – To scale the eigenvectors in LDLE. Options are ‘none’, ‘gamma’ or a positive real value. Default is ‘gamma’. Using numerical value with gl_type as ‘diffusion’ uses diffusion maps to construct local views where the numeric value is used as the power of the eigenvalues.

  • Atilde_method (str) – Method to use for conmputing Atilde in LDLE. Options are ‘FEM’ for finite element method, and ‘FeymanKac’ for Feyman-Kac formula based.

  • p (float) – A hyperparameter used in computing Atilde. The value must be in (0,1).

  • tau (int) – A hyperparameter used in computing parameterizations of local views in LDLE. The value must be in (0,100).

  • delta (float) – A hyperparameter used in computing parameterizations of local views in LDLE. The value must be in (0,1).

  • alpha (float) – Step size in Riemannian gradient descent when using Smooth-LPCA (experimental).

  • max_iter (int) – Max number of Riemannian gradient descent steps in Smooth-LPCA (experimental).

  • reg (float) – Desired regularization (Smoothness) in Smooth-LPCA (experimental).

  • lkpca_kernel (str) – The kernel for local KPCA.

  • to_postprocess (bool) – If True the local parameterizations are postprocessed to fix anamolous parameterizations leading to high distortion of the local views.

  • pp_n_thresh (int) – Threshold to use multiple processors or a single processor while postprocessing the local parameterizations. A small value such as 32 leads to faster postprocessing.

  • lambda1_init (float) – Initialization of lambda1 for RPCA-GODEC (experimental).

  • lambda1_decay (float) – Decay of lambda1 for RPCA-GODEC (experimental).

  • lambda1_min (float) – Minimum allowed valued of lambda1 for RPCA-GODEC (experimental).

  • power (int) – Number of power iterations for initialization in RPCA-GODEC (experimental).

  • max_sparsity (float) – maximum fraction of the desired sparsity (experimental).

  • doubly_stochastic_max_iter (int) – max number of sinkhorn iterations for converting self tuned kernel in LDLE to doubly stochastic.

class pyLDLE2.buml_.get_default_intermed_opts(algo='best', n_times=4, eta_min=5, eta_max=25, len_S_thresh=256)[source]

Sets and returns a dictionary of default_intermed_opts, (experimental) options are work in progress.

Parameters:
  • algo (str) – Algo to use. Options are ‘mnm’ for match and merge, ‘best’ for the optimal algorithm (much slower but creates intermediate views with lower distortion). ‘mnm’ is (experimental).

  • n_times (int) – (experimental) Hypereparameter for ‘mnm’ algo. Number of times to match and merge. If n is the #local views then the #intermediate views will be approximately n/2^{n_times}.

  • eta_min (int) – Hyperparameter for ‘best’ algo. Minimum allowed size of the clusters underlying the intermediate views. The values must be >= 1.

  • eta_max (int) – Hyperparameter for ‘best’ algo. Maximum allowed size of the clusters underlying the intermediate views. The value must be > eta_min.

  • len_S_thresh (int) – Threshold on the number of points for which the costs are to be updated, to invoke multiple processors. Used with ‘best’ algo only.

class pyLDLE2.buml_.get_default_global_opts(align_transform='rigid', to_tear=True, nu=3, max_iter=20, color_tear=True, vis_before_init=False, compute_error=False, init_algo_name='procrustes', align_w_parent_only=True, refine_algo_name='rgd', max_internal_iter=100, alpha=0.3, eps=1e-08, add_dim=False, beta={'align': None, 'repel': 1}, repel_by=0.0, repel_decay=1.0, n_repel=0, far_off_points_type='reuse_fixed', patience=5, err_tol=1e-06, tear_color_method='spectral', tear_color_eig_inds=[1, 0, 2], metric='euclidean', color_cutoff_frac=0.001, color_largest_tear_comp_only=False, n_forced_clusters=1)[source]

Sets and returns a dictionary of default_global_opts, (experimental) options are work in progress.

Parameters:
  • align_transform (str) – The algorithm to use for the alignment of intermediate views. Options are ‘rigid’ and ‘affine’. (experimental) If ‘affine’ is chosen then none of the following hypermateters are used.

  • to_tear (bool) – If True then tear-enabled alignment of views is performed.

  • nu (int) – The ratio of the size of local views in the embedding against those in the data.

  • max_iter (int) – Number of iterations to refine the global embedding for.

  • color_tear (bool) – If True, colors the points across the tear with a spectral coloring scheme.

  • vis_before_init (bool) – If True, plots the global embedding before alignment begins. This is same as just plotting all the intermediate views without alignment.

  • compute_error (bool) – If True the alignment error is computed at each iteration of the refinement, otherwise only at the last iteration.

  • init_algo_name (str) – The algorithm used to compute initial global embedding by aligning the intermediate views. Options are ‘procrustes’ for spanning-tree-based-procrustes alignment, ‘spectral’ for spectral alignment (ignores to_tear), ‘sdp’ for semi-definite programming based alignment (ignores to_tear).

  • align_w_parent_only (bool) – If True, then aligns child views the parent views only in the spanning-tree-based-procrustes alignment.

  • refine_algo_name (str) – The algorithm used to refine the initial global embedding by refining the alignment between intermediate views. Options are ‘gpa’ for Generalized Procustes Analysis (GPA) based alignment, ‘rgd’ for Riemannian gradient descent based alignment, ‘spectral’ for spectral alignment, ‘gpm’ for generalized power method based alignment, ‘sdp’ for semi-definite programming based alignment. Note that sdp based alignment is very slow. Recommended options are ‘rgd’ with an appropriate step size (alpha) and ‘gpm’.

  • max_internal_iter (int) – The number of internal iterations used by the refinement algorithm, for example, RGD updates. This is ignored by ‘spectral’ refinement.

  • alpha (float) – The step size used in the Riemannian gradient descent when the refinement algorithm is ‘rgd’.

  • eps (float) – The tolerance used by sdp solver when the init or refinement algorithm is ‘sdp’.

  • add_dim (bool) – (experimental) add an extra dimension to intermediate views.

  • beta (dict) – (experimental) Hyperparameters used for computing the alignment weights and the repulsion weights. Form is {‘align’: float, ‘repel’: float}. Default is {‘align’: None, ‘repel’: None} i.e. unweighted.

  • repel_by (float) – If positive, the points which are far off are repelled away from each other by a force proportional to this parameter. Ignored when refinement algorithm is ‘gpa’.

  • repel_decay (float) – Multiply repel_decay with current value of repel_by after every iteration.

  • n_repel (int) – The number of far off points repelled from each other.

  • far_off_points_type ('fixed' or 'random') – Whether to use the same points for repulsion or randomize over refinement iterations. If ‘reuse’ is in the string, for example ‘fixed_reuse’, then the points to be repelled are the same across iterations.

  • patience (int) – The number of iteration to wait for error below tolerance to persist before stopping the refinement.

  • err_tol (float) – The tolerance level for the alignment error.

  • tear_color_method (str) – Method to color the tear. Options are ‘spectral’ or ‘heuristic’. The latter keeps the coloring of the tear same accross the iterations. Recommended option is ‘spectral’.

  • tear_color_eig_inds (int) – Eigenvectors to be used to color the tear. The value must either be a non-negative integer or it must be a list of three non-negative integers [R,G,B] representing the indices of eigenvectors to be used as RGB channels for coloring the tear. Higher values result in more diversity of colors. The diversity saturates after a certain value.

  • color_cutoff_frac (float) – If the number of points in a tear component is less than (color_cutoff_frac * number of data points), then all the points in the component will be colored with the same color.

  • color_largest_tear_comp_only (bool) – If True then the largest tear components is colored only.

  • metric (str) – metric assumed on the global embedding. Currently only euclidean is supported.

  • n_forced_clusters (str) – (experimental) Minimum no. of clusters to force in the embeddings.

class pyLDLE2.buml_.get_default_vis_opts(save_dir='', cmap_interior='summer', cmap_boundary='jet', c=None)[source]

Sets and returns a dictionary of default_vis_opts.

Parameters:
  • save_dir (str) – The directory to save the plots in.

  • cmap_interior (str) – The colormap to use for the interior of the manifold.

  • cmap_boundary (str) – The colormap to use for the boundary of the manifold.

  • c (array shape (n_samples)) – The labels for each point to be used to color them.