Facility Location

class submodlib.functions.facilityLocation.FacilityLocationFunction(n, mode, separate_rep=None, n_rep=None, sijs=None, data=None, data_rep=None, num_clusters=None, cluster_labels=None, metric='cosine', num_neighbors=None, create_dense_cpp_kernel_in_python=True, pybind_mode='array')[source]

Implementation of the Facility Location submodular function (FL).

The Facility-Location function [MF90] models representation: it tries to find a representative subset of items, akin to centroids and medoids in clustering. The Facility-Location function is closely related to k-medoid clustering. While diversity only looks at the elements in the chosen subset, representativeness also accounts for their similarity with the remaining elements in the superset. Denote \(s_{ij}\) as the similarity between data points \(i\) and \(j\). We can then define

\[f(X) = \sum_{i \in V} \max_{j \in X} s_{ij}\]

For each data point \(i\) in the ground set \(V\), we compute the representative from subset \(X\) which is closest to \(i\) and add these similarities for all data points.

In a more generic setting, the set whose representation is desired (we call it represented set \(U\)) may be different from the set whose subset is desired (we call it ground set \(V\)). The expression for Facility-Location function then becomes

\[f(X) = \sum_{i \in U} \max_{j \in X} s_{ij}\]
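To make the definition concrete, the following is a small, illustrative-only numpy sketch (not part of the submodlib API; the matrix sijs and the helper name are assumptions) that evaluates \(f(X)\) from a precomputed similarity matrix.

    import numpy as np

    # Illustrative helper (not part of submodlib): f(X) = sum_i max_{j in X} s_ij,
    # where sijs is an |U| x |V| similarity matrix and X indexes its columns.
    def facility_location_value(sijs, X):
        X = list(X)
        if not X:
            return 0.0
        return float(np.sum(np.max(sijs[:, X], axis=1)))

    sijs = np.array([[1.0, 0.2, 0.5],
                     [0.2, 1.0, 0.3],
                     [0.5, 0.3, 1.0]])
    facility_location_value(sijs, {0, 2})   # 1.0 + 0.3 + 1.0 = 2.3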

An alternative, clustered implementation of Facility Location assumes a clustering of all ground set items into clusters \(C_1, \dots, C_k\); the function value is then computed over the clusters as

\[f(X) = \sum_{l \in \{1, \dots, k\}} \sum_{i \in C_l} \max_{j \in X \cap C_l} s_{ij}\]
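A corresponding sketch of the clustered evaluation (again illustration only, with assumed variable names; submodlib performs this computation internally in C++):

    import numpy as np

    # Illustrative helper: clustered Facility Location over an n x n kernel sijs,
    # where cluster_labels[i] gives the cluster of ground set item i.
    def clustered_facility_location_value(sijs, cluster_labels, X):
        labels = np.asarray(cluster_labels)
        total = 0.0
        for l in np.unique(labels):
            members = np.where(labels == l)[0]           # items i in C_l
            reps = [j for j in X if labels[j] == l]      # X intersect C_l
            if reps:                                     # empty intersection treated as contributing 0
                total += float(np.sum(np.max(sijs[np.ix_(members, reps)], axis=1)))
        return total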

Facility-Location is monotone submodular.

Note

This function requires computing \(\mathcal{O}(n^2)\) similarity values. However, as shown in [WIB14], we can approximate this with a nearest neighbor graph, which requires much less storage and also runs much faster for large ground set sizes.
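The sketch below illustrates this nearest-neighbor idea with scikit-learn; it is only a rough illustration (the distance-to-similarity mapping is an arbitrary choice here), since submodlib builds its own sparse kernel internally when mode=“sparse” and num_neighbors are given.

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.neighbors import NearestNeighbors

    # Rough illustration: keep only each item's num_neighbors most similar entries
    # instead of the full O(n^2) kernel.
    def knn_sparse_kernel(data, num_neighbors=10):
        n = data.shape[0]
        nn = NearestNeighbors(n_neighbors=num_neighbors).fit(data)
        dists, idx = nn.kneighbors(data)                 # both of shape n x num_neighbors
        sims = 1.0 / (1.0 + dists)                       # an arbitrary distance-to-similarity map
        rows = np.repeat(np.arange(n), num_neighbors)
        return csr_matrix((sims.ravel(), (rows, idx.ravel())), shape=(n, n))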

Parameters
  • n (int) – Number of elements in the ground set, must be > 0.

  • mode (string) – Can be “dense”, “sparse” or “clustered”. It specifies whether the Facility Location function should operate in dense mode (using a dense similarity kernel) or sparse mode (using a sparse similarity kernel) or clustered mode (evaluating over clusters).

  • separate_rep (bool, optional) – Specifies whether a set different from ground set should be used as represented set (whose representation is desired).

  • n_rep (int, optional) – Number of elements in the represented set if separate_rep=True.

  • sijs (numpy.ndarray or scipy.sparse.csr.csr_matrix, optional) – When separate_rep=False, this is the similarity kernel (dense or sparse) between the elements of the ground set, to be used for getting \(s_{ij}\) entries as defined above. Shape of dense kernel in this case must be n X n. When separate_rep=True, mode must be “dense” and this is the dense similarity kernel between the represented set and the ground set. Shape in this case must be n_rep X n. When sijs is not provided, it is computed internally in C++ based on the following additional parameters.

  • data (numpy.ndarray, optional) – Matrix of shape n X num_features containing the ground set data elements. data[i] should contain the num_features-dimensional features of element i. Used to compute the similarity kernel. It is ignored if sijs is provided.

  • data_rep (numpy.ndarray, optional) – Represented set data matrix (used to compute the dense similarity kernel) if separate_rep=True and when a similarity kernel is not provided.

  • num_clusters (int, optional) – Number of clusters in the ground set. Used only if mode = “clustered”. Must be provided if cluster_labels is provided. If cluster_labels is not provided, clusters will be created using sklearn’s BIRCH method. In this case if num_clusters is not provided, BIRCH will produce an optimum number of clusters.

  • cluster_labels (list, optional) – List of size n that contains the cluster label for each item in the ground set. If mode=clustered and cluster_labels is not provided, clustering is done internally using sklearn’s BIRCH.

  • metric (str, optional) – Similarity metric to be used for computing the similarity kernel. Can be “cosine” for cosine similarity or “euclidean” for similarity based on euclidean distance. Default is “cosine”.

  • num_neighbors (int, optional) – Number of neighbors applicable for the sparse similarity kernel. Must not be provided if mode is “dense”. Must be provided if either a sparse kernel is provided or is to be computed.

  • create_dense_cpp_kernel_in_python (bool, optional) – Should be set to False ONLY when a similarity kernel is not provided and the dense similarity kernel is to be created directly in C++ rather than in Python. Default is True.

  • pybind_mode (string, optional) – Specifies the mode of pybind type conversion from Python to CPP. Can be one of “list”, “numpyarray” and “array”. “list” is the slowest, requiring conversion of numpy arrays to Python lists. “numpyarray” relies on automatic conversion. “array” leverages native data types and is the fastest. Default is “array”.
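A minimal instantiation sketch, assuming random feature data (the similarity kernel is then computed internally from data):

    import numpy as np
    from submodlib.functions.facilityLocation import FacilityLocationFunction

    data = np.random.rand(50, 5)     # 50 ground set items with 5 features each (assumed data)

    # Dense mode: the n x n kernel is computed internally from data using the given metric.
    fl_dense = FacilityLocationFunction(n=50, mode="dense", data=data, metric="euclidean")

    # Sparse mode: additionally requires num_neighbors for the nearest-neighbor kernel.
    fl_sparse = FacilityLocationFunction(n=50, mode="sparse", data=data,
                                         metric="euclidean", num_neighbors=10)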

clearMemoization()

Clear the computed memoized statistics, if any.

evaluate(X)

Computes the score of a set as per the above math.

Parameters

X (set) – The set whose score needs to be computed. Must be a subset of the effective ground set.

Returns

The evaluation score of the given set.

Return type

float
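For instance, continuing the constructor sketch above:

    score = fl_dense.evaluate({0, 3, 7})     # f(X) for X = {0, 3, 7}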

evaluateWithMemoization(X)

Efficiently compute the function evaluation of a set assuming that memoized statistics for it are already computed.

Parameters

X (set) – The set on which the function needs to be evaluated. It must be a subset of the effective ground set.

Returns

The function evaluation score on the given set.

Return type

float

getEffectiveGroundSet()

Get the effective ground set of this object.

marginalGain(X, element)

Computes the marginal gain in score of this function when a single item (element) is added to a set (X).

Parameters
  • X (set) – Set on which the marginal gain of adding an element has to be calculated. It must be a subset of the effective ground set.

  • element (int) – Element for which the marginal gain is to be calculated. It must be from the effective ground set.

Returns

Marginal gain of adding element to X.

Return type

float
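Continuing the same sketch, the gain of adding element 5 to X = {0, 3}:

    gain = fl_dense.marginalGain({0, 3}, 5)     # equals f({0, 3, 5}) - f({0, 3})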

marginalGainWithMemoization(X, element)

Efficiently find the marginal gain in score when a single item (element) is added to a set (X) assuming that memoized statistics for X are already computed.

Parameters
  • X (set) – Set on which the marginal gain of adding an element has to be calculated. It must be a subset of the effective ground set and its memoized statistics should have already been computed.

  • element (int) – Element for which the marginal gain is to be calculated. It must be from the effective ground set.

Returns

Marginal gain of adding element to X.

Return type

float

maximize(budget, optimizer='NaiveGreedy', stopIfZeroGain=False, stopIfNegativeGain=False, epsilon=0.1, verbose=False, show_progress=True, costs=None, costSensitiveGreedy=False)

Compute the optimal subset with maximum score for the given budget.

Parameters
  • budget (int) – Desired size of the optimal set.

  • optimizer (string) – The optimizer that should be used to compute the optimal set. Can be ‘NaiveGreedy’, ‘StochasticGreedy’, ‘LazyGreedy’ or ‘LazierThanLazyGreedy’.

  • stopIfZeroGain (bool) – Set to True if maximization should terminate as soon as gain of adding any other item becomes zero. When True, size of optimal set can thus be potentially less than the budget.

  • stopIfNegativeGain (bool) – Set to True if maximization should terminate as soon as the best gain in an iteration is negative. When True, this can potentially lead to optimal set of size less than the budget.

  • epsilon (float) – Used by Stochastic (Random) Greedy and Lazier Than Lazy Greedy to compute the size of the random set.

  • verbose (bool) – Set to True to trace/debug the execution of the maximization algorithm.

  • show_progress (bool) – Set to True to see a progress bar during maximization.

  • costs (list, optional) – List containing the cost of each element of the ground set. Costs contribute to the budget. When costSensitiveGreedy is set to True, the marginal gain is divided by the cost to identify the next best element to add in every iteration. Default is None, which means all ground set elements have cost = 1. It is possible to specify costs and yet have costSensitiveGreedy set to False. This corresponds to using regular marginal gains, while the budget gets filled as per the costs of the selected items.

  • costSensitiveGreedy (bool, optional) – When set to True, the next best candidate in every iteration is decided based on their marginal gain divided by cost. When True, it is mandatory to provide costs. Defaults to False.

Returns

The optimal set of size budget.

Return type

set
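A usage sketch with the constructor example from above (the optimizer and budget values are arbitrary choices):

    # Greedily select a subset of size 10; see the Returns section above for the result format.
    selected = fl_dense.maximize(budget=10, optimizer='LazyGreedy',
                                 stopIfZeroGain=False, stopIfNegativeGain=False,
                                 verbose=False, show_progress=False)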

setMemoization(X)

Compute and store the memoized statistics for subset X.

Parameters

X (set) – The set for which memoized statistics need to be computed and set, overwriting any existing memoized statistics.

updateMemoization(X, element)

Update the memoized statistics of a set X due to adding an element to it. Assumes that memoized statistics are already computed for X. Note that the element is not added to the set and only the memoized statistics are updated. The actual insertion of element to X is the responsibility of the caller.

Parameters
  • X (set) – Set whose memoized statistics must already be computed and to which the element needs to be added for the sake of updating the memoized statistics.

  • element (int) – Element that is being added to X leading to update of memoized statistics. It must be from the effective ground set.
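Together, setMemoization, marginalGainWithMemoization, evaluateWithMemoization and updateMemoization support an incremental greedy loop. The sketch below (using fl_dense from the constructor example, with an assumed budget of 5, and assuming getEffectiveGroundSet() returns a Python set) shows the intended call order; note that updateMemoization only updates the statistics, so the caller still inserts the element into X.

    X = set()
    fl_dense.setMemoization(X)                           # initialize statistics for the empty set
    for _ in range(5):                                   # assumed budget of 5
        remaining = fl_dense.getEffectiveGroundSet() - X
        best = max(remaining,
                   key=lambda e: fl_dense.marginalGainWithMemoization(X, e))
        fl_dense.updateMemoization(X, best)              # update the statistics first ...
        X.add(best)                                      # ... then insert the element ourselves
    print(fl_dense.evaluateWithMemoization(X))           # f(X) via the memoized statistics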