Back to Documentations

Signature Description Parameters
#include <DataFrame/DataFrameMLVisitors.h>

template<size_t K, typename T, typename I = unsigned long,
         std::size_t A = 0>
struct SpectralClusteringVisitor;
This is a single action visitor, meaning it is passed the whole data vector in one call and you must use the single_act_visit() interface.

Spectral clustering is a graph-based algorithm for finding k arbitrarily shaped clusters in data. The technique involves representing the data in a low dimension. In the low dimension, clusters in the data are more widely separated, enabling you to use KMeans clustering. This low dimension is based on eigenvectors of a Laplacian matrix. A Laplacian matrix is one way of representing a similarity graph that models the local neighborhood relationships between data points as an undirected graph. You can use spectral clustering when you know the number of clusters, but the algorithm also provides a way to estimate the number of clusters in your data.
Spectral clustering is very interesting and different from other clustering algorithms. Other clustering algorithms cluster the data, one way or another, based on proximity of data. But spectral algorithm clusters based on patterns. It doesn’t cluster the actual data. It clusters the Eigenvectors of the Laplacian matrix. The downside is it requires very intensive calculations and it might be slow for large datasets.

The constructor takes 4 parameters
  1. Number of iterations for KMeans calculations
  2. sigma -- used in a Gaussian kernel function, that controls the "width" of the neighborhood considered when calculating similarities between datapoints; essentially, how far apart two points can be while still being considered "similar" to each other
  3. A function to calculate similarity between two datapoints of type T (with default)
  4. Seed for random number generation. The defualt is to generate different set each time.
SpectralClusteringVisitor(
    size_type num_of_iter,
    double sigma,
    seed_t seed = seed_t(-1),
    similarity_func sf =
        [](const value_type &x, const value_type &y, double sigma) -> double  {
            return (std::exp(-((x - y) * (x - y)) / (2 * sigma * sigma)));
        });
        
get_results() returns an array of K VectorPtrView's which contain the data clustered around the K means of the eigenvectors of the Laplacian matrix. The first element in each VectorPtrView is the mean and the reset are the data points belonging to that cluster.
get_clusters_idxs() returns an array of K std::vector<size_type>'s which contains indices of the data in each cluster.

NOTE: There is no interface to return the K means/centroids. The means are the means of the eigenvectors of the Laplacian matrix which are not particularly useful.
K: Number of centroids to find
T: Column data type
I: Index type
A: Memory alignment boundary for vectors. Default is system default alignment
static void test_SpectralClusteringVisitor()  {

    std::cout << "\nTesting SpectralClusteringVisitor{ } ..." << std::endl;

    typedef StdDataFrame64<std::string> StrDataFrame;

    StrDataFrame    df;

    try  {
        df.read("SHORT_IBM.csv", io_format::csv2, { .starting_row = 0, .num_rows = 1000 });
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
        ::exit(-1);
    }

    spect_v<4, double, std::string, 64> spc(1000, 7, 123);

    df.single_act_visit<double>("IBM_Close", spc);

    const auto  &clusters = spc.get_result();

    assert(clusters.size() == 4);

    assert(clusters[0].size() == 16);
    assert(clusters[1].size() == 89);
    assert(clusters[2].size() == 207);
    assert(clusters[3].size() == 688);

    assert(std::fabs(clusters[0][0] - 121.86) < 0.001);
    assert(std::fabs(clusters[0][8] - 124.83) < 0.001);
    assert(std::fabs(clusters[0][12] - 120.19) < 0.001);
    assert(std::fabs(clusters[0][15] - 122.74) < 0.001);

    assert(std::fabs(clusters[1][0] - 177.9) < 0.001);
    assert(std::fabs(clusters[1][23] - 171.12) < 0.001);
    assert(std::fabs(clusters[1][61] - 177.18) < 0.001);
    assert(std::fabs(clusters[1][88] - 170.05) < 0.001);

    assert(std::fabs(clusters[2][0] - 185.53) < 0.001);
    assert(std::fabs(clusters[2][100] - 181.22) < 0.001);
    assert(std::fabs(clusters[2][200] - 179.82) < 0.001);
    assert(std::fabs(clusters[2][206] - 179.45) < 0.001);

    assert(std::fabs(clusters[3][0] - 169.1) < 0.001);
    assert(std::fabs(clusters[3][300] - 140.19) < 0.001);
    assert(std::fabs(clusters[3][542] - 152.51) < 0.001);
    assert(std::fabs(clusters[3][687] - 153.23) < 0.001);
}

C++ DataFrame