
Signature Description Parameters
#include <DataFrame/DataFrameMLVisitors.h>

template<typename T, typename I = unsigned long,
         std::size_t A = 0>
struct DBSCANVisitor;
This is a single action visitor, meaning it is passed the whole data vector in one call and you must use the single_act_visit() interface.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. It is a density-based, non-parametric clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (those whose nearest neighbors are too far away). DBSCAN is one of the most commonly used and cited clustering algorithms.
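The grouping rule described above can be sketched in standalone C++. This is a minimal brute-force illustration, not the library's implementation: the names dbscan_sketch and DbscanSketchResult are hypothetical, and the sketch assumes a point counts itself among its neighbors and that "within range" means the distance is less than or equal to the threshold.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical result holder for this sketch (not a DataFrame type).
struct DbscanSketchResult {
    std::vector<std::vector<std::size_t>> clusters;  // positions per cluster
    std::vector<std::size_t>              noise;     // positions in no cluster
};

// Brute-force DBSCAN over a 1-D column. A point is a "core" point when at
// least min_pts points (itself included) lie within max_dist of it.
inline DbscanSketchResult
dbscan_sketch(const std::vector<double> &col,
              std::size_t min_pts,
              double max_dist,
              const std::function<double(double, double)> &dist =
                  [](double x, double y) { return std::fabs(x - y); })
{
    assert(min_pts > 0);

    constexpr long UNVISITED = -2, NOISE = -1;
    std::vector<long> label(col.size(), UNVISITED);

    // All positions whose distance to point i is at most max_dist.
    auto neighbors = [&](std::size_t i) {
        std::vector<std::size_t> out;
        for (std::size_t j = 0; j < col.size(); ++j)
            if (dist(col[i], col[j]) <= max_dist) out.push_back(j);
        return out;
    };

    DbscanSketchResult res;
    for (std::size_t i = 0; i < col.size(); ++i) {
        if (label[i] != UNVISITED) continue;
        auto seeds = neighbors(i);
        if (seeds.size() < min_pts) { label[i] = NOISE; continue; }

        // Point i is a core point: start a new cluster from it.
        const long cid = static_cast<long>(res.clusters.size());
        res.clusters.emplace_back();
        label[i] = cid;
        res.clusters.back().push_back(i);

        // Grow the cluster; seeds may be appended to while iterating.
        for (std::size_t k = 0; k < seeds.size(); ++k) {
            const std::size_t q = seeds[k];
            if (label[q] == NOISE) {  // border point: reachable, not core
                label[q] = cid;
                res.clusters.back().push_back(q);
                continue;
            }
            if (label[q] != UNVISITED) continue;
            label[q] = cid;
            res.clusters.back().push_back(q);
            auto more = neighbors(q);
            if (more.size() >= min_pts)  // q is core: expand through it
                seeds.insert(seeds.end(), more.begin(), more.end());
        }
    }
    // Whatever is still labeled NOISE belongs to no cluster.
    for (std::size_t i = 0; i < col.size(); ++i)
        if (label[i] == NOISE) res.noise.push_back(i);
    return res;
}
```

Calling dbscan_sketch on a small column with two tight groups and one far-away value should yield two clusters and a single noise position. The pairwise scan is quadratic in the column length, which is acceptable for illustration only.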
The constructor takes 3 parameters:
  1. Minimum number of data points needed to constitute a cluster
  2. The maximum distance at which a data point is considered to be in the same neighborhood as another data point
  3. A function to calculate the distance between two data points of type T (defaults to the squared difference)
  DBSCANVisitor(long min_mems,
                double max_dist,
                distance_func f = [](const T &x, const T &y) -> double {
                                      return ((x - y) * (x - y));
                                  })
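Because the default distance function returns the squared difference, the same max_dist value means different things depending on which function is supplied. The sketch below illustrates this with plain functions; within is a hypothetical helper, and it assumes the threshold is compared directly against whatever the distance function returns, using less-than-or-equal.

```cpp
#include <cassert>
#include <cmath>

// Same formula as the documented default: squared difference.
inline double squared_dist(const double &x, const double &y) {
    return (x - y) * (x - y);
}

// Alternative: absolute difference between the two values.
inline double abs_dist(const double &x, const double &y) {
    return std::fabs(x - y);
}

// Hypothetical helper: assumes two points count as neighbors when the
// distance function's return value does not exceed max_dist.
template<typename F>
bool within(double x, double y, double max_dist, F dist) {
    assert(max_dist >= 0.0);
    return dist(x, y) <= max_dist;
}
```

Under that assumption, a threshold of 4 with the squared-difference default admits values at most 2 apart, while the same threshold with an absolute-difference function admits values up to 4 apart.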
        
get_result() Returns a vector of vectors containing datapoint values of each cluster.

get_clusters_idxs() Returns a vector of vectors containing indices to datapoints of each cluster.

get_noisey_idxs() Returns a vector containing indices to datapoints that could not be placed in any cluster. Ideally you want this to be empty.
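The per-cluster values and per-cluster indices describe the same grouping from two angles. As a rough illustration, the hypothetical helper gather_clusters below rebuilds the values from the indices. It assumes the indices are zero-based positions into the visited column, which this page does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: map per-cluster positions back to column values.
// Assumes every index is a valid zero-based position into col.
inline std::vector<std::vector<double>>
gather_clusters(const std::vector<double> &col,
                const std::vector<std::vector<std::size_t>> &cluster_idxs)
{
    std::vector<std::vector<double>> out;
    out.reserve(cluster_idxs.size());
    for (const auto &idxs : cluster_idxs) {
        std::vector<double> vals;
        vals.reserve(idxs.size());
        for (const std::size_t i : idxs) {
            assert(i < col.size());  // guard against an out-of-range index
            vals.push_back(col[i]);
        }
        out.push_back(std::move(vals));
    }
    return out;
}
```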
T: Column data type
I: Index type
A: Memory alignment boundary for vectors. Default is system default alignment
static void test_DBSCANVisitor()  {

    std::cout << "\nTesting DBSCANVisitor{ } ..." << std::endl;

    typedef StdDataFrame64<std::string> StrDataFrame;

    StrDataFrame    df;

    try  {
        df.read("SHORT_IBM.csv", io_format::csv2);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }

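    // Build a view over every row: the selection predicate always returns true.
    auto    lbd = [](const std::string &, const double &) -> bool { return (true); };
    auto    view = df.get_view_by_sel<double, decltype(lbd), double, long>("IBM_Open", lbd);

    // At least 10 data points per cluster, a neighborhood distance of 4,
    // and the absolute difference between two values as the distance.
    DBSCANVisitor<double, std::string, 64>  dbscan(10,
                                                   4,
                                                   [](const double &x, const double &y)  {
                                                       return (std::fabs(x - y));
                                                   });

    // Single action visitor: the whole IBM_Close column is passed in one call.
    view.single_act_visit<double>("IBM_Close", dbscan);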

    assert(dbscan.get_noisey_idxs().size() == 2);
    assert(dbscan.get_noisey_idxs()[0] == 1564);
    assert(dbscan.get_noisey_idxs()[1] == 1565);

    assert(dbscan.get_result().size() == 19);
    assert(dbscan.get_result()[0].size() == 11);
    assert(dbscan.get_result()[4].size() == 31);
    assert(dbscan.get_result()[10].size() == 294);
    assert(dbscan.get_result()[14].size() == 82);
    assert(dbscan.get_result()[18].size() == 10);
    assert(dbscan.get_result()[0][6] == 185.679993);
    assert(dbscan.get_result()[4][18] == 167.330002);
    assert(dbscan.get_result()[10][135] == 145.160004);
    assert(dbscan.get_result()[18][3] == 103.550003);
}
