
Signature Description Parameters
template<arithmetic T, typename ... Ts>
std::vector<DataFrame>
get_data_by_dbscan(const char *col_name,
                   long min_members,
                   double max_distance,
                   std::function<double(const T &x, const T &y)> &&dfunc =
                       [](const T &x, const T &y) -> double  {
                           return ((x - y) * (x - y));
                       }) const;
This uses the DBSCAN algorithm to divide the named column into clusters. It returns a vector of DataFrames, each containing one cluster of data based on the named column. The last DataFrame in the vector contains the noise, i.e. the datapoints that could not be placed into any cluster. Ideally, the last DataFrame is empty. Unlike K-Means clustering, you do not have to specify the number of clusters.
Self is unchanged.

NOTE: Type T must support arithmetic operations
NOTE: If this returns zero centroids (zero DataFrames) it is probably because number of iterations is too small to converge.
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
min_members: Minimum number of datapoints to constitute a cluster
max_distance: Maximum distance between two data points in the same cluster
dfunc: A function to calculate the distance between two data points in the named column
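
The clustering described above can be illustrated with a stand-alone sketch. It is not the library's implementation: dbscan_sketch() is a hypothetical helper that works on a plain std::vector<double>, and it assumes that a point counts as its own neighbor and that the threshold test is dfunc(x, y) <= max_distance. It only shows how min_members, max_distance and dfunc interact, and why the noise ends up in the last group.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper, NOT part of the C++ DataFrame API.
// Returns one vector per cluster; the last vector holds the noise.
std::vector<std::vector<double>>
dbscan_sketch(const std::vector<double> &col,
              long min_members,
              double max_distance,
              const std::function<double(const double &, const double &)> &dfunc =
                  [](const double &x, const double &y) -> double  {
                      return ((x - y) * (x - y));
                  })  {

    constexpr long  UNVISITED = -2;
    constexpr long  NOISE = -1;
    std::vector<long>   label(col.size(), UNVISITED);

    // Indices of all points within max_distance of point i.
    // Assumption: point i itself is included in the count.
    auto    neighbors = [&](std::size_t i)  {
        std::vector<std::size_t>    result;

        for (std::size_t j = 0; j < col.size(); ++j)
            if (dfunc(col[i], col[j]) <= max_distance)
                result.push_back(j);
        return (result);
    };

    long    cluster_count = 0;

    for (std::size_t i = 0; i < col.size(); ++i)  {
        if (label[i] != UNVISITED)  continue;

        auto    seeds = neighbors(i);

        if (static_cast<long>(seeds.size()) < min_members)  {
            label[i] = NOISE;  // May still be claimed later as a border point
            continue;
        }

        const long  id = cluster_count++;

        label[i] = id;
        // Grow the cluster. seeds may get longer while we walk it by index.
        for (std::size_t k = 0; k < seeds.size(); ++k)  {
            const std::size_t   q = seeds[k];

            if (label[q] == NOISE)  label[q] = id;  // Border point
            if (label[q] != UNVISITED)  continue;
            label[q] = id;

            const auto  more = neighbors(q);

            // Only core points extend the cluster further
            if (static_cast<long>(more.size()) >= min_members)
                seeds.insert(seeds.end(), more.begin(), more.end());
        }
    }

    // One group per cluster plus a final group for the noise
    std::vector<std::vector<double>>    out(
        static_cast<std::size_t>(cluster_count) + 1);

    for (std::size_t i = 0; i < col.size(); ++i)  {
        const std::size_t   group = static_cast<std::size_t>(
            label[i] == NOISE ? cluster_count : label[i]);

        out[group].push_back(col[i]);
    }
    return (out);
}
```

For the column { 1.0, 1.5, 2.0, 2.5, 10.0, 10.5, 11.0, 50.0 } with min_members = 3, max_distance = 0.5 and an absolute-difference dfunc, the sketch returns two clusters of 4 and 3 values, followed by a noise group holding only 50.0.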
template<arithmetic T, typename ... Ts>
std::vector<PtrView>
get_view_by_dbscan(const char *col_name,
                   long min_members,
                   double max_distance,
                   std::function<double(const T &x, const T &y)> &&dfunc =
                       [](const T &x, const T &y) -> double  {
                           return ((x - y) * (x - y));
                       });

This is identical to get_data_by_dbscan() above, but:
  1. The result is a std::vector of views
  2. Since the result is a view, you cannot call make_consistent() on the result.
NOTE: There are certain operations that you cannot do with a view. For example, you cannot add or delete columns.
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
min_members: Minimum number of datapoints to constitute a cluster
max_distance: Maximum distance between two data points in the same cluster
dfunc: A function to calculate the distance between two data points in the named column
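
The default dfunc in the signatures above returns the squared difference, so max_distance is measured in squared units of the column unless you pass your own function. The sketch below compares the default with an absolute-difference function. The helper names are hypothetical, and the threshold test dist <= max_distance is an assumption about how the result of dfunc is compared with max_distance.

```cpp
#include <cassert>
#include <cmath>

// Same formula as the default dfunc in the signature: squared difference
double squared_diff(const double &x, const double &y)  {
    return ((x - y) * (x - y));
}

// Alternative dfunc: absolute difference, in the column's own units
double abs_diff(const double &x, const double &y)  {
    return (std::fabs(x - y));
}

// Hypothetical threshold test. Assumption: the value returned by dfunc is
// compared directly with max_distance.
bool within(double dist, double max_distance)  {
    return (dist <= max_distance);
}

// Gap in column units that a given max_distance allows when the
// squared-difference dfunc is used
double gap_for_squared(double max_distance)  {
    return (std::sqrt(max_distance));
}
```

Under these assumptions, max_distance = 4 with the default dfunc accepts two values that are at most 2 apart, while the same max_distance with abs_diff accepts values that are up to 4 apart.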
template<arithmetic T, typename ... Ts>
std::vector<ConstPtrView>
get_view_by_dbscan(const char *col_name,
                   long min_members,
                   double max_distance,
                   std::function<double(const T &x, const T &y)> &&dfunc =
                       [](const T &x, const T &y) -> double  {
                           return ((x - y) * (x - y));
                       }) const;
Same as the view version above, but it returns a std::vector of const views. You cannot change data through a const view. But if the data is changed in the original DataFrame or through another view, the change is reflected in the const view.
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
min_members: Minimum number of datapoints to constitute a cluster
max_distance: Maximum distance between two data points in the same cluster
dfunc: A function to calculate the distance between two data points in the named column
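
The relationship between the original data, a view and a const view described above can be pictured with plain pointers. This is only an illustration under the assumption that a view refers to the original storage rather than copying it. The type aliases and helper functions are hypothetical and do not reflect how PtrView or ConstPtrView are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for a view and a const view over one column
using ViewSketch = std::vector<double *>;
using ConstViewSketch = std::vector<const double *>;

// Builds a writable view over the selected rows of col.
// The view is only valid as long as col is not resized or destroyed.
ViewSketch make_view(std::vector<double> &col,
                     const std::vector<std::size_t> &rows)  {
    ViewSketch  view;

    for (const std::size_t r : rows)
        view.push_back(&col[r]);
    return (view);
}

// Builds a read-only view over the selected rows of col
ConstViewSketch make_const_view(const std::vector<double> &col,
                                const std::vector<std::size_t> &rows)  {
    ConstViewSketch view;

    for (const std::size_t r : rows)
        view.push_back(&col[r]);
    return (view);
}
```

Writing through an element of the ViewSketch changes the underlying column, and a ConstViewSketch over the same rows then reads the new value. Writing through the ConstViewSketch does not compile.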
void test_get_data_by_dbscan()  {

    std::cout << "\nTesting get_data_by_dbscan( ) ..." << std::endl;

    typedef StdDataFrame64<std::string> StrDataFrame;

    StrDataFrame    df;

    try  {
        df.read("SHORT_IBM.dat", io_format::binary);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }

    StrDataFrame    df2 = df;

    auto    lbd = [](const std::string &, const double &) -> bool { return (true); };
    auto    view = df2.get_view_by_sel<double, decltype(lbd), double, long>("IBM_Open", lbd);

    // I am using both views and dataframes to make sure both work
    //
    auto    views = view.get_view_by_dbscan<double, double, long>("IBM_Close", 10, 4,
                                                                  [](const double &x, const double &y) -> double  {
                                                                      return (std::fabs(x - y));
                                                                  });
    auto    dfs = df.get_data_by_dbscan<double, double, long>("IBM_Close", 10, 4,
                                                              [](const double &x, const double &y) -> double  {
                                                                  return (std::fabs(x - y));
                                                              });

    assert(views.size() == 20);

    assert(views[0].get_index().size() == 11);
    assert(views[0].get_column<double>("IBM_Close")[7] == 184.779999);

    assert(dfs[5].get_index().size() == 127);
    assert(dfs[5].get_column<double>("IBM_Open")[15] == 162.0);

    assert(views[16].get_index().size() == 29);
    assert(views[16].get_column<double>("IBM_High")[3] == 117.75);

    // This is the last DataFrame which contains the data corresponding to
    // noisy close prices
    //
    assert(views[19].get_index().size() == 2);
    assert(views[19].get_column<long>("IBM_Volume")[0] == 10546500);
    assert(views[19].get_index()[1] == "2020-03-23");
}

C++ DataFrame