
Signature Description Parameters
template<typename T, typename ... Ts>
std::vector<DataFrame>
get_data_by_dbscan(const char *col_name,
                   long min_members,
                   double max_distance) const;
This uses the DBSCAN algorithm to divide the named column into clusters. It returns a vector of DataFrames, each containing one of the clusters of data based on the named column. The last DataFrame in the vector contains the noisy data, i.e. the datapoints that could not be placed into any cluster. Ideally, you want the last DataFrame to be empty. Unlike K-Means clustering, you do not have to specify the number of clusters.
This works for both scalar and multidimensional (i.e. vectors/arrays) data types.
Self is unchanged.

NOTE: Currently this only uses the default distance functions in the DBSCAN visitor.
NOTE: If this returns zero centroids (zero DataFrames), it is probably because the number of iterations is too small to converge.
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
min_members: Minimum number of datapoints to constitute a cluster
max_distance: Maximum distance between two data points in the same cluster
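To make the roles of min_members and max_distance concrete, here is a minimal standalone sketch of DBSCAN on scalar data. It is not the library's implementation: dbscan_1d is a hypothetical helper, it uses absolute difference as the distance, it counts a point as its own neighbor, and it treats max_distance as inclusive. All of these are assumptions made for illustration. Like get_data_by_dbscan(), it puts the noise group last.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not part of the DataFrame API.
// Returns groups of indices into `data`. The last group holds the noise points.
std::vector<std::vector<std::size_t>>
dbscan_1d(const std::vector<double> &data, long min_members, double max_distance)
{
    assert(min_members > 0 && max_distance >= 0.0);

    constexpr long UNVISITED = -2;
    constexpr long NOISE = -1;
    const std::size_t n = data.size();
    std::vector<long> label(n, UNVISITED);

    // All points within max_distance of point i (assumption: i counts itself)
    auto neighbors = [&](std::size_t i) {
        std::vector<std::size_t> out;
        for (std::size_t j = 0; j < n; ++j)
            if (std::fabs(data[i] - data[j]) <= max_distance)
                out.push_back(j);
        return out;
    };

    long cluster_count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (label[i] != UNVISITED) continue;

        std::vector<std::size_t> seeds = neighbors(i);
        if (static_cast<long>(seeds.size()) < min_members) {
            label[i] = NOISE;  // may still be claimed later as a border point
            continue;
        }

        // Point i is a core point: start a new cluster and grow it
        const long c = cluster_count++;
        label[i] = c;
        for (std::size_t k = 0; k < seeds.size(); ++k) {
            const std::size_t q = seeds[k];
            if (label[q] == NOISE) label[q] = c;  // border point joins the cluster
            if (label[q] != UNVISITED) continue;
            label[q] = c;
            const std::vector<std::size_t> more = neighbors(q);
            if (static_cast<long>(more.size()) >= min_members)
                seeds.insert(seeds.end(), more.begin(), more.end());
        }
    }

    // One group per cluster, plus a final group for noise
    std::vector<std::vector<std::size_t>> groups(
        static_cast<std::size_t>(cluster_count) + 1);
    for (std::size_t i = 0; i < n; ++i) {
        const long g = (label[i] == NOISE) ? cluster_count : label[i];
        groups[static_cast<std::size_t>(g)].push_back(i);
    }
    return groups;
}
```

For example, calling dbscan_1d on the values 1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.9 with min_members = 3 and max_distance = 0.5 yields two clusters of three points each, followed by a noise group holding only the index of 9.9. Raising max_distance or lowering min_members shrinks the noise group, which is the tuning the description refers to when it says the last DataFrame should ideally be empty.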
template<typename T, typename ... Ts>
std::vector<PtrView>
get_view_by_dbscan(const char *col_name,
                   long min_members,
                   double max_distance);
This is identical to get_data_by_dbscan() above, but:
  1. The result is a std::vector of views
  2. Since the result is a view, you cannot call make_consistent() on the result.
NOTE: There are certain operations that you cannot do with a view. For example, you cannot add/delete columns, etc.
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
min_members: Minimum number of datapoints to constitute a cluster
max_distance: Maximum distance between two data points in the same cluster
template<typename T, typename ... Ts>
std::vector<ConstPtrView>
get_view_by_dbscan(const char *col_name,
                   long min_members,
                   double max_distance) const;
Same as the view version above, but it returns a std::vector of const views. You cannot change data in const views. But if the data is changed in the original DataFrame or through another view, the change is reflected in the const view.
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
min_members: Minimum number of datapoints to constitute a cluster
max_distance: Maximum distance between two data points in the same cluster
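The difference between the two view flavors can be pictured with plain pointers. The sketch below is only an analogy under the assumption that a view refers to the original data rather than copying it, as the descriptions above state. MutableRows, ConstRows, make_rows and make_const_rows are hypothetical names and are not the library's PtrView or ConstPtrView types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; NOT the library's view types.
using MutableRows = std::vector<double *>;
using ConstRows = std::vector<const double *>;

// Builds a writable "view" over the selected rows of a column
MutableRows make_rows(std::vector<double> &column,
                      const std::vector<std::size_t> &picks)
{
    MutableRows out;
    for (std::size_t i : picks) {
        assert(i < column.size());
        out.push_back(&column[i]);
    }
    return out;
}

// Builds a read-only "view": elements cannot be assigned through it
ConstRows make_const_rows(const std::vector<double> &column,
                          const std::vector<std::size_t> &picks)
{
    ConstRows out;
    for (std::size_t i : picks) {
        assert(i < column.size());
        out.push_back(&column[i]);
    }
    return out;
}
```

With a column holding 1.0, 2.0, 3.0, 4.0 and both views built over rows 1 and 3, writing 20.0 through the writable view or assigning to the column directly is visible through the read-only view, while an assignment through the read-only view does not compile.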
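#include <DataFrame/DataFrame.h>

#include <cassert>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <string>

using namespace hmdf;

void test_get_data_by_dbscan()  {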

    std::cout << "\nTesting get_data_by_dbscan( ) ..." << std::endl;

    typedef StdDataFrame64<std::string> StrDataFrame;

    StrDataFrame    df;

    try  {
        df.read("SHORT_IBM.dat", io_format::binary);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
        ::exit(-1);
    }

    StrDataFrame    df2 = df;

    auto    lbd = [](const std::string &, const double &) -> bool { return (true); };
    auto    view = df2.get_view_by_sel<double, decltype(lbd), double, long>("IBM_Open", lbd);

    // I am using both views and dataframes to make sure both work
    //
    auto    views = view.get_view_by_dbscan<double, double, long>("IBM_Close", 10, 4);
    auto    dfs = df.get_data_by_dbscan<double, double, long>("IBM_Close", 10, 4);

    assert(views.size() == 36);
    assert(dfs.size() == 36);

    assert(views[0].get_index().size() == 5);
    assert(std::fabs(views[0].get_column<double>("IBM_Close")[4] - 185.69) < 0.001);

    assert(dfs[5].get_index().size() == 30);
    assert(std::fabs(dfs[5].get_column<double>("IBM_Open")[15] - 180.87) < 0.001);

    assert(views[16].get_index().size() == 39);
    assert(std::fabs(views[16].get_column<double>("IBM_High")[3] - 170.85) < 0.001);

    // This is the last DataFrame which contains the data corresponding to
    // noisy close prices
    //
    assert(views[35].get_index().size() == 16);
    assert(views[35].get_column<long>("IBM_Volume")[0] == 3821400);
    assert(views[35].get_index()[1] == "2020-03-12");
}

C++ DataFrame