| Signature | Description | Parameters |
|---|---|---|
| `template<arithmetic T, typename ... Ts> std::vector<DataFrame> get_data_by_dbscan(const char *col_name, long min_members, double max_distance, std::function<double(const T &x, const T &y)> &&dfunc = [](const T &x, const T &y) -> double { return ((x - y) * (x - y)); }) const;` | Uses the DBSCAN algorithm to divide the named column into clusters. It returns a vector of DataFrames, each containing one cluster of data based on the named column. The last DataFrame in the vector contains the noisy data, that is, the data points that could not be placed into any cluster. Ideally, the last DataFrame is empty. Unlike K-Means clustering, you do not have to specify the number of clusters. Self is unchanged.<br>NOTE: Type T must support arithmetic operations.<br>NOTE: If this returns zero centroids (zero DataFrames), it is probably because the number of iterations is too small to converge. | `T`: Type of the named column<br>`Ts`: The list of types for all columns. A type should be specified only once<br>`col_name`: Name of the data column<br>`min_members`: Minimum number of data points to constitute a cluster<br>`max_distance`: Maximum distance between two data points in the same cluster<br>`dfunc`: A function to calculate the distance between two data points in the named column |
| `template<arithmetic T, typename ... Ts> std::vector<PtrView> get_view_by_dbscan(const char *col_name, long min_members, double max_distance, std::function<double(const T &x, const T &y)> &&dfunc = [](const T &x, const T &y) -> double { return ((x - y) * (x - y)); });` | Identical to get_data_by_dbscan() above, but it returns a std::vector of views (PtrView) instead of DataFrames. | `T`: Type of the named column<br>`Ts`: The list of types for all columns. A type should be specified only once<br>`col_name`: Name of the data column<br>`min_members`: Minimum number of data points to constitute a cluster<br>`max_distance`: Maximum distance between two data points in the same cluster<br>`dfunc`: A function to calculate the distance between two data points in the named column |
| `template<arithmetic T, typename ... Ts> std::vector<ConstPtrView> get_view_by_dbscan(const char *col_name, long min_members, double max_distance, std::function<double(const T &x, const T &y)> &&dfunc = [](const T &x, const T &y) -> double { return ((x - y) * (x - y)); }) const;` | Same as the view version above, but it returns a std::vector of const views. You cannot change data in const views. However, if the data is changed in the original DataFrame or through another view, the change is reflected in the const view. | `T`: Type of the named column<br>`Ts`: The list of types for all columns. A type should be specified only once<br>`col_name`: Name of the data column<br>`min_members`: Minimum number of data points to constitute a cluster<br>`max_distance`: Maximum distance between two data points in the same cluster<br>`dfunc`: A function to calculate the distance between two data points in the named column |
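
To make the roles of `min_members`, `max_distance`, and `dfunc` concrete, here is a minimal standalone sketch of a textbook DBSCAN over one column of values. It is not the library's implementation: `dbscan_1d` and `run_example` are hypothetical names, and two details are assumptions made for illustration, namely that a point counts itself toward `min_members` and that two points are neighbors when `dfunc(x, y) <= max_distance`. Like the functions above, it returns the clusters first and the noise group last.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical helper, NOT part of the DataFrame API. Returns one vector per
// cluster, followed by a final vector holding the noise points.
std::vector<std::vector<double>>
dbscan_1d(const std::vector<double> &data,
          std::size_t min_members,
          double max_distance,
          const std::function<double(const double &, const double &)> &dfunc =
              [](const double &x, const double &y) -> double {
                  return ((x - y) * (x - y));  // same default as the table
              })
{
    constexpr int UNVISITED = -2;
    constexpr int NOISE = -1;
    std::vector<int> label(data.size(), UNVISITED);

    // Indices of all points within max_distance of point i (i included).
    // Assumption: the comparison is inclusive (<=).
    auto neighbors = [&](std::size_t i) {
        std::vector<std::size_t> out;
        for (std::size_t j = 0; j < data.size(); ++j)
            if (dfunc(data[i], data[j]) <= max_distance)
                out.push_back(j);
        return out;
    };

    int cluster_count = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (label[i] != UNVISITED) continue;
        std::vector<std::size_t> seeds = neighbors(i);
        if (seeds.size() < min_members) {  // too sparse to start a cluster
            label[i] = NOISE;
            continue;
        }
        const int cid = cluster_count++;
        label[i] = cid;
        // Grow the cluster; seeds may gain entries while we iterate by index.
        for (std::size_t k = 0; k < seeds.size(); ++k) {
            const std::size_t q = seeds[k];
            if (label[q] == NOISE) label[q] = cid;  // noise becomes a border point
            if (label[q] != UNVISITED) continue;
            label[q] = cid;
            const std::vector<std::size_t> more = neighbors(q);
            if (more.size() >= min_members)  // q is dense too, keep expanding
                seeds.insert(seeds.end(), more.begin(), more.end());
        }
    }

    // Clusters occupy slots [0, cluster_count); noise goes in the last slot.
    std::vector<std::vector<double>> result(
        static_cast<std::size_t>(cluster_count) + 1);
    for (std::size_t i = 0; i < data.size(); ++i) {
        const std::size_t slot = static_cast<std::size_t>(
            label[i] == NOISE ? cluster_count : label[i]);
        result[slot].push_back(data[i]);
    }
    return result;
}

// Example run: two dense groups plus one outlier, absolute difference metric.
inline void run_example()
{
    const std::vector<double> values{1.0, 1.5, 2.0, 10.0, 10.5, 11.0, 50.0};
    const auto groups = dbscan_1d(
        values, 3, 1.5,
        [](const double &x, const double &y) -> double {
            return std::fabs(x - y);
        });

    assert(groups.size() == 3);         // two clusters + trailing noise group
    assert(groups.back().size() == 1);  // only 50.0 is left as noise
    for (std::size_t g = 0; g < groups.size(); ++g) {
        std::cout << (g + 1 == groups.size() ? "noise:" : "cluster:");
        for (const double v : groups[g]) std::cout << ' ' << v;
        std::cout << '\n';
    }
}
```

Calling `run_example()` from `main` groups the values around 1 and around 10 into two clusters and leaves 50.0 in the trailing noise group, mirroring how the last DataFrame or view holds the points that fit no cluster.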
```cpp
void test_get_data_by_dbscan()  {

    std::cout << "\nTesting get_data_by_dbscan( ) ..." << std::endl;

    typedef StdDataFrame64<std::string> StrDataFrame;

    StrDataFrame    df;

    try  {
        df.read("SHORT_IBM.dat", io_format::binary);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }

    StrDataFrame    df2 = df;

    auto    lbd =
        [](const std::string &, const double &) -> bool { return (true); };
    auto    view =
        df2.get_view_by_sel<double, decltype(lbd), double, long>
            ("IBM_Open", lbd);

    // I am using both views and dataframes to make sure both work
    //
    auto    views =
        view.get_view_by_dbscan<double, double, long>
            ("IBM_Close", 10, 4,
             [](const double &x, const double &y) -> double  {
                 return (std::fabs(x - y));
             });
    auto    dfs =
        df.get_data_by_dbscan<double, double, long>
            ("IBM_Close", 10, 4,
             [](const double &x, const double &y) -> double  {
                 return (std::fabs(x - y));
             });

    assert(views.size() == 20);

    assert(views[0].get_index().size() == 11);
    assert(views[0].get_column<double>("IBM_Close")[7] == 184.779999);

    assert(dfs[5].get_index().size() == 127);
    assert(dfs[5].get_column<double>("IBM_Open")[15] == 162.0);

    assert(views[16].get_index().size() == 29);
    assert(views[16].get_column<double>("IBM_High")[3] == 117.75);

    // This is the last DataFrame which contains the data corresponding to
    // noisy close prices
    //
    assert(views[19].get_index().size() == 2);
    assert(views[19].get_column<long>("IBM_Volume")[0] == 10546500);
    assert(views[19].get_index()[1] == "2020-03-23");
}
```
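
The test above passes an absolute-difference `dfunc` instead of relying on the default squared difference, which changes what a `max_distance` of 4 means. The sketch below illustrates the arithmetic only. The helper names `squared_diff`, `abs_diff`, and `within` are hypothetical, and the inclusive comparison in `within` is an assumption, not a statement about the library's internals.

```cpp
#include <cassert>
#include <cmath>

// Default metric from the signatures in the table: squared difference.
inline double squared_diff(const double &x, const double &y)
{
    return ((x - y) * (x - y));
}

// Metric used in the test: absolute difference.
inline double abs_diff(const double &x, const double &y)
{
    return (std::fabs(x - y));
}

// Hypothetical helper: true when two values count as neighbors under the
// given metric. Assumption: the comparison is inclusive (<=).
template <typename F>
bool within(double x, double y, double max_distance, F dfunc)
{
    return (dfunc(x, y) <= max_distance);
}

// Two prices 3.0 apart with max_distance = 4:
//   squared difference = 9.0, so they are NOT neighbors;
//   absolute difference = 3.0, so they ARE neighbors.
inline bool metrics_disagree()
{
    return (!within(100.0, 103.0, 4.0, squared_diff) &&
            within(100.0, 103.0, 4.0, abs_diff));
}
```

With the default metric, `max_distance` is effectively expressed in squared units, so a value of 4 links only points whose raw difference is below 2. With the absolute-difference metric the same value links points whose raw difference is below 4.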