| Signature | Description | Parameters |
|---|---|---|
| `template<typename T, typename ... Ts> std::vector<DataFrame> get_data_by_dbscan(const char *col_name, long min_members, double max_distance) const;` | This uses the DBSCAN algorithm to divide the named column into clusters. It returns a vector of DataFrames, each containing one of the clusters of data based on the named column. The last DataFrame in the vector contains the noise, i.e. the datapoints that could not be placed into any cluster. Ideally, you want the last DataFrame to be empty. Unlike K-Means clustering, you do not have to specify the number of clusters. This works for both scalar and multidimensional (i.e. vectors/arrays) data types. Self is unchanged. NOTE: Currently this only uses the default distance functions in the DBSCAN visitor. NOTE: If this returns zero centroids (zero DataFrames), it is probably because the number of iterations is too small to converge. | `T`: Type of the named column. `Ts`: The list of types for all columns; a type should be specified only once. `col_name`: Name of the data column. `min_members`: Minimum number of datapoints required to constitute a cluster. `max_distance`: Maximum distance between two datapoints in the same cluster. |
| `template<typename T, typename ... Ts> std::vector<PtrView> get_view_by_dbscan(const char *col_name, long min_members, double max_distance);` | This is identical to get_data_by_dbscan() above, but it returns a std::vector of views instead of DataFrames. | `T`: Type of the named column. `Ts`: The list of types for all columns; a type should be specified only once. `col_name`: Name of the data column. `min_members`: Minimum number of datapoints required to constitute a cluster. `max_distance`: Maximum distance between two datapoints in the same cluster. |
| `template<typename T, typename ... Ts> std::vector<ConstPtrView> get_view_by_dbscan(const char *col_name, long min_members, double max_distance) const;` | Same as the view version above, but it returns a std::vector of const views. You cannot change data through a const view. However, if the data is changed in the original DataFrame or through another view, the change is reflected in the const view. | `T`: Type of the named column. `Ts`: The list of types for all columns; a type should be specified only once. `col_name`: Name of the data column. `min_members`: Minimum number of datapoints required to constitute a cluster. `max_distance`: Maximum distance between two datapoints in the same cluster. |

    void test_get_data_by_dbscan()  {

        std::cout << "\nTesting get_data_by_dbscan( ) ..." << std::endl;

        typedef StdDataFrame64<std::string> StrDataFrame;

        StrDataFrame    df;

        try  {
            df.read("SHORT_IBM.dat", io_format::binary);
        }
        catch (const DataFrameError &ex)  {
            std::cout << ex.what() << std::endl;
            ::exit(-1);
        }

        StrDataFrame    df2 = df;

        auto    lbd =
            [](const std::string &, const double &) -> bool { return (true); };
        auto    view =
            df2.get_view_by_sel<double, decltype(lbd), double, long>("IBM_Open", lbd);

        // I am using both views and dataframes to make sure both work
        //
        auto    views =
            view.get_view_by_dbscan<double, double, long>("IBM_Close", 10, 4);
        auto    dfs =
            df.get_data_by_dbscan<double, double, long>("IBM_Close", 10, 4);

        assert(views.size() == 36);
        assert(dfs.size() == 36);

        assert(views[0].get_index().size() == 5);
        assert(std::fabs(views[0].get_column<double>("IBM_Close")[4] - 185.69) < 0.001);

        assert(dfs[5].get_index().size() == 30);
        assert(std::fabs(dfs[5].get_column<double>("IBM_Open")[15] - 180.87) < 0.001);

        assert(views[16].get_index().size() == 39);
        assert(std::fabs(views[16].get_column<double>("IBM_High")[3] - 170.85) < 0.001);

        // This is the last DataFrame which contains the data corresponding to
        // noisy close prices
        //
        assert(views[35].get_index().size() == 16);
        assert(views[35].get_column<long>("IBM_Volume")[0] == 3821400);
        assert(views[35].get_index()[1] == "2020-03-12");
    }
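
To make the roles of min_members and max_distance more concrete, here is a minimal, standalone sketch of textbook DBSCAN over scalar values. It is not the library's implementation. The function name `dbscan_1d` is hypothetical, the distance is assumed to be the absolute difference, and a point is assumed to count as its own neighbor (the library's DBSCAN visitor may differ on both). Like get_data_by_dbscan(), the sketch places the noise in the last element of the result.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not part of the DataFrame API.
// Returns one vector per cluster, followed by a final vector holding noise.
std::vector<std::vector<double>>
dbscan_1d(const std::vector<double> &data,
          long min_members,
          double max_distance)  {

    assert(min_members > 0 && max_distance >= 0.0);

    constexpr long  UNVISITED = -2;
    constexpr long  NOISE = -1;

    const std::size_t   n = data.size();
    std::vector<long>   labels(n, UNVISITED);
    long                cluster_count = 0;

    // Indices of all points within max_distance of point i.
    // Assumption: the point itself is included in its neighborhood.
    auto    neighbors_of = [&](std::size_t i)  {
        std::vector<std::size_t>    result;

        for (std::size_t j = 0; j < n; ++j)
            if (std::fabs(data[i] - data[j]) <= max_distance)
                result.push_back(j);
        return (result);
    };

    for (std::size_t i = 0; i < n; ++i)  {
        if (labels[i] != UNVISITED)  continue;

        std::vector<std::size_t>    frontier = neighbors_of(i);

        // Too few neighbors: provisionally noise (may become a border point)
        if (static_cast<long>(frontier.size()) < min_members)  {
            labels[i] = NOISE;
            continue;
        }

        // Point i is a core point, so it starts a new cluster
        const long  id = cluster_count++;

        labels[i] = id;
        for (std::size_t k = 0; k < frontier.size(); ++k)  {
            const std::size_t   q = frontier[k];

            if (labels[q] == NOISE)  labels[q] = id;  // Border point
            if (labels[q] != UNVISITED)  continue;
            labels[q] = id;

            // Only core points extend the cluster further
            const auto  qn = neighbors_of(q);

            if (static_cast<long>(qn.size()) >= min_members)
                frontier.insert(frontier.end(), qn.begin(), qn.end());
        }
    }

    // Group values by cluster id; the last slot collects the noise
    std::vector<std::vector<double>>    result(cluster_count + 1);

    for (std::size_t i = 0; i < n; ++i)  {
        const std::size_t   slot =
            labels[i] == NOISE ? static_cast<std::size_t>(cluster_count)
                               : static_cast<std::size_t>(labels[i]);

        result[slot].push_back(data[i]);
    }
    return (result);
}
```

For example, calling `dbscan_1d({ 1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0 }, 3, 0.5)` yields two clusters of three values each, followed by a noise vector that holds only 20.0. Raising min_members or lowering max_distance pushes more datapoints into the final noise vector.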