| Signature | Description | Parameters |
|---|---|---|
| `template<typename T, typename ... Ts> std::vector<DataFrame> get_data_by_birch(const char *col_name, long k, double threshold, long max_entries = 1000, long max_iter = 1000) const;` | Uses the BIRCH algorithm to divide the named column into clusters. It returns a std::vector of DataFrames, each containing one of the clusters of data based on the named column. It ultimately uses the K-Means algorithm to calculate the centroids and clusters, hence you must specify k. This works for both scalar and multidimensional (i.e. vectors/arrays) data types. Self is unchanged. NOTE: Currently this only uses the default distance functions in the BIRCH visitor. | T: Type of the named column<br>Ts: List all the types of all data columns. A type should be specified in the list only once.<br>col_name: Name of the given column<br>k: Number of clusters used by the K-Means algorithm<br>threshold: Maximum radius (standard deviation) allowed for a CF (Clustering Feature) entry<br>max_entries: Maximum number of CF (Clustering Feature) entries that can be stored in the flat CF-Tree before it needs to be rebuilt<br>max_iter: Maximum number of iterations K-Means may run to converge |
| `template<typename T, typename ... Ts> std::vector<PtrView> get_view_by_birch(const char *col_name, long k, double threshold, long max_entries = 1000, long max_iter = 1000);` | Identical to get_data_by_birch() above, but it returns a std::vector of views (PtrView) instead of DataFrames. | T: Type of the named column<br>Ts: List all the types of all data columns. A type should be specified in the list only once.<br>col_name: Name of the given column<br>k: Number of clusters used by the K-Means algorithm<br>threshold: Maximum radius (standard deviation) allowed for a CF (Clustering Feature) entry<br>max_entries: Maximum number of CF (Clustering Feature) entries that can be stored in the flat CF-Tree before it needs to be rebuilt<br>max_iter: Maximum number of iterations K-Means may run to converge |
| `template<typename T, typename ... Ts> std::vector<ConstPtrView> get_view_by_birch(const char *col_name, long k, double threshold, long max_entries = 1000, long max_iter = 1000) const;` | Same as the view version above, but it returns a std::vector of const views. You cannot change data through a const view. However, if the data is changed in the original DataFrame or through another view, the change is reflected in the const view. | T: Type of the named column<br>Ts: List all the types of all data columns. A type should be specified in the list only once.<br>col_name: Name of the given column<br>k: Number of clusters used by the K-Means algorithm<br>threshold: Maximum radius (standard deviation) allowed for a CF (Clustering Feature) entry<br>max_entries: Maximum number of CF (Clustering Feature) entries that can be stored in the flat CF-Tree before it needs to be rebuilt<br>max_iter: Maximum number of iterations K-Means may run to converge |

    void test_get_data_by_birch()  {

        std::cout << "\nTesting get_data_by_birch( ) ..." << std::endl;

        ULDataFrame df;

        try  {
            df.read("FORD.csv", io_format::csv2);
        }
        catch (const DataFrameError &ex)  {
            std::cout << ex.what() << std::endl;
            ::exit(-1);
        }

        auto lbd =
            [](const unsigned long &, const double &) -> bool { return (true); };
        auto view =
            df.get_view_by_sel<double, decltype(lbd), double, long>("FORD_Open", lbd);

        // I am using both views and dataframes to make sure both work
        //
        auto views =
            view.get_view_by_birch<double, double, long>("FORD_Close", 4, 2.5);
        auto dfs =
            df.get_data_by_birch<double, double, long>("FORD_Close", 4, 2.5);

        assert(views.size() == 4);
        assert(dfs.size() == 4);

        assert(views[0].get_index().size() == 4367);
        assert(dfs[0].get_index().size() == 4367);
        assert(views[1].get_index().size() == 4450);
        assert(dfs[1].get_index().size() == 4450);
        assert(views[2].get_index().size() == 2575);
        assert(dfs[2].get_index().size() == 2575);
        assert(views[3].get_index().size() == 873);
        assert(dfs[3].get_index().size() == 873);

        assert((std::fabs(views[0].get_column<double>("FORD_Close")[7] - 2.08) < 0.01));
        assert((std::fabs(dfs[1].get_column<double>("FORD_Open")[15] - 6.91) < 0.01));
        assert((std::fabs(views[2].get_column<double>("FORD_High")[3] - 12.07) < 0.01));
        assert(dfs[2].get_column<long>("FORD_Volume")[0] == 7512900);
        assert(views[3].get_index()[1] == 6507);
    }
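
The threshold parameter is described as the maximum radius (standard deviation) allowed for a CF entry. The sketch below is one way to picture that rule for scalar data. It is not the library's implementation: the `ClusteringFeature` struct and its members are hypothetical names, and it assumes the textbook CF representation of a count, a linear sum, and a sum of squares.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical scalar Clustering Feature (illustration only, not library code).
// It summarizes a group of points as (count, linear sum, sum of squares).
struct ClusteringFeature {
    std::size_t n = 0;
    double linear_sum = 0.0;
    double square_sum = 0.0;

    // Mean of the points summarized by this entry
    double centroid() const {
        return n ? linear_sum / static_cast<double>(n) : 0.0;
    }

    // Radius as the standard deviation around the centroid:
    // sqrt(SS / N - (LS / N)^2)
    double radius() const {
        if (n == 0) return 0.0;
        const double c = centroid();
        const double var = square_sum / static_cast<double>(n) - c * c;
        return std::sqrt(var > 0.0 ? var : 0.0);  // clamp rounding noise
    }

    // Absorb x only if the resulting radius stays within the threshold.
    // Returns false (and leaves the entry unchanged) otherwise.
    bool try_absorb(double x, double threshold) {
        ClusteringFeature trial = *this;
        trial.n += 1;
        trial.linear_sum += x;
        trial.square_sum += x * x;
        if (trial.radius() > threshold) return false;
        *this = trial;
        return true;
    }
};
```

Under these assumptions, a smaller threshold makes entries reject more points, so more CF entries are created before the K-Means step; a larger threshold merges more points into each entry.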