| Signature | Description | Parameters |
|---|---|---|
| `template<arithmetic T, typename ... Ts> std::vector<DataFrame> get_data_by_affin(const char *col_name, std::function<double(const T &x, const T &y)> &&dfunc = [](const T &x, const T &y) -> double { return ((x - y) * (x - y)); }, size_type num_of_iter = 20, double damping_factor = 0.9) const;` | This uses the Affinity Propagation algorithm to divide the named column into clusters. It returns a std::vector of DataFrames, each containing one of the clusters of data based on the named column. Unlike K-Means clustering, you do not have to specify the number of clusters. Self is unchanged.<br>NOTE: This is a resource-consuming and relatively slow algorithm that might be too slow for large datasets. Its time complexity is O(I * n²), where I is the number of iterations and n is the number of data points in the named column. Its space complexity is O(2 * n²). Also, see get_data_by_kmeans() and get_view_by_kmeans().<br>NOTE: Type T must support arithmetic operations.<br>NOTE: If this returns zero centroids (zero DataFrames), it is probably because the number of iterations is too small for the algorithm to converge. | T: Type of the named column<br>Ts: The list of types for all columns. A type should be specified only once<br>col_name: Name of the data column<br>dfunc: A function to calculate the distance between two data points in the named column. The default is the squared difference<br>num_of_iter: Number of iterations for the AP clustering algorithm to converge. The default is 20<br>damping_factor: Damping applied to the algorithm's updates. The default is 0.9. (1 - damping_factor) prevents numerical oscillations |
| `template<arithmetic T, typename ... Ts> std::vector<PtrView> get_view_by_affin(const char *col_name, std::function<double(const T &x, const T &y)> &&dfunc = [](const T &x, const T &y) -> double { return ((x - y) * (x - y)); }, size_type num_of_iter = 20, double damping_factor = 0.9);` | This is identical to get_data_by_affin() above, but it returns a std::vector of views instead of DataFrames. | T: Type of the named column<br>Ts: The list of types for all columns. A type should be specified only once<br>col_name: Name of the data column<br>dfunc: A function to calculate the distance between two data points in the named column. The default is the squared difference<br>num_of_iter: Number of iterations for the AP clustering algorithm to converge. The default is 20<br>damping_factor: Damping applied to the algorithm's updates. The default is 0.9. (1 - damping_factor) prevents numerical oscillations |
| `template<arithmetic T, typename ... Ts> std::vector<ConstPtrView> get_view_by_affin(const char *col_name, std::function<double(const T &x, const T &y)> &&dfunc = [](const T &x, const T &y) -> double { return ((x - y) * (x - y)); }, size_type num_of_iter = 20, double damping_factor = 0.9) const;` | Same as the view version above, but it returns a std::vector of const views. You cannot change data in const views. However, if the data is changed in the original DataFrame or through another view, the change is reflected in the const view. | T: Type of the named column<br>Ts: The list of types for all columns. A type should be specified only once<br>col_name: Name of the data column<br>dfunc: A function to calculate the distance between two data points in the named column. The default is the squared difference<br>num_of_iter: Number of iterations for the AP clustering algorithm to converge. The default is 20<br>damping_factor: Damping applied to the algorithm's updates. The default is 0.9. (1 - damping_factor) prevents numerical oscillations |
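
The quadratic space cost and the role of `damping_factor` described in the table can be illustrated with a minimal sketch. This is not the library's implementation: `build_similarity` and `damped_update` are hypothetical helpers, the column type is fixed to `double`, and the damping formula is the textbook form (old value weighted by the damping factor, new value by one minus it), which is assumed here for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper, not part of the DataFrame API.
// Builds the dense n x n similarity matrix (row-major) that Affinity
// Propagation works on. Similarity is the negated distance, so closer
// points get a higher score.
inline std::vector<double>
build_similarity(
    const std::vector<double> &col,
    const std::function<double(const double &, const double &)> &dfunc)
{
    const std::size_t   n = col.size();
    // n * n doubles: this is where the quadratic space cost comes from
    std::vector<double> simil(n * n, 0.0);

    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            simil[i * n + j] = -dfunc(col[i], col[j]);
    return simil;
}

// Hypothetical helper, not part of the DataFrame API.
// Blends the previous value with the freshly computed one. A damping
// factor close to 1 lets only a small fraction of each update through,
// which is what keeps the values from swinging back and forth.
inline double
damped_update(double old_val, double new_val, double damping_factor)
{
    assert(damping_factor >= 0.0 && damping_factor < 1.0);
    return damping_factor * old_val + (1.0 - damping_factor) * new_val;
}
```

Under these assumptions, a damping factor of 0.9 moves a value only one tenth of the way toward its new estimate per iteration, which is one reason a small `num_of_iter` may not be enough to converge.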

```cpp
static void test_get_data_by_affin()  {

    std::cout << "\nTesting get_data_by_affin( ) ..." << std::endl;

    typedef StdDataFrame64<std::string> StrDataFrame;

    StrDataFrame    df;

    try  {
        df.read("SHORT_IBM.dat", io_format::binary);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }

    StrDataFrame    df2 = df;

    // The selector accepts every row, so the view covers the whole DataFrame
    auto    lbd =
        [](const std::string &, const double &) -> bool { return (true); };
    auto    view =
        df2.get_view_by_sel<double, decltype(lbd), double, long>
            ("IBM_Open", lbd);

    // Cluster on IBM_Close, using absolute difference as the distance
    auto    views =
        view.get_view_by_affin<double, double, long>
            ("IBM_Close",
             [](const double &x, const double &y) -> double  {
                 return (std::fabs(x - y));
             },
             25);  // Number of iterations

    assert(views.size() == 4);

    assert(views[0].get_index().size() == 157);
    assert(views[0].get_column<double>("IBM_Open").size() == 157);
    assert(views[0].get_index()[0] == "2014-10-21");
    assert(views[0].get_index()[156] == "2018-02-01");
    assert(views[0].get_column<double>("IBM_High")[140] == 162.899994);
    assert(views[0].get_column<long>("IBM_Volume")[100] == 2543100);

    assert(views[1].get_index().size() == 309);
    assert(views[1].get_column<double>("IBM_Open").size() == 309);
    assert(views[1].get_index()[0] == "2014-01-02");
    assert(views[1].get_index()[308] == "2018-01-18");
    assert(views[1].get_column<double>("IBM_High")[200] == 182.839996);
    assert(views[1].get_column<long>("IBM_Volume")[100] == 3721600);

    assert(views[2].get_index().size() == 256);
    assert(views[2].get_column<double>("IBM_Open").size() == 256);
    assert(views[2].get_index()[0] == "2014-11-20");
    assert(views[2].get_index()[255] == "2020-02-13");
    assert(views[2].get_column<double>("IBM_High")[200] == 156.800003);
    assert(views[2].get_column<long>("IBM_Volume")[100] == 2838100);

    assert(views[3].get_index().size() == 999);
    assert(views[3].get_column<double>("IBM_Open").size() == 999);
    assert(views[3].get_index()[0] == "2014-12-15");
    assert(views[3].get_index()[998] == "2020-10-30");
    assert(views[3].get_column<double>("IBM_High")[200] == 152.929993);
    assert(views[3].get_column<long>("IBM_Volume")[100] == 3924800);
}
```
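
The test above passes an absolute-difference lambda as `dfunc`, while the signatures default to a squared difference. The following sketch puts the two side by side, with `T` fixed to `double`. The names `dist_func`, `squared_diff`, and `abs_diff` are hypothetical and exist only for this illustration; they are not part of the library.

```cpp
#include <cmath>
#include <functional>

// Same shape as the dfunc parameter, with T fixed to double (assumption
// made for this example only).
using dist_func = std::function<double(const double &, const double &)>;

// Mirrors the default argument shown in the signatures: squared difference.
inline dist_func squared_diff()
{
    return [](const double &x, const double &y) -> double {
        return (x - y) * (x - y);
    };
}

// Mirrors the lambda passed in the test: absolute difference.
inline dist_func abs_diff()
{
    return [](const double &x, const double &y) -> double {
        return std::fabs(x - y);
    };
}
```

Squared difference weighs gaps larger than 1 more heavily, and gaps smaller than 1 less heavily, than absolute difference does, so the two choices can produce different clusters on the same column.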