
Signature Description Parameters
template<arithmetic T, typename ... Ts>
std::vector<DataFrame>
get_data_by_affin(const char *col_name,
                  std::function<double(const T &x, const T &y)> &&dfunc =
                      [](const T &x, const T &y) -> double  {
                          return ((x - y) * (x - y));
                      },
                  size_type num_of_iter = 20,
                  double damping_factor = 0.9) const;
This uses the Affinity Propagation algorithm to divide the named column into clusters. It returns a std::vector of DataFrames, each containing one of the clusters of data based on the named column. Unlike K-Means clustering, you do not have to specify the number of clusters.
Self is unchanged.

NOTE: This is a resource-consuming and relatively slow algorithm. Its time complexity is O(I * n^2), where I is the number of iterations and n is the number of data points in the named column. Its space complexity is O(2 * n^2).
NOTE: Type T must support arithmetic operations.
NOTE: This algorithm might be too slow for large datasets. See also get_[data|view]_by_kmeans().
NOTE: If this returns zero centroids (zero DataFrames), the number of iterations was probably too small for the algorithm to converge.
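To get a feel for the cost stated in the notes above, the following sketch estimates the memory footprint and the number of inner steps for a column of n values. It assumes that the O(2 * n^2) space corresponds to two dense n x n matrices of double; that is an assumption made for illustration, not a statement about the library's internals. The names affin_matrix_bytes and affin_step_count are hypothetical helpers and are not part of the DataFrame API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed for two dense n x n matrices of double.
// Assumption: O(2 * n^2) space means two such matrices.
constexpr std::uint64_t affin_matrix_bytes(std::uint64_t n)  {
    return (2ULL * n * n * sizeof(double));
}

// Hypothetical helper: order-of-magnitude step count for the given number
// of iterations, following the O(I * n^2) time complexity.
constexpr std::uint64_t
affin_step_count(std::uint64_t n, std::uint64_t num_of_iter)  {
    return (num_of_iter * n * n);
}
```

For example, with an 8-byte double, a column of 10,000 values would need about 1.6 GB under this assumption (2 * 10,000^2 * 8 bytes), and the default 20 iterations would take on the order of 2 billion steps.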
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
dfunc: A function to calculate the distance between two data points in the named column
num_of_iter: Number of iterations the AP clustering algorithm runs. It must be large enough for the algorithm to converge
damping_factor: Damping factor used in the algorithm's update steps. The default is 0.9. The (1 - damping_factor) term prevents numerical oscillations
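The dfunc parameter accepts any callable that matches std::function<double(const T &, const T &)>. As a minimal sketch, here are two candidate distance functions for T = double: the squared difference used as the default in the signature above, and the absolute difference. The names dist_func_t, sq_dist, abs_dist, and pick_dist are illustrative only and are not part of the library.

```cpp
#include <cmath>
#include <functional>

// Callable type that dfunc expects when T is double
using dist_func_t = std::function<double(const double &, const double &)>;

// Squared difference; same formula as the default in the signature above
inline double sq_dist(const double &x, const double &y)  {
    return ((x - y) * (x - y));
}

// Absolute difference; grows linearly with the gap between two values
inline double abs_dist(const double &x, const double &y)  {
    return (std::fabs(x - y));
}

// Both functions convert to the std::function type that dfunc takes
inline dist_func_t pick_dist(bool squared)  {
    return (squared ? dist_func_t(sq_dist) : dist_func_t(abs_dist));
}
```

Squared difference penalizes large gaps more heavily than absolute difference, so the choice of distance function can change which points end up in the same cluster.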
template<arithmetic T, typename ... Ts>
std::vector<PtrView>
get_view_by_affin(const char *col_name,
                  std::function<double(const T &x, const T &y)> &&dfunc =
                      [](const T &x, const T &y) -> double  {
                          return ((x - y) * (x - y));
                      },
                  size_type num_of_iter = 20,
                  double damping_factor = 0.9);
This is identical to get_data_by_affin() above, but:
  1. The result is a std::vector of views.
  2. Since the result is a view, you cannot call make_consistent() on the result.
NOTE: There are certain operations that you cannot do with a view. For example, you cannot add or delete columns.
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
dfunc: A function to calculate the distance between two data points in the named column
num_of_iter: Number of iterations the AP clustering algorithm runs. It must be large enough for the algorithm to converge
damping_factor: Damping factor used in the algorithm's update steps. The default is 0.9. The (1 - damping_factor) term prevents numerical oscillations
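In Affinity Propagation, damping is commonly applied by blending each newly computed message with its previous value. The sketch below shows that common update rule. It is an assumption about how damping_factor enters the computation, and the library's internal code may differ. The name damped_update is a hypothetical helper, not part of the library.

```cpp
#include <cassert>

// Blend the previous message value with the newly computed one.
// A damping_factor close to 1 keeps most of the old value, so the
// messages change slowly from one iteration to the next.
inline double
damped_update(double old_value, double new_value, double damping_factor)  {
    assert(damping_factor >= 0.0 && damping_factor < 1.0);
    return (damping_factor * old_value +
            (1.0 - damping_factor) * new_value);
}
```

With the default of 0.9, an old value of 0 and a new value of 10 give roughly 1, so only about a tenth of each change is applied per iteration. This is why a high damping factor may need more iterations to converge.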
template<arithmetic T, typename ... Ts>
std::vector<ConstPtrView>
get_view_by_affin(const char *col_name,
                  std::function<double(const T &x, const T &y)> &&dfunc =
                      [](const T &x, const T &y) -> double  {
                          return ((x - y) * (x - y));
                      },
                  size_type num_of_iter = 20,
                  double damping_factor = 0.9) const;
Same as the view version above, but it returns a std::vector of const views. You cannot change data in const views. But if the data is changed in the original DataFrame or through another view, the change is reflected in the const view.
T: Type of the named column
Ts: The list of types for all columns. A type should be specified only once
col_name: Name of the data column
dfunc: A function to calculate the distance between two data points in the named column
num_of_iter: Number of iterations the AP clustering algorithm runs. It must be large enough for the algorithm to converge
damping_factor: Damping factor used in the algorithm's update steps. The default is 0.9. The (1 - damping_factor) term prevents numerical oscillations
static void test_get_data_by_affin()  {

    std::cout << "\nTesting get_data_by_affin( ) ..." << std::endl;

    typedef StdDataFrame64<std::string> StrDataFrame;

    StrDataFrame    df;

    try  {
        df.read("SHORT_IBM.dat", io_format::binary);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }

    StrDataFrame    df2 = df;

    //  The selector accepts every row, so the view below spans all of df2
    auto    lbd = [](const std::string &, const double &) -> bool { return (true); };
    auto    view = df2.get_view_by_sel<double, decltype(lbd), double, long>("IBM_Open", lbd);

    auto    views =
        view.get_view_by_affin<double, double, long>("IBM_Close",
                                                     [](const double &x, const double &y) -> double {
                                                         return (std::fabs(x - y));
                                                     },
                                                     25);  //  Number of iterations

    assert(views.size() == 4);

    assert(views[0].get_index().size() == 157);
    assert(views[0].get_column<double>("IBM_Open").size() == 157);
    assert(views[0].get_index()[0] == "2014-10-21");
    assert(views[0].get_index()[156] == "2018-02-01");
    assert(views[0].get_column<double>("IBM_High")[140] == 162.899994);
    assert(views[0].get_column<long>("IBM_Volume")[100] == 2543100);

    assert(views[1].get_index().size() == 309);
    assert(views[1].get_column<double>("IBM_Open").size() == 309);
    assert(views[1].get_index()[0] == "2014-01-02");
    assert(views[1].get_index()[308] == "2018-01-18");
    assert(views[1].get_column<double>("IBM_High")[200] == 182.839996);
    assert(views[1].get_column<long>("IBM_Volume")[100] == 3721600);

    assert(views[2].get_index().size() == 256);
    assert(views[2].get_column<double>("IBM_Open").size() == 256);
    assert(views[2].get_index()[0] == "2014-11-20");
    assert(views[2].get_index()[255] == "2020-02-13");
    assert(views[2].get_column<double>("IBM_High")[200] == 156.800003);
    assert(views[2].get_column<long>("IBM_Volume")[100] == 2838100);

    assert(views[3].get_index().size() == 999);
    assert(views[3].get_column<double>("IBM_Open").size() == 999);
    assert(views[3].get_index()[0] == "2014-12-15");
    assert(views[3].get_index()[998] == "2020-10-30");
    assert(views[3].get_column<double>("IBM_High")[200] == 152.929993);
    assert(views[3].get_column<long>("IBM_Volume")[100] == 3924800);
}

C++ DataFrame