
Signature Description Parameters
template<typename T, typename ... Ts>
std::vector<DataFrame>
get_data_by_birch(const char *col_name,
                  long k,
                  double threshold,
                  long max_entries = 1000,
                  long max_iter = 1000) const;
This uses the BIRCH algorithm to divide the named column into clusters. It returns a std::vector of DataFrames, each containing one of the clusters of data based on the named column.
The final step uses the K-Means algorithm to calculate the centroids and clusters, hence you must specify K.
This works for both scalar and multidimensional (i.e. vectors/arrays) data types.
Self is unchanged.

NOTE: Currently this only uses the default distance functions in the BIRCH visitor.
T: Type of the named column
Ts: List all the types of all data columns. A type should be specified in the list only once.
col_name: Name of the given column
k: Number of clusters used by K-Means algorithm
threshold: It is the maximum radius (standard deviation) allowed for a CF (Clustering Feature) entry
max_entries: It is the maximum number of CF (Clustering Feature) entries that can be stored in the flat CF-Tree before it needs to be rebuilt
max_iter: It is the maximum number of iterations K-Means is allowed to run if it has not converged earlier
template<typename T, typename ... Ts>
std::vector<PtrView>
get_view_by_birch(const char *col_name,
                  long k,
                  double threshold,
                  long max_entries = 1000,
                  long max_iter = 1000);
This is identical to get_data_by_birch() above, but:
  1. The result is a std::vector of views
  2. Since the result is a view, you cannot call make_consistent() on the result.
NOTE: There are certain operations that you cannot do with a view. For example, you cannot add/delete columns, etc.
T: Type of the named column
Ts: List all the types of all data columns. A type should be specified in the list only once.
col_name: Name of the given column
k: Number of clusters used by K-Means algorithm
threshold: It is the maximum radius (standard deviation) allowed for a CF (Clustering Feature) entry
max_entries: It is the maximum number of CF (Clustering Feature) entries that can be stored in the flat CF-Tree before it needs to be rebuilt
max_iter: It is the maximum number of iterations K-Means is allowed to run if it has not converged earlier
template<typename T, typename ... Ts>
std::vector<ConstPtrView>
get_view_by_birch(const char *col_name,
                  long k,
                  double threshold,
                  long max_entries = 1000,
                  long max_iter = 1000) const;
Same as the view version above, but it returns a std::vector of const views. You cannot change data through a const view. However, if the data is changed in the original DataFrame or through another view, the change is reflected in the const view.
T: Type of the named column
Ts: List all the types of all data columns. A type should be specified in the list only once.
col_name: Name of the given column
k: Number of clusters used by K-Means algorithm
threshold: It is the maximum radius (standard deviation) allowed for a CF (Clustering Feature) entry
max_entries: It is the maximum number of CF (Clustering Feature) entries that can be stored in the flat CF-Tree before it needs to be rebuilt
max_iter: It is the maximum number of iterations K-Means is allowed to run if it has not converged earlier
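
To make the threshold parameter more concrete, here is a minimal, library-independent sketch of how a CF (Clustering Feature) entry can track its radius and how a flat list of entries might absorb new points. It is not the BIRCH visitor used by DataFrame. The names CFEntry and cf_insert are hypothetical, the data is assumed to be scalar, and the distance is assumed to be the absolute difference to the centroid. Rebuilding the list once it exceeds max_entries is not shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 1-D Clustering Feature: point count, linear sum and
// sum of squares. These three numbers are enough to derive the
// centroid and the radius without keeping the points themselves.
struct CFEntry  {

    std::size_t n { 0 };
    double      linear_sum { 0.0 };
    double      square_sum { 0.0 };

    void add(double x)  { n += 1; linear_sum += x; square_sum += x * x; }

    double centroid() const  {

        assert(n > 0);
        return (linear_sum / double(n));
    }

    // Radius = standard deviation of the absorbed points around the centroid
    double radius() const  {

        const double    mean = centroid();
        const double    var = square_sum / double(n) - mean * mean;

        return (std::sqrt(var > 0.0 ? var : 0.0));
    }
};

// Hypothetical insertion rule: the closest entry absorbs x only if its
// radius would stay within threshold. Otherwise x starts a new entry.
inline void
cf_insert(std::vector<CFEntry> &entries, double x, double threshold)  {

    CFEntry *closest = nullptr;
    double  best_dist = 0.0;

    for (auto &entry : entries)  {
        const double    dist = std::fabs(entry.centroid() - x);

        if (! closest || dist < best_dist)  {
            closest = &entry;
            best_dist = dist;
        }
    }

    if (closest)  {
        CFEntry trial = *closest;  // Try the insertion on a copy first

        trial.add(x);
        if (trial.radius() <= threshold)  {
            *closest = trial;
            return;
        }
    }

    CFEntry fresh;

    fresh.add(x);
    entries.push_back(fresh);
}
```

With a threshold of 0.5, feeding the values 1.0, 1.2, 10.0 and 10.4 into an empty list produces two entries in this sketch. A smaller threshold yields more, tighter entries, and a larger one yields fewer, coarser entries for the final K-Means step to work with.
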
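#include <DataFrame/DataFrame.h>

#include <cassert>
#include <cmath>
#include <cstdlib>
#include <iostream>

using namespace hmdf;
using ULDataFrame = StdDataFrame<unsigned long>;

void test_get_data_by_birch()  {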

    std::cout << "\nTesting get_data_by_birch( ) ..." << std::endl;

    ULDataFrame df;

    try  {
        df.read("FORD.csv", io_format::csv2);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
        ::exit(-1);
    }

    auto    lbd = [](const unsigned long &, const double &) -> bool { return (true); };
    auto    view = df.get_view_by_sel<double, decltype(lbd), double, long>("FORD_Open", lbd);

    // I am using both views and dataframes to make sure both work
    //
    auto    views = view.get_view_by_birch<double, double, long>("FORD_Close", 4, 2.5);
    auto    dfs = df.get_data_by_birch<double, double, long>("FORD_Close", 4, 2.5);

    assert(views.size() == 4);
    assert(dfs.size() == 4);

    assert(views[0].get_index().size() == 4367);
    assert(dfs[0].get_index().size() == 4367);
    assert(views[1].get_index().size() == 4450);
    assert(dfs[1].get_index().size() == 4450);
    assert(views[2].get_index().size() == 2575);
    assert(dfs[2].get_index().size() == 2575);
    assert(views[3].get_index().size() == 873);
    assert(dfs[3].get_index().size() == 873);

    assert((std::fabs(views[0].get_column<double>("FORD_Close")[7] - 2.08) < 0.01));
    assert((std::fabs(dfs[1].get_column<double>("FORD_Open")[15] - 6.91) < 0.01));
    assert((std::fabs(views[2].get_column<double>("FORD_High")[3] - 12.07) < 0.01));
    assert(dfs[2].get_column<long>("FORD_Volume")[0] == 7512900);
    assert(views[3].get_index()[1] == 6507);
}
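
The calls above pass k = 4 and leave max_iter at its default. As a rough illustration of what those two parameters control, here is a minimal sketch of a plain (Lloyd's) K-Means loop on scalar data. It is not the K-Means code inside DataFrame. The function name kmeans_1d is hypothetical, the caller is assumed to supply the k starting centroids, and the distance is assumed to be the absolute difference.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Lloyd's K-Means on scalars.
// centroids holds the k starting centroids on entry and the final
// centroids on return. The result is the cluster label of each point.
inline std::vector<std::size_t>
kmeans_1d(const std::vector<double> &data,
          std::vector<double> &centroids,
          long max_iter)  {

    assert(! centroids.empty());

    const std::size_t           k = centroids.size();
    std::vector<std::size_t>    labels(data.size(), 0);

    for (long iter = 0; iter < max_iter; ++iter)  {
        // Assignment step: each point goes to its nearest centroid
        for (std::size_t i = 0; i < data.size(); ++i)  {
            std::size_t best = 0;

            for (std::size_t c = 1; c < k; ++c)
                if (std::fabs(data[i] - centroids[c]) <
                        std::fabs(data[i] - centroids[best]))
                    best = c;
            labels[i] = best;
        }

        // Update step: each centroid moves to the mean of its points
        std::vector<double>         sums(k, 0.0);
        std::vector<std::size_t>    counts(k, 0);

        for (std::size_t i = 0; i < data.size(); ++i)  {
            sums[labels[i]] += data[i];
            counts[labels[i]] += 1;
        }

        bool    moved = false;

        for (std::size_t c = 0; c < k; ++c)  {
            if (counts[c] == 0)  continue;  // Leave empty clusters alone

            const double    updated = sums[c] / double(counts[c]);

            if (std::fabs(updated - centroids[c]) > 1e-12)  moved = true;
            centroids[c] = updated;
        }

        // Converged: stop before max_iter is reached
        if (! moved)  break;
    }
    return (labels);
}
```

In this sketch max_iter is only an upper bound. The loop exits as soon as no centroid moves, which on well-separated data typically happens after a handful of iterations.
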

C++ DataFrame