Back to Documentations

Signature Description
enum class impurity_type : unsigned char  {
    // Measure how often a randomly chosen element from the set would be
    // incorrectly labeled
    // Gini = 1 - ∑Pi2
    //
    gini_index = 1,

    // Average level of "information", "surprise", or "uncertainty" inherent
    // to the variable's possible outcomes.
    // Entropy = -∑Pi log2(Pi)
    //
    info_entropy = 2,
};
Impurity type

Signature Description Parameters
#include <DataFrame/DataFrameMLVisitors.h>

template<typename T, typename I = unsigned long,
         std::size_t A = 0>
struct EntropyVisitor;

// -------------------------------------

template<typename T, typename I = unsigned long,
         std::size_t A = 0>
using ent_v = EntropyVisitor<T, I, A>;
This is a "single action visitor", meaning it is passed the whole data vector in one call and you must use the single_act_visit() interface.

This visitor calculates the rolling values of information entropy
In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes. The concept of information entropy was introduced by Claude Shannon in his 1948 paper.
A die has higher entropy (p=1/6) versus a coin (p=1/2).
This method is an approximation and not as computationally intensive as the method below.

This works with both scalar and multidimensional (i.e. vectors or arrays) datasets.

get_result() returns the vector of info entropy. In case of a multidimensional column, it returns a vector of vectors. Each inner vector is the length of data dimension. The first "roll_count - 1" items in the result are NaN. Also, the last few items in the result carry less information because they get shorter data blocks.

explicit
EntropyVisitor(std::size_t roll_count, double log_base = 2);
        
T: Column data type.
I: Index type.
A: Memory alignment boundary for vectors. Default is system default alignment
#include <DataFrame/DataFrameMLVisitors.h>

template<typename T, typename I = unsigned long,
         std::size_t A = 0>
struct ImpurityVisitor;

// -------------------------------------

template<typename T, typename I = unsigned long,
         std::size_t A = 0>
using impu_v = ImpurityVisitor<T, I, A>;
This is a "single action visitor", meaning it is passed the whole data vector in one call and you must use the single_act_visit() interface.

This visitor calculates the rolling impurity of the data column.
Decision trees recursively split features with regard to their target variable's purity. The algorithm is designed to find the optimal point of the most predictive feature in order to split 1 dataset into 2. These 2 new datasets' target variable will be more pure than the original dataset's.
"Pure" is the key word here, however. What does that word mean, exactly? In a general sense "purity" can be thought of as how homogenized a group is. But homogeneity can mean different things depending on which mathematical backbone your decision tree runs on. The 2 most popular backbones for decision tree's decisions are Gini Index and Information Entropy.

This works with both scalar and multidimensional (i.e. vectors or arrays) datasets.
NOTE: In case of MD input, input is treated as multidimensional vectors of categorical tokens, not numeric vectors. So the input columns are labels not points in a n-dimensional space. See code sample below.

This method is computationally intensive. It dynamically finds the probability of each feature during each rolling period and applies the formula for the selected method.
Type T must be hash-able

ImpurityVisitor(size_type roll_count, impurity_type it);
        
T: Column data type.
I: Index type.
A: Memory alignment boundary for vectors. Default is system default alignment
static void test_EntropyVisitor()  {

    std::cout << "\nTesting EntropyVisitor{  } ..." << std::endl;

    StlVecType<unsigned long>  idx =
        { 123450, 123451, 123452, 123453, 123454, 123455, 123456, 123457, 123458, 123459, 123460, 123461, 123462, 123466, 123467, 123468,
          123469, 123470, 123471, 123472, 123473, 22, 23, 24, 25, 26, 27, 28
        };
    StlVecType<double>         close =
        { 1.80, 2.80, 1.90, 14.00, 1.10, 6.00, 13.00, 8.00, 9.00, 2.80, 1.90, 4.30, 20.00, 1.85, 3.00, 34.00, 67.00, 23.00, 87.00, 9.00, 45.00,
          1.00, 11.00, 456.00, 34.00, 7.00, 7778.00, 5.00
        };
    MyDataFrame                 df;

    df.load_data(std::move(idx), std::make_pair("close", close));

    EntropyVisitor<double, unsigned long, 64>  e_v (5);

    df.single_act_visit<double>("close", e_v);

    assert(e_v.get_result().size() == 28);
    assert(std::isnan(e_v.get_result()[0]));
    assert(std::isnan(e_v.get_result()[3]));
    assert(std::abs(e_v.get_result()[4] - 2.18974) < 0.00001);
    assert(std::abs(e_v.get_result()[6] - 1.98477) < 0.00001);
    assert(std::abs(e_v.get_result()[10] - 1.7154) < 0.0001);
    assert(std::abs(e_v.get_result()[23] - 0.596666) < 0.00001);
    assert(std::abs(e_v.get_result()[21] - 0.822228) < 0.00001);
    assert(std::abs(e_v.get_result()[18] - 1.49397) < 0.0001);
    assert(std::abs(e_v.get_result()[26] - 0.08568) < 0.0001);
    assert(std::abs(e_v.get_result()[27] - 0.00646) < 0.0001);

    // Now multidimensional data
    //
    constexpr std::size_t   dim { 3 };

    using ary_col_t = std::array<double, dim>;
    using vec_col_t = std::vector<double>;

    StlVecType<ary_col_t>   ary_md_x  {
        { 1.0, 2.0, 1.0 }, { 2.0, 1.0, 3.0 }, { 3.0, 4.0, 2.0 }, { 2.0, 1.0, 2.0 }, { 1.0, 3.0, 1.0 }, { 4.0, 2.0, 4.0 },
        { 2.0, 1.0, 3.0 }, { 3.0, 3.0, 2.0 }, { 1.0, 2.0, 1.0 }, { 2.0, 4.0, 3.0 },
    };
    StlVecType<vec_col_t>   vec_md_x  {
        { 1.0, 2.0, 1.0 }, { 2.0, 1.0, 3.0 }, { 3.0, 4.0, 2.0 }, { 2.0, 1.0, 2.0 }, { 1.0, 3.0, 1.0 }, { 4.0, 2.0, 4.0 },
        { 2.0, 1.0, 3.0 }, { 3.0, 3.0, 2.0 }, { 1.0, 2.0, 1.0 }, { 2.0, 4.0, 3.0 },
    };

    df.load_column<ary_col_t>("ARY MD COL", std::move(ary_md_x), nan_policy::dont_pad_with_nans);
    df.load_column<vec_col_t>("VEC MD COL", std::move(vec_md_x), nan_policy::dont_pad_with_nans);

    EntropyVisitor<ary_col_t, unsigned long, 64>    ary_e_v (4);
    EntropyVisitor<vec_col_t, unsigned long, 64>    vec_e_v (4);

    df.single_act_visit<ary_col_t>("ARY MD COL", ary_e_v);
    df.single_act_visit<vec_col_t>("VEC MD COL", vec_e_v);

    const auto  &ary_result { ary_e_v.get_result() };
    const auto  &vec_result { vec_e_v.get_result() };

    assert(ary_result.size() == 10);
    for (const auto &vec : ary_result)  {
        assert(vec.size() == dim);
    }
    assert(vec_result.size() == 10);
    for (const auto &vec : vec_result)  {
        assert(vec.size() == dim);
    }

    assert(std::isnan(ary_result[0][0]));
    assert(std::isnan(ary_result[0][2]));
    assert(std::isnan(ary_result[2][2]));
    assert(std::isnan(ary_result[2][2]));
    assert(std::abs(ary_result[3][0] - 1.88598) < 0.00001);
    assert(std::abs(ary_result[3][1] - 1.76876) < 0.00001);
    assert(std::abs(ary_result[6][1] - 1.95814) < 0.00001);
    assert(std::abs(ary_result[6][2] - 1.84599) < 0.00001);

    assert(std::isnan(vec_result[0][0]));
    assert(std::isnan(vec_result[0][2]));
    assert(std::isnan(vec_result[2][2]));
    assert(std::isnan(ary_result[2][2]));
    assert(std::abs(vec_result[3][0] - 1.88598) < 0.00001);
    assert(std::abs(vec_result[3][1] - 1.76876) < 0.00001);
    assert(std::abs(vec_result[6][1] - 1.95814) < 0.00001);
    assert(std::abs(vec_result[6][2] - 1.84599) < 0.00001);
}

// -----------------------------------------------------------------------------

static void test_ImpurityVisitor()  {

    std::cout << "\nTesting ImpurityVisitor{  } ..." << std::endl;

    StlVecType<unsigned long>   idx =
        { 123450, 123451, 123452, 123453, 123454, 123455, 123456, 123457, 123458, 123459, 123460, 123461, 123462, 123466,
          123467, 123468, 123469, 123470, 123471, 123472, 123473 };
    StlVecType<std::string>     metal = { "Gold", "Gold", "Gold", "Gold" };
    StlVecType<std::string>     metal2 = { "Gold", "Silver", "Silver", "Gold" };
    StlVecType<double>          numbers = { 2.5, 2.5, 2.5, -0.1, -1.1, -0.1, -1.1, -1.1, -0.1, 34.5, -1.1, 34.5, 34.5, 34.5, 0.123, 0.123, 0.123, 0.5, 0.4, 2.5, 0.5 };
    MyDataFrame                 df;

    df.load_data(std::move(idx), std::make_pair("Numbers", numbers));
    df.load_column("Metals", std::move(metal), nan_policy::dont_pad_with_nans);
    df.load_column("Metals2", std::move(metal2), nan_policy::dont_pad_with_nans);

    impu_v<std::string> impu (4, impurity_type::gini_index);
    const auto          result = df.single_act_visit<std::string>("Metals", impu).get_result();

    assert(result.size() == 1);
    assert(result[0] == 0);

    impu_v<std::string> impu2 (4, impurity_type::info_entropy);
    const auto          result2 = df.single_act_visit<std::string>("Metals", impu2).get_result();

    assert(result2.size() == 1);
    assert(result2[0] == 0);

    impu_v<std::string> impu3 (4, impurity_type::gini_index);
    const auto          result3 = df.single_act_visit<std::string>("Metals2", impu3).get_result();

    assert(result3.size() == 1);
    assert(result3[0] == 0.5);

    impu_v<std::string> impu4 (4, impurity_type::info_entropy);
    const auto          result4 = df.single_act_visit<std::string>("Metals2", impu4).get_result();

    assert(result4.size() == 1);
    assert(result4[0] == 1.0);

    impu_v<double>  impu5 (3, impurity_type::gini_index);
    const auto      result5 = df.single_act_visit<double>("Numbers", impu5).get_result();

    assert(result5.size() == 19);
    assert(result5[0] == 0);
    assert(fabs(result5[1] - 0.4444) < 0.0001);
    assert(fabs(result5[2] - 0.6667) < 0.0001);
    assert(fabs(result5[3] - 0.4444) < 0.0001);
    assert(fabs(result5[4] - 0.4444) < 0.0001);
    assert(fabs(result5[18] - 0.6667) < 0.0001);
    assert(fabs(result5[17] - 0.6667) < 0.0001);
    assert(fabs(result5[16] - 0.6667) < 0.0001);
    assert(fabs(result5[15] - 0.4444) < 0.0001);
    assert(result5[14] == 0);
    assert(fabs(result5[13] - 0.4444) < 0.0001);

    impu_v<double>  impu6 (3, impurity_type::info_entropy);
    const auto      &result6 = df.single_act_visit<double>("Numbers", impu6).get_result();

    assert(result6.size() == 19);
    assert(result6[0] == 0);
    assert(fabs(result6[1] - 0.9183) < 0.0001);
    assert(fabs(result6[2] - 1.585) < 0.0001);
    assert(fabs(result6[3] - 0.9183) < 0.0001);
    assert(fabs(result6[4] - 0.9183) < 0.0001);
    assert(fabs(result6[18] - 1.585) < 0.0001);
    assert(fabs(result6[17] - 1.585) < 0.0001);
    assert(fabs(result6[16] - 1.585) < 0.0001);
    assert(fabs(result6[15] - 0.9183) < 0.0001);
    assert(result6[14] == 0);
    assert(fabs(result6[13] - 0.9183) < 0.0001);

    // Now multidimensional data
    // NOTE: I am treating multidimensional vectors as categorical tokens,
    //       not numeric vectors. So the input columns are labels not points
    //       in a 3-dimensional space.
    //
    constexpr std::size_t   dim { 3 };

    using ary_col_t = std::array<double, dim>;
    using vec_col_t = std::vector<double>;

    StlVecType<ary_col_t>   ary_md_x  {
        { 1.0, 2.0, 3.0 }, { 4.5, 3.0, 2.5 }, { 1.0, 2.0, 3.0 }, { 4.0, 2.5, 2.0 }, { 1.5, 4.0, 1.76 }, { 4.0, 2.5, 2.0 },
        { 2.55, 3.16, 2.001 }, { 1.5, 4.0, 1.76 }, { 2.55, 3.16, 2.001 }, { 4.5, 3.0, 2.5 },
    };
    StlVecType<vec_col_t>   vec_md_x  {
        { 1.0, 2.0, 3.0 }, { 4.5, 3.0, 2.5 }, { 1.0, 2.0, 3.0 }, { 4.0, 2.5, 2.0 }, { 1.5, 4.0, 1.76 }, { 4.0, 2.5, 2.0 },
        { 2.55, 3.16, 2.001 }, { 1.5, 4.0, 1.76 }, { 2.55, 3.16, 2.001 }, { 4.5, 3.0, 2.5 },
    };

    df.load_column<ary_col_t>("ARY MD", std::move(ary_md_x), nan_policy::dont_pad_with_nans);
    df.load_column<vec_col_t>("VEC MD", std::move(vec_md_x), nan_policy::dont_pad_with_nans);

    impu_v<ary_col_t>   md_ien_ary { 3, impurity_type::info_entropy };
    impu_v<vec_col_t>   md_ien_vec { 3, impurity_type::info_entropy };
    impu_v<ary_col_t>   md_gix_ary { 3, impurity_type::gini_index };
    impu_v<vec_col_t>   md_gix_vec { 3, impurity_type::gini_index };

    df.single_act_visit<ary_col_t>("ARY MD", md_ien_ary);
    df.single_act_visit<vec_col_t>("VEC MD", md_ien_vec);
    df.single_act_visit<ary_col_t>("ARY MD", md_gix_ary);
    df.single_act_visit<vec_col_t>("VEC MD", md_gix_vec);

    assert(md_ien_ary.get_result().size() == 8);
    assert(md_ien_vec.get_result().size() == 8);
    assert(md_gix_ary.get_result().size() == 8);
    assert(md_gix_vec.get_result().size() == 8);

    assert(std::fabs(md_ien_ary.get_result()[0] - 0.918296) < 0.000001);
    assert(std::fabs(md_ien_ary.get_result()[2] - 1.58496) < 0.00001);
    assert(std::fabs(md_ien_ary.get_result()[3] - 0.918296) < 0.000001);
    assert(std::fabs(md_ien_ary.get_result()[7] - 1.58496) < 0.00001);

    assert(std::fabs(md_ien_vec.get_result()[0] - 0.918296) < 0.000001);
    assert(std::fabs(md_ien_vec.get_result()[2] - 1.58496) < 0.00001);
    assert(std::fabs(md_ien_vec.get_result()[3] - 0.918296) < 0.000001);
    assert(std::fabs(md_ien_vec.get_result()[7] - 1.58496) < 0.00001);

    assert(std::fabs(md_gix_ary.get_result()[0] - 0.444444) < 0.000001);
    assert(std::fabs(md_gix_ary.get_result()[2] - 0.666667) < 0.000001);
    assert(std::fabs(md_gix_ary.get_result()[3] - 0.444444) < 0.000001);
    assert(std::fabs(md_gix_ary.get_result()[7] - 0.666667) < 0.000001);

    assert(std::fabs(md_gix_vec.get_result()[0] - 0.444444) < 0.000001);
    assert(std::fabs(md_gix_vec.get_result()[2] - 0.666667) < 0.000001);
    assert(std::fabs(md_gix_vec.get_result()[3] - 0.444444) < 0.000001);
    assert(std::fabs(md_gix_vec.get_result()[7] - 0.666667) < 0.000001);
}

C++ DataFrame