```cpp
struct  ReadSchema  {

    using ColNameType = String64;

    ColNameType  col_name { DF_INDEX_COL_NAME };
    file_dtypes  col_type { file_dtypes::ULONG };

    // This is the number of records in the given column. The record count in
    // the header or schema exists so that memory for the given column can be
    // allocated efficiently and only once. In other words, the count doesn't
    // need to be accurate. It could be zero, an approximation, or exact. Of
    // course, if it is not accurate you may allocate more memory than you
    // need, or allocate multiple times.
    //
    std::size_t  num_rows { 0 };

    // 0-based index of columns, starting with the index column at 0.
    // Regular columns start at 1.
    //
    int          col_idx { -1 };
};

// Parameters to the read() function of DataFrame
//
struct  ReadParams  {

    using SchemaVec = std::vector<ReadSchema>;

    // If true, it only reads the data columns and skips the index column
    //
    bool         columns_only { false };

    // Start reading data from this row number.
    // It only applies to csv2 and binary formats
    //
    std::size_t  starting_row { 0 };

    // Only read this many rows of data.
    // It only applies to csv2 and binary formats
    //
    std::size_t  num_rows { std::numeric_limits<std::size_t>::max() };

    // These are only considered in csv2 format. They are ignored in all
    // other formats.
    //
    // If schema is nonempty, it indicates that the caller wants to read a
    // csv file that was not generated by C++ DataFrame. The schema must
    // contain the relevant entries for each column of data. The first entry
    // in schema must be the index column. The schema must have an entry for
    // each column in the file, in the order they appear in the file. All
    // entries in the ReadSchema struct must be set, again in the order they
    // appear in the file. This also allows the user to skip column(s) in the
    // file they don't want to read into a DataFrame.
    //
    // skip_first_line is checked only after it is determined that schema is
    // nonempty. skip_first_line means the first line of the file is a header
    // that was not generated by C++ DataFrame and must be skipped.
    //
    bool         skip_first_line { true };
    SchemaVec    schema { };

    // This only applies to csv and csv2 formats. It specifies the delimiting
    // (separating) character.
    //
    char         delim { ',' };
};
```
Parameters to the read() functions below. |
| Signature | Description | Parameters |
|---|---|---|
bool read(const char *file_name, io_format iof = io_format::csv, const ReadParams params = { }); |
It inputs the contents of a text/binary file/stream into itself (i.e. the DataFrame). Currently 4 formats (i.e. csv, csv2, json, binary) are supported. See the io_format documentation page. NOTE: If the DataFrame that is reading the file already has data columns, the file data will be added to the existing DataFrame columns. If the file has a data column with the same name and type as a column in the DataFrame, the file data will replace the existing data column in the DataFrame. If the file has a data column with the same name but a different type than a column in the DataFrame, the behavior is undefined. Obviously, if the DataFrame is empty, none of this matters. ----------------------------------------------- CSV file format must be:
INDEX:&lt;Number of data points&gt;:&lt;Comma delimited list of values&gt;
&lt;Column1 name&gt;:&lt;Number of data points&gt;:&lt;Column1 type&gt;:&lt;Comma delimited list of values&gt;
&lt;Column2 name&gt;:&lt;Number of data points&gt;:&lt;Column2 type&gt;:&lt;Comma delimited list of values&gt;
. . .
All empty lines or lines starting with # will be skipped. For examples, see the files in the test directory. ----------------------------------------------- CSV2 file format is like Pandas' csv format. In general, there are two ways to read a csv2 file: one with a user-provided schema (see the code sample below), the other with the schema in the file. With the schema in the file, the csv2 file is not compatible with Pandas' csv files. The csv2 format with the schema in the file must be:
INDEX:&lt;Number of data points&gt;:&lt;Index type&gt;:,&lt;Column1 name&gt;:&lt;Number of data points&gt;:&lt;Column1 type&gt;,&lt;Column2 name&gt;:&lt;Number of data points&gt;:&lt;Column2 type&gt;, . . .
Comma delimited rows of values
. . .
All empty lines or lines starting with # will be skipped. In CSV2 format it is more efficient to call read() with a filename instead of opening the file yourself and passing a stream reference. With a name, DataFrame opens the file and sets up the read buffers in the most efficient way.
NOTE: Only in CSV2 and binary formats can you specify starting_row and num_rows. This way you can read very large files (that don't fit into memory) in chunks and process them. In this case reading starts at starting_row and continues until either num_rows rows are read or EOF is reached. ----------------------------------------------- JSON file format looks like this:
{
"INDEX":{"N":3,"T":"ulong","D":[123450,123451,123452]},
"col_3":{"N":3,"T":"double","D":[15.2,16.34,17.764]},
"col_4":{"N":3,"T":"int","D":[22,23,24]},
"col_str":{"N":3,"T":"string","D":["11","22","33"]},
"col_2":{"N":3,"T":"double","D":[8,9.001,10]},
"col_1":{"N":3,"T":"double","D":[1,2,3.456]}
}
Please note DataFrame json does not follow the json spec 100%. In json, there is no particular order among dictionary fields, but in DataFrame json the order matters: the index object must come first, and within each column object the fields must appear in the order N (number of items), T (type), D (data), as in the example above.
Binary format is a proprietary format that is optimized for compression algorithms. It also takes care of different endianness. The file is always written with the endianness of the writing host, but it will be adjusted accordingly when it is read on a host with a different endianness. Binary format is, by far, the fastest way to read and write large files. NOTE: Only in CSV2 and binary formats can you specify starting_row and num_rows. This way you can read very large files (that don't fit into memory) in chunks and process them. In this case reading starts at starting_row and continues until either num_rows rows are read or the end of the column is reached. ----------------------------------------------- In all formats the following data types are supported:
float -- float
double -- double
longdouble -- long double
short -- short int
ushort -- unsigned short int
int -- int
uint -- unsigned int
long -- long int
longlong -- long long int
ulong -- unsigned long int
ulonglong -- unsigned long long int
char -- char
uchar -- unsigned char
string -- std::string
string -- const char *
string -- char *
vstr32 -- Fixed-size string of 31 char length
vstr64 -- Fixed-size string of 63 char length
vstr128 -- Fixed-size string of 127 char length
vstr512 -- Fixed-size string of 511 char length
vstr1K -- Fixed-size string of 1023 char length
vstr2K -- Fixed-size string of 2047 char length
bool -- bool
DateTime -- DateTime data in format of
<Epoch seconds>.<nanoseconds>
(1516179600.874123908)
In case of csv2, csv, and binary the following additional types are also supported:
str_dbl_pair -- std::pair<std::string, double>.
The pair is printed as "&lt;s:d&gt;,&lt;s:d&gt;, ..."
Where s's are strings and d's are doubles.
str_str_pair -- std::pair<std::string, std::string>.
The pair is printed as "&lt;s1:s2&gt;,&lt;s1:s2&gt;, ..."
Where s's are strings.
dbl_dbl_pair -- std::pair<double, double>.
The pair is printed as "&lt;d1:d2&gt;,&lt;d1:d2&gt;, ..."
Where d's are doubles.
dbl_vec -- std::vector<double>.
The vector is printed as "s[d1|d2|...]"
where s is the size of the vector and
d's are the double values.
str_vec -- std::vector<std::string>.
The vector is printed as "s[str1|str2|...]"
where s is the size of the vector
and str's are the strings.
dbl_set -- std::set<double>.
The set is printed as "s[d1|d2|...]"
where s is the size of the set
and d's are the double values.
str_set -- std::set<std::string>.
The set is printed as "s[str1|str2|...]"
where s is the size of the set
and str's are the strings.
str_dbl_map -- std::map<std::string, double>.
The map is printed
as "s{k1:v1|k2:v2|...}"
where s is the size of the map
and k's and v's are keys and values.
str_dbl_unomap -- std::unordered_map&lt;std::string, double&gt;.
The map is printed as "s{k1:v1|k2:v2|...}"
where s is the size of the map and k's
and v's are keys and values.
In case of csv2 the following additional types are also supported:
DateTimeAME -- American style (MM/DD/YYYY HH:MM:SS.mmm)
DateTimeEUR -- European style (YYYY/MM/DD HH:MM:SS.mmm)
DateTimeISO -- ISO style (YYYY-MM-DD HH:MM:SS.mmm)
|
file_name: Complete path to the file. iof: Specifies the I/O format; the default is CSV. params: The parameter structure specified above. |
template<typename S> bool read(S &in_s, io_format iof = io_format::csv, const ReadParams params = { }); |
Same as read() above, but takes a reference to a stream. NOTE: It is more efficient to let DataFrame open the file by passing a filename. |
|
std::future<bool> read_async(const char *file_name, io_format iof = io_format::csv, const ReadParams params = { }); |
Same as read() above, but executed asynchronously | |
template<typename S> std::future<bool> read_async(S &in_s, io_format iof = io_format::csv, const ReadParams params = { }); |
Same as read_async() above, but takes a reference to a stream |
```cpp
static void test_read()  {

    std::cout << "\nTesting read() ..." << std::endl;

    MyDataFrame df_read;

    try  {
        std::future<bool>   fut2 = df_read.read_async("sample_data.csv");

        fut2.get();
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
    df_read.write<std::ostream, int, unsigned long, double, std::string, bool>(std::cout);

    StdDataFrame<std::string>   df_read_str;

    try  {
        df_read_str.read("sample_data_string_index.csv");
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
    df_read_str.write<std::ostream, int, unsigned long, double, std::string, bool>(std::cout);

    StdDataFrame<DateTime>  df_read_dt;

    try  {
        df_read_dt.read("sample_data_dt_index.csv");
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
    df_read_dt.write<std::ostream, int, unsigned long, double, std::string, bool>(std::cout);
}

// -----------------------------------------------------------------------------

static void test_io_format_csv2()  {

    std::cout << "\nTesting io_format_csv2( ) ..." << std::endl;

    std::vector<unsigned long>  ulgvec2 =
        { 123450, 123451, 123452, 123450, 123455, 123450, 123449,
          123450, 123451, 123450, 123452, 123450, 123455, 123450,
          123454, 123450, 123450, 123457, 123458, 123459, 123450,
          123441, 123442, 123432, 123450, 123450, 123435, 123450 };
    std::vector<unsigned long>  xulgvec2 = ulgvec2;
    std::vector<int>            intvec2 =
        { 1, 2, 3, 4, 5, 3, 7, 3, 9, 10, 3, 2, 3, 14, 2, 2, 2, 3,
          2, 3, 3, 3, 3, 3, 36, 2, 45, 2 };
    std::vector<double>         xdblvec2 =
        { 1.2345, 2.2345, 3.2345, 4.2345, 5.2345, 3.0, 0.9999, 10.0,
          4.25, 0.009, 8.0, 2.2222, 3.3333, 11.0, 5.25, 1.009, 2.111,
          9.0, 3.2222, 4.3333, 12.0, 6.25, 2.009, 3.111, 10.0, 4.2222,
          5.3333 };
    std::vector<double>         dblvec22 =
        { 0.998, 0.3456, 0.056, 0.15678, 0.00345, 0.923, 0.06743, 0.1,
          0.0056, 0.07865, 0.0111, 0.1002, -0.8888, 0.14, 0.0456,
          0.078654, -0.8999, 0.8002, -0.9888, 0.2, 0.1056, 0.87865,
          -0.6999, 0.4111, 0.1902, -0.4888 };
    std::vector<std::string>    strvec2 =
        { "4% of something", "Description 4/5", "This is bad",
          "3.4% of GDP", "Market drops", "Market pulls back",
          "$15 increase", "Running fast", "C++14 development",
          "Some explanation", "More strings", "Bonds vs. Equities",
          "Almost done", "XXXX04", "XXXX2", "XXXX3", "XXXX4",
          "XXXX4", "XXXX5", "XXXX6", "XXXX7", "XXXX10", "XXXX11",
          "XXXX02", "XXXX03" };
    std::vector<bool>           boolvec =
        { true, true, true, false, false, true };

    MyDataFrame df;

    df.load_data(std::move(ulgvec2), std::make_pair("ul_col", xulgvec2));
    df.load_column("xint_col", std::move(intvec2), nan_policy::dont_pad_with_nans);
    df.load_column("str_col", std::move(strvec2), nan_policy::dont_pad_with_nans);
    df.load_column("dbl_col", std::move(xdblvec2), nan_policy::dont_pad_with_nans);
    df.load_column("dbl_col_2", std::move(dblvec22), nan_policy::dont_pad_with_nans);
    df.load_column("bool_col", std::move(boolvec), nan_policy::dont_pad_with_nans);

    df.write<std::ostream, int, unsigned long, double, bool, std::string>(std::cout, false, io_format::csv2);

    MyDataFrame df_read;

    try  {
        df_read.read("csv2_format_data.csv", io_format::csv2);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
    }
    df_read.write<std::ostream, int, unsigned long, double, bool, std::string>(std::cout, false, io_format::csv2);
}

// -----------------------------------------------------------------------------

static void test_DT_IBM_data()  {

    std::cout << "\nTesting DT_IBM_data( ) ..." << std::endl;

    typedef StdDataFrame<DateTime> DT_DataFrame;

    DT_DataFrame    df;

    df.read("DT_IBM.csv", io_format::csv2);

    assert(df.get_column<double>("IBM_Open")[0] == 98.4375);
    assert(df.get_column<double>("IBM_Close")[18] == 97.875);
    assert(df.get_index()[18] == DateTime(20001128));
    assert(fabs(df.get_column<double>("IBM_High")[5030] - 111.8) < 0.001);
    assert(df.get_column<long>("IBM_Volume")[5022] == 21501100L);
    assert(df.get_index()[5020] == DateTime(20201016));
}

// -----------------------------------------------------------------------------

static void test_reading_in_chunks()  {

    std::cout << "\nTesting reading_in_chunks( ) ..." << std::endl;

    try  {
        StrDataFrame    df1;

        df1.read("SHORT_IBM.csv", io_format::csv2, { .starting_row = 0, .num_rows = 10 });
        assert(df1.get_index().size() == 10);
        assert(df1.get_column<double>("IBM_Close").size() == 10);
        assert(df1.get_index()[0] == "2014-01-02");
        assert(df1.get_index()[9] == "2014-01-15");
        assert(fabs(df1.get_column<double>("IBM_Close")[0] - 185.53) < 0.0001);
        assert(fabs(df1.get_column<double>("IBM_Close")[9] - 187.74) < 0.0001);

        StrDataFrame    df2;

        df2.read("SHORT_IBM.csv", io_format::csv2, { .starting_row = 800, .num_rows = 10 });
        assert(df2.get_index().size() == 10);
        assert(df2.get_column<double>("IBM_Close").size() == 10);
        assert(df2.get_index()[0] == "2017-03-08");
        assert(df2.get_index()[9] == "2017-03-21");
        assert(fabs(df2.get_column<double>("IBM_Close")[0] - 179.45) < 0.0001);
        assert(fabs(df2.get_column<double>("IBM_Close")[9] - 173.88) < 0.0001);

        StrDataFrame    df3;

        df3.read("SHORT_IBM.csv", io_format::csv2, { .starting_row = 1716, .num_rows = 10 });
        assert(df3.get_index().size() == 5);
        assert(df3.get_column<double>("IBM_Close").size() == 5);
        assert(df3.get_index()[0] == "2020-10-26");
        assert(df3.get_index()[4] == "2020-10-30");
        assert(fabs(df3.get_column<double>("IBM_Close")[0] - 112.22) < 0.0001);
        assert(fabs(df3.get_column<double>("IBM_Close")[4] - 111.66) < 0.0001);
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
        ::exit(-1);
    }
}
```
```cpp
// ----------------------------------------------------------------------------

static void test_read_data_file_with_schema()  {

    std::cout << "\nTesting test_read_data_file_with_schema( ) ..." << std::endl;

    MyDataFrame df1;
    MyDataFrame df2;

    ReadParams::SchemaVec   schema {
        // First is the index column
        //
        { .col_type = file_dtypes::ULONG, .num_rows = 12, .col_idx = 0 },
        { "Open", file_dtypes::DOUBLE, 12, 1 },
        { "High", file_dtypes::DOUBLE, 12, 2 },
        { "Low", file_dtypes::DOUBLE, 12, 3 },
        { "Close", file_dtypes::DOUBLE, 12, 4 },
        { "Adj_Close", file_dtypes::DOUBLE, 12, 5 },
        { "Volume", file_dtypes::LONG, 12, 6 },
    };

    try  {
        df1.read("SchemaWithHeader.csv", io_format::csv2, { .schema = schema });
        df2.read("SchemaWithoutHeader.csv", io_format::csv2,
                 { .skip_first_line = false, .schema = schema });
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
        ::exit(-1);
    }

    assert(df1.get_index().size() == 12);
    assert(df1.get_index()[0] == 1);
    assert(df1.get_index()[6] == 7);
    assert(df1.get_index()[11] == 12);
    assert(df1.get_column<double>("Close").size() == 12);
    assert((std::fabs(df1.get_column<double>("Close")[0] - 185.53) < 0.01));
    assert((std::fabs(df1.get_column<double>("Close")[6] - 187.26) < 0.01));
    assert((std::fabs(df1.get_column<double>("Close")[11] - 190.09) < 0.01));
    assert(df1.get_column<long>("Volume").size() == 12);
    assert(df1.get_column<long>("Volume")[0] == 4546500);
    assert(df1.get_column<long>("Volume")[6] == 4022400);
    assert(df1.get_column<long>("Volume")[11] == 7644600);

    assert(df2.get_index().size() == 12);
    assert(df2.get_index()[0] == 1);
    assert(df2.get_index()[6] == 7);
    assert(df2.get_index()[11] == 12);
    assert(df2.get_column<double>("Close").size() == 12);
    assert((std::fabs(df2.get_column<double>("Close")[0] - 185.53) < 0.01));
    assert((std::fabs(df2.get_column<double>("Close")[6] - 187.26) < 0.01));
    assert((std::fabs(df2.get_column<double>("Close")[11] - 190.09) < 0.01));
    assert(df2.get_column<long>("Volume").size() == 12);
    assert(df2.get_column<long>("Volume")[0] == 4546500);
    assert(df2.get_column<long>("Volume")[6] == 4022400);
    assert(df2.get_column<long>("Volume")[11] == 7644600);
}

// ----------------------------------------------------------------------------

static void test_read_selected_cols_from_file()  {

    std::cout << "\nTesting test_read_selected_cols_from_file( ) ..." << std::endl;

    MyDataFrame df1;
    MyDataFrame df2;

    ReadParams::SchemaVec   schema1 {
        // First is always the index column
        //
        { .col_type = file_dtypes::ULONG, .num_rows = 12, .col_idx = 0 },
        { "Close", file_dtypes::DOUBLE, 12, 4 },
        { "Volume", file_dtypes::LONG, 12, 6 },
    };
    ReadParams::SchemaVec   schema2 {
        // First is always the index column
        //
        { .col_type = file_dtypes::ULONG, .num_rows = 12, .col_idx = 0 },
        { "Open", file_dtypes::DOUBLE, 12, 1 },
        { "Low", file_dtypes::DOUBLE, 12, 3 },
        { "Close", file_dtypes::DOUBLE, 12, 4 },
    };

    try  {
        df1.read("SchemaWithHeader.csv", io_format::csv2, { .schema = schema1 });
        df2.read("SchemaWithoutHeader.csv", io_format::csv2,
                 { .skip_first_line = false, .schema = schema2 });
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
        ::exit(-1);
    }

    const auto  df_shape1 = df1.shape();
    const auto  df_shape2 = df2.shape();

    assert(df_shape1.first == 12);   // Rows
    assert(df_shape1.second == 2);   // Columns
    assert(df_shape2.first == 12);   // Rows
    assert(df_shape2.second == 3);   // Columns

    assert(df1.get_index()[0] == 1);
    assert(df1.get_index()[6] == 7);
    assert(df1.get_index()[11] == 12);
    assert(df1.get_column<double>("Close").size() == 12);
    assert((std::fabs(df1.get_column<double>("Close")[0] - 185.53) < 0.01));
    assert((std::fabs(df1.get_column<double>("Close")[6] - 187.26) < 0.01));
    assert((std::fabs(df1.get_column<double>("Close")[11] - 190.09) < 0.01));
    assert(df1.get_column<long>("Volume").size() == 12);
    assert(df1.get_column<long>("Volume")[0] == 4546500);
    assert(df1.get_column<long>("Volume")[6] == 4022400);
    assert(df1.get_column<long>("Volume")[11] == 7644600);

    assert(df2.get_index()[0] == 1);
    assert(df2.get_index()[6] == 7);
    assert(df2.get_index()[11] == 12);
    assert(df2.get_column<double>("Close").size() == 12);
    assert((std::fabs(df2.get_column<double>("Close")[0] - 185.53) < 0.01));
    assert((std::fabs(df2.get_column<double>("Close")[6] - 187.26) < 0.01));
    assert((std::fabs(df2.get_column<double>("Close")[11] - 190.09) < 0.01));
    assert(df2.get_column<double>("Open").size() == 12);
    assert((std::fabs(df2.get_column<double>("Open")[0] - 187.21) < 0.01));
    assert((std::fabs(df2.get_column<double>("Open")[6] - 188.31) < 0.01));
    assert((std::fabs(df2.get_column<double>("Open")[11] - 188.04) < 0.01));
    assert(df2.get_column<double>("Low").size() == 12);
    assert((std::fabs(df2.get_column<double>("Low")[0] - 185.2) < 0.1));
    assert((std::fabs(df2.get_column<double>("Low")[6] - 186.28) < 0.01));
    assert((std::fabs(df2.get_column<double>("Low")[11] - 187.86) < 0.01));
}
```
```cpp
// ----------------------------------------------------------------------------

static void test_io_format_csv2_with_bars()  {

    std::cout << "\nTesting io_format_csv2_with_bars( ) ..." << std::endl;

    MyDataFrame df_read;

    try  {
        df_read.read("csv2_format_data_with_bars.csv", io_format::csv2, { .delim = '|' });
    }
    catch (const DataFrameError &ex)  {
        std::cout << ex.what() << std::endl;
        ::exit(-1);
    }
    df_read.write<std::ostream, int, unsigned long, unsigned char, char, double, bool, std::string>(std::cout, io_format::csv2);
}
```