Feature Modules

General Preprocessing

train_valid_test_split

features.preprocessing.train_valid_test_split(X, y=None, props=None, device=None)

Separates data into train, validation, and test sets

Parameters:
  • X (tensor) – Feature data

  • y (tensor) – Target data

  • props (list[int]) – Train/Val split proportions (Test split is inferred from these numbers)

  • device (object) – Device to store data on

Returns:

SplitData object containing train/val/test splits of the data
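
As a rough illustration of how two proportions can determine three splits, the sketch below slices a plain Python list. It is not the library's implementation: the name split_by_props, the reading of props as integer percentages, and the sequential (unshuffled) slicing are all assumptions made for the example.

```python
def split_by_props(rows, props):
    """Slice rows into train/valid/test parts.

    Assumption: props holds two integer percentages, e.g. [70, 15];
    the test share is whatever remains (15 in that case).
    """
    train_pct, valid_pct = props
    if train_pct < 0 or valid_pct < 0 or train_pct + valid_pct > 100:
        raise ValueError("props must be non-negative and sum to at most 100")
    n = len(rows)
    # Integer arithmetic keeps the boundaries exact and reproducible.
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    return rows[:train_end], rows[train_end:valid_end], rows[valid_end:]


# 20 rows split 70/15/15 gives parts of 14, 3, and 3 rows.
train, valid, test = split_by_props(list(range(20)), [70, 15])
```

Slicing in order, rather than shuffling, preserves temporal ordering, which usually matters for time series data; whether the library does the same is not stated here.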

convert_data_points

features.preprocessing.convert_data_points(data)

Converts a DataFrame of trend data to tensors and pads the time series data

Parameters:

data (DataFrame) – DataFrame containing the trend durations, slopes, and corresponding time series data points for each sequence

Returns:

Tensors containing the trend durations and slopes, and the padded time series data points that correspond to them

pad_data

features.preprocessing.pad_data(data)

Pads rows of time series data with zeros to match the longest row

Parameters:

data (DataFrame) – Data to pad

Returns:

Pandas dataframe containing the padded data
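
The padding idea can be sketched with plain lists standing in for DataFrame rows. This is a minimal illustration, not the library's code: pad_rows is a hypothetical name, and padding on the right is an assumption, since the side on which zeros are added is not specified above.

```python
def pad_rows(rows, fill=0):
    """Right-pad every row with `fill` so all rows match the longest one."""
    # default=0 lets an empty collection of rows pass through unchanged.
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


# Rows of length 3, 1, and 2 all come back with length 3.
padded = pad_rows([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])
```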

extract_data

features.preprocessing.extract_data(data, num_input, num_output)

Creates sequences of num_input data points, each paired with the next num_output data points to predict

Parameters:
  • data (tensor) – Data to split into subsequences

  • num_input (int) – Number of input data points in each sequence

  • num_output (int) – Number of output data points to predict for each sequence

Returns:

Two tensors containing the input and output data
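
One common way to build such input/output pairs is a sliding window. The sketch below assumes a stride of one step and uses a plain list in place of a tensor; make_windows is a hypothetical name, and the real function may window the data differently.

```python
def make_windows(series, num_input, num_output):
    """Pair each run of num_input points with the num_output points after it."""
    inputs, outputs = [], []
    # Last start index that still leaves room for a full input and output.
    last_start = len(series) - num_input - num_output
    for start in range(last_start + 1):
        split = start + num_input
        inputs.append(series[start:split])
        outputs.append(series[split:split + num_output])
    return inputs, outputs


# Six points with 3 inputs and 2 outputs yield two windows.
X, y = make_windows([1, 2, 3, 4, 5, 6], num_input=3, num_output=2)
```

A series shorter than num_input + num_output produces no windows at all in this sketch, rather than a partial one.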

SplitData

class features.preprocessing.SplitData(data_labels=['X_train', 'y_train', 'X_valid', 'y_valid', 'X_test', 'y_test'])

Initializes a SplitData object, which stores any number of named data sets and can merge its data with other SplitData objects that have the same sets

Parameters:

data_labels (list[str], optional) – List of labels corresponding to the names of each set

Returns:

None

add(label, data)

Adds data to a specified set based on labels

Parameters:
  • label (string) – Label of the set to add to (e.g. "X_train", "X_valid")

  • data (tensor) – Data to add

Returns:

None

get(label)

Retrieves data from a specified set based on the label

Parameters:

label (string) – Label of the set to retrieve (e.g. "X_train", "X_valid")

Returns:

Tensor containing data from the specified set

merge(data)

Merges data from another SplitData object into the sets with matching labels

Parameters:

data (SplitData object) – SplitData object whose sets are merged into this one

Returns:

None
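
To make the add/get/merge behaviour concrete, here is a minimal sketch of a label-keyed container. It is an assumption-laden stand-in, not the actual class: SplitDataSketch is a hypothetical name, plain lists replace tensors, and "merge" is taken to mean appending the other object's rows to the set with the same label.

```python
class SplitDataSketch:
    """Hypothetical stand-in for a label-keyed collection of data sets."""

    DEFAULT_LABELS = ("X_train", "y_train", "X_valid", "y_valid", "X_test", "y_test")

    def __init__(self, data_labels=DEFAULT_LABELS):
        # One empty list per label; a real implementation would hold tensors.
        self._sets = {label: [] for label in data_labels}

    def add(self, label, data):
        """Append rows to the set stored under `label`."""
        self._sets[label].extend(data)

    def get(self, label):
        """Return the rows stored under `label`."""
        return self._sets[label]

    def merge(self, other):
        """Append every set from `other` to the set with the same label."""
        if set(other._sets) != set(self._sets):
            raise ValueError("both objects must use the same set labels")
        for label, rows in other._sets.items():
            self._sets[label].extend(rows)


first = SplitDataSketch()
first.add("X_train", [[0.1], [0.2]])
second = SplitDataSketch()
second.add("X_train", [[0.3]])
first.merge(second)  # first now holds all three training rows
```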

Linear Approximation

LinearApproximation

class features.linear_approximation.LinearApproximation(max_error, min_segment_length, data=None, target_col=None, date_index=None)

Creates a LinearApproximation object that extracts trend durations and slopes from raw time series data

Parameters:
  • max_error (float) – Maximum error allowed in each segment during linear approximation

  • min_segment_length (int) – Minimum number of data points in each segment during linear approximation

  • data (Pandas DataFrame) – Dataset to process

  • target_col (string or int) – Name or index of target column

  • date_index (string or int) – Name or index of date column

Returns:

None

add_data(data, target_col, date_index='date_index')

Loads the data to process, along with its date index column and target prediction column

Parameters:
  • data (Pandas DataFrame) – Dataframe to process

  • target_col (string or int) – Name or index of target column

  • date_index (string or int) – Name or index of date column

Returns:

None

best_line(i, upper_bound)

Calculates end index of current window in linear approximation algorithm

Parameters:
  • i (int) – Starting index of current window

  • upper_bound (int) – Maximum size of window

Returns:

Ending index of current window

Return type:

int

bottom_up(i, j)

Performs the bottom-up algorithm on data[i:j], as described in http://www.cs.ucr.edu/~eamonn/icdm-01.pdf, and returns a list of segments represented by indices

Parameters:
  • i (int) – Starting index of current window

  • j (int) – Ending index of current window

Returns:

segments (2-D list), where segments[k] = [starting index of segment k, ending index of segment k]

Return type:

list[list]
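
The general shape of a bottom-up segmentation can be sketched as follows: start from the finest possible segments, then repeatedly merge the adjacent pair whose merged line fit is cheapest, stopping once the cheapest merge would reach max_error. This is a simplified illustration under stated assumptions, not the method above: fit_error and bottom_up_sketch are hypothetical names, the error measure is assumed to be the sum of squared residuals of a least-squares line, end indices are treated as inclusive, and min_segment_length is ignored.

```python
def fit_error(values, start, end):
    """Sum of squared residuals of a least-squares line over values[start:end + 1]."""
    n = end - start + 1
    xs = range(start, end + 1)
    ys = values[start:end + 1]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # a single point has no slope
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def bottom_up_sketch(values, max_error):
    """Greedily merge adjacent segments while the cheapest merge stays under max_error."""
    # Finest split: consecutive pairs of points, as [start, end] inclusive.
    segments = [[i, min(i + 1, len(values) - 1)] for i in range(0, len(values), 2)]
    while len(segments) > 1:
        # Cost of merging each segment with its right-hand neighbour.
        costs = [fit_error(values, segments[k][0], segments[k + 1][1])
                 for k in range(len(segments) - 1)]
        best = min(range(len(costs)), key=costs.__getitem__)
        if costs[best] >= max_error:
            break
        segments[best] = [segments[best][0], segments[best + 1][1]]
        del segments[best + 1]
    return segments


# A rising run followed by a falling run collapses into two segments.
segments = bottom_up_sketch([0, 1, 2, 3, 3, 2, 1, 0], max_error=0.5)
```

Recomputing every merge cost on each pass keeps the sketch short; a production version would typically update only the costs next to the merged segment.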

process_data()

Transforms the original data into a Pandas DataFrame describing its trends. Each row of the DataFrame (trends[i]) holds [trend_duration[i], trend_slope[i], original data points that make up trends[i]]

Returns:

DataFrame of processed data

save_to_csv(file_path)

Saves the transformed data to a CSV file for later use

Parameters:

file_path (string) – File path to save the CSV to

Returns:

Whether the CSV file was saved

Return type:

boolean

Scaler

MultiScaler

class features.scaler.MultiScaler(num_sources)

Scales multiple sources of data using the Scaler class. See Scaler for more details.

Parameters:

num_sources (int) – Number of different sources to scale

Returns:

None

fit_transform(data)

Fits and scales all data sources. Use None to fill in missing data sources.

Parameters:

data (list[tensor or dataframe or series]) – Data to fit and transform

Returns:

List of all scaled data

inverse_transform(data)

Reverts scaled data to its original values. Use None to fill in missing data sources.

Parameters:

data (list[tensor]) – Data to revert back to original values

Returns:

List of tensors of inversely transformed values

transform(data)

Transforms all data sources using the previously fitted scalers. Use None to fill in missing data sources.

Parameters:

data (list[tensor or dataframe or series]) – Data to transform

Returns:

List of all scaled data
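
The "use None to fill in missing data sources" convention can be pictured as one scaler per list position, with None entries skipped. The sketch below is a guess at that behaviour, not the actual class: MultiScalerSketch and _UnitScaler are hypothetical names, plain lists replace tensors, and returning None for a skipped source is an assumption.

```python
class _UnitScaler:
    """Hypothetical single-source min-max scaler used only for this sketch."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # constant data: avoid dividing by zero
        return [(v - self.low) / span for v in values]


class MultiScalerSketch:
    """Keeps one scaler per source; None marks a source with no data."""

    def __init__(self, num_sources):
        self.scalers = [_UnitScaler() for _ in range(num_sources)]

    def fit_transform(self, data):
        if len(data) != len(self.scalers):
            raise ValueError("expected one entry (or None) per source")
        scaled = []
        for scaler, values in zip(self.scalers, data):
            if values is None:
                # Assumption: a missing source is passed through as None.
                scaled.append(None)
                continue
            scaler.fit(values)
            scaled.append(scaler.transform(values))
        return scaled


multi = MultiScalerSketch(num_sources=3)
result = multi.fit_transform([[1.0, 3.0], None, [0.0, 5.0, 10.0]])
```

Keeping the list positions fixed means each source is always handled by the scaler that was fitted on it, even when some sources are absent from a given call.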

Scaler

class features.scaler.Scaler

Scales data using scikit-learn’s MinMaxScaler. Generalized to accept tensors.

Returns:

None

fit_transform(data)

Fits and scales data

Parameters:

data (tensor or dataframe or series) – Data to fit and transform

Returns:

Tensor of scaled data values

inverse_transform(data)

Reverts scaled data to its original values

Parameters:

data (tensor or dataframe or series) – Data to revert back to original values

Returns:

Tensor of inversely transformed values

transform(data)

Scales data using the previously fitted scaler

Parameters:

data (tensor or dataframe or series) – Data to transform

Returns:

Tensor of scaled data values
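
For readers unfamiliar with min-max scaling, the arithmetic behind fit_transform, transform, and inverse_transform can be sketched in a few lines. This illustrates the technique with the default [0, 1] output range only; MinMaxSketch is a hypothetical name, plain lists replace tensors, and the real class delegates to scikit-learn rather than computing this by hand.

```python
class MinMaxSketch:
    """Hypothetical min-max scaler mapping fitted data onto [0, 1]."""

    def fit_transform(self, values):
        # Remember the range of the fitting data, then scale it.
        self.low, self.high = min(values), max(values)
        return self.transform(values)

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # constant data: avoid dividing by zero
        return [(v - self.low) / span for v in values]

    def inverse_transform(self, scaled):
        span = (self.high - self.low) or 1.0
        return [s * span + self.low for s in scaled]


scaler = MinMaxSketch()
scaled = scaler.fit_transform([10.0, 15.0, 20.0])   # fitted range is 10..20
restored = scaler.inverse_transform(scaled)          # back to the original values
unseen = scaler.transform([25.0])                    # outside the fitted range
```

Values outside the fitted range, such as 25.0 here, scale to numbers outside [0, 1]; this is why a scaler is normally fitted on training data and then only applied to validation and test data.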