# Bucketing

**UUID:** 00000000-0000-0000-0026-000000000001

## Description

Assigns a bucket number (starting at 1) to every value in a selected column. The bucket count is determined by user input. The size of each bucket is ((maximum column value - minimum column value) / bucket count).
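The bucket size formula can be illustrated with a minimal Python sketch. It is not the processor's implementation: the function name `assign_buckets` is hypothetical, and the handling of boundary values (the column maximum is placed in the last bucket, and a column with a single distinct value goes entirely into bucket 1) is an assumption.

```python
def assign_buckets(values, bucket_count):
    """Assign each value a 1-based bucket number using equal-width buckets."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    lowest, highest = min(values), max(values)
    # Bucket size as described above: (maximum - minimum) / bucket count.
    width = (highest - lowest) / bucket_count
    if width == 0:
        # Assumption: if all values are equal, everything lands in bucket 1.
        return [1 for _ in values]
    buckets = []
    for value in values:
        index = int((value - lowest) // width) + 1
        # Assumption: the maximum value belongs to the last bucket
        # instead of opening an extra one.
        buckets.append(min(index, bucket_count))
    return buckets
```

For example, `assign_buckets([0, 5, 10], 2)` uses a bucket size of 5 and returns `[1, 2, 2]` under these assumptions.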

**For more details, refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.bucketed*- Output

## Configurations

#### Selected Column * *[single column selection]*

The column whose minimum and maximum values are determined and whose values are each assigned a bucket number.

#### Bucket Count * *[integer]*

The number of buckets to create. The minimum is 1.

#### Bucket Column Name * *[column name]*

The name of the additional column containing the bucket number.

# Column Summary

**UUID:** 00000000-0000-0000-0145-000000000002

## Description

Computes information about statistical means of the attributes of the data.

**For more details, refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.metrics*- Computed Metrics with Identifier

# Correlation

**UUID:** 00000000-0000-0000-0038-000000000001

## Description

Computes the correlation matrix for the given dataset.

**For more details, refer to the following article.**

## Input(s)

*in.input*- Input

## Output(s)

*out.output*- Correlation Matrix

## Configurations

#### Correlation Method * *[single enum selection]*

Specifies the correlation method.

#### Columns for correlation *[multiple columns selection]*

The columns selected for computing the correlation. If no column is selected here, all suitable (Double, Integer, Numeric) columns in the input are correlated with each other.
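As a rough illustration of the column selection behavior, the following Python sketch builds a correlation matrix. The available correlation methods are not listed here, so the Pearson coefficient is used purely as an assumed example; the function names and the dict-of-lists input layout are hypothetical as well.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and raises ZeroDivisionError.
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns, selected=None):
    """Correlate the selected columns, or all numeric columns if none are selected."""
    names = selected or [
        name for name, values in columns.items()
        if all(isinstance(v, (int, float)) for v in values)
    ]
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

Calling `correlation_matrix({"x": [1, 2, 3], "y": [2, 4, 6], "label": ["a", "b", "c"]})` skips the non-numeric `label` column and correlates `x` and `y` with each other.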

# Distinct Summary

**UUID:** 00000000-0000-0000-0011-000000000002

## Description

Creates summaries by grouping nominally scaled column values and counts the number of rows for each distinct column value.

**For more details, refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.output*- Distinct Summaries

## Configurations

#### Maximum number of distinct values (defaults to 500) *[integer]*

#### Columns to include *[multiple columns selection]*

Selected columns that are processed in addition to all columns with nominally and ordinally scaled values.

#### Enable Failsafe Mode * *[boolean]*

If you expect your data set to contain a very large number of distinct values in its cells (> 100,000), consider enabling this failsafe mode. It triggers a memory-friendly version of the summary computation. However, the memory-friendly version takes longer because the data is grouped multiple times. In short, this toggle trades short execution times for stability.

# Distinct Textual Summary

**UUID:** 00000000-0000-0000-0151-000000000001

## Description

Creates summaries for every column to which the statistics apply. Statistics include the most frequent values, the most frequent patterns (value formats, e.g. number, uppercase and lowercase combinations), the number of invalid rows (the invalid value can be specified) and valid rows, the number of distinct values, and the minimum, mean and maximum value length (for textual representations). The processor outputs these statistics.

**For more details, refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.distincts*- Computed Textual Statistics

## Configurations

#### Invalid String (special treatment in summary) *[string]*

String value that should be treated as invalid and not be used as a distinct value. The number of invalid cells is tracked per column as well. Defaults to an empty string.

#### Distinct values to take *[integer]*

The number of most frequent distinct values to output in the analysis. The values are separated by a "|" token and can be found in the "Most_Frequent_Distinct_Values" column. Must be positive.

#### Distinct Cell formats to take *[integer]*

The number of most frequent distinct formats to output in the analysis. The values are separated by a "|" token and can be found in the "Most_Frequent_Column_Format" column. Must be positive.

#### Special Characters *[string]*

Characters that will be indicated with an "S" in the "Most_Frequent_Column_Format" column. Can also have an effect on the amount of distinct formats recorded in the "Column_Formats" column. Defaults to: /*!@#$%^&*()"{}_[]|\?/<>,
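The idea of a cell format can be sketched in Python as follows. Only the "S" marker for special characters and the "|" separator come from the descriptions above; the symbols "N", "U" and "L" for digits, uppercase and lowercase letters, the function names, and the choice to keep all other characters unchanged are assumptions made for illustration, so the processor's actual format strings may look different.

```python
from collections import Counter

# Default special characters as listed above.
DEFAULT_SPECIAL = r'/*!@#$%^&*()"{}_[]|\?/<>,'

def cell_format(value, special=DEFAULT_SPECIAL):
    """Map every character of a cell value to a format symbol."""
    symbols = []
    for ch in value:
        if ch in special:
            symbols.append("S")   # special character (marker from the docs)
        elif ch.isdigit():
            symbols.append("N")   # assumed symbol for digits
        elif ch.isupper():
            symbols.append("U")   # assumed symbol for uppercase letters
        elif ch.islower():
            symbols.append("L")   # assumed symbol for lowercase letters
        else:
            symbols.append(ch)    # assumption: other characters stay as-is
    return "".join(symbols)

def most_frequent_formats(values, take, special=DEFAULT_SPECIAL):
    """Join the `take` most frequent formats with the "|" token."""
    counts = Counter(cell_format(v, special) for v in values)
    return "|".join(fmt for fmt, _ in counts.most_common(take))
```

Changing the `special` argument changes which characters collapse into "S", which is why the setting can also affect how many distinct formats are recorded.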

# Forecast Metrics

**UUID:** 00000000-0000-0000-0143-000000000001

## Description

Calculates different error measures from forecasts.

**For more details, refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.outMetrics*- Metrics

*out.out*- Errors

## Configurations

#### Prediction column * *[single column selection]*

Column with double value as prediction.

#### Original value column * *[single column selection]*

Column with double value as original value.

#### Grouping columns *[multiple columns selection]*

Can be used to specify columns over which the performance measures are aggregated.
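The following Python sketch shows how error measures might be aggregated over grouping columns. The specific measures the processor computes are not listed here, so mean absolute error (MAE) and root mean squared error (RMSE) are assumed examples; the function name, the row layout (one dict per row) and the output keys are hypothetical.

```python
import math
from collections import defaultdict

def forecast_metrics(rows, prediction, original, group_by=()):
    """Compute MAE and RMSE per group of rows (one group if no grouping columns)."""
    errors_by_group = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in group_by)
        errors_by_group[key].append(row[prediction] - row[original])
    metrics = {}
    for key, errors in errors_by_group.items():
        n = len(errors)
        metrics[key] = {
            "mae": sum(abs(e) for e in errors) / n,
            "rmse": math.sqrt(sum(e * e for e in errors) / n),
        }
    return metrics
```

With no grouping columns, all rows fall into the single group keyed by the empty tuple `()`, so the measures describe the whole data set.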

# Forecast Metrics For Foreach

**UUID:** 00000000-0000-0000-0143-000000000002

## Description

Calculate different error measures from forecasts generated in a foreach branch. Outputs a single-row data set with information (first value of selected column) about the foreach run it was produced in.

**For more details, refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Input carried through

*out.metrics*- Computed Metrics with Identifier

## Configurations

#### Prediction column * *[single column selection]*

Column with double value as prediction.

#### Original value column * *[single column selection]*

Column with double value as original value.

#### Identifier for the foreach run. Select the same column as in the Foreach Destinct Processor preceding this Processor! * *[single column selection]*

The selected column is assumed to always have the same value in all rows of the input data set. The value of the column is used to identify the foreach-run it has been produced in.

# Heuristic Summaries

**UUID:** 00000000-0000-0000-0004-000000000002

## Description

Computes information about statistical means of the attributes of the data.

**For more details, refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.output*- Summary Results

## Configurations

#### Compression size. *[integer]*

Can override the logarithmic compression with a positive fixed value. Defaults to 0 which triggers logarithmic compression. Will ignore values <= 0. Override is experimental!

#### Merge Interval *[integer]*

Interval for local AVLTreeDigest merges. Defaults to 1000. Override with positive integer values. Override is experimental!

#### Grouping columns *[multiple columns selection]*

Columns used for grouping the dataset before computing the statistical means.

# Row Count

**UUID:** 00000000-0000-0000-0027-000000000001

## Description

Counts the (distinct) rows in the dataset.

**For more details, refer to the following article.**

## Input(s)

*in.data*- Input

## Output(s)

*out.out*- Count Output

## Configurations

#### Create distinct row count (additionally to overall row count) * *[boolean]*

When toggled, distinct rows are counted, too.
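The effect of the toggle can be sketched in a few lines of Python. The function name and the output keys `row_count` and `distinct_row_count` are hypothetical; rows are modeled as sequences of hashable cell values.

```python
def row_counts(rows, count_distinct=False):
    """Return the overall row count and, optionally, the distinct row count."""
    result = {"row_count": len(rows)}
    if count_distinct:
        # Two rows count as the same row if all of their cells are equal.
        result["distinct_row_count"] = len({tuple(row) for row in rows})
    return result
```

For the rows `[(1, "a"), (1, "a"), (2, "b")]`, the overall count is 3 and the distinct count is 2.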

# Summaries (Deprecated!)

**UUID:** 00000000-0000-0000-0004-000000000001
**Deprecated**: *This Processor calculates exact values but has rather slow performance. To get an impression of the data, use the Heuristic Summaries. It uses heuristics for median and percentile computation that have a high performance even on large datasets. Only use this processor if you need exact values for median and percentiles.*
**Replaced by:** *Heuristic Summaries*
**Removed:** *true*

## Description

Computes information about statistical means of the attributes of the data.

## Input(s)

*in.data*- Input

## Output(s)

*out.output*- Summary Results