Parameters that change the default behavior of .automl.fit
AutoML user-modifiable parameters
parameter | use |
---|---|
aggregationColumns | Aggregation columns (FRESH only) |
crossValidationFunction | Cross validation function |
crossValidationArgument | Number of folds/percentage of data in validation set |
functions | Functions to be applied for feature extraction |
gridSearchFunction | Grid search function |
gridSearchArgument | Number of folds/percentage of data in validation set |
holdoutSize | Size of holdout set used |
hyperparameterSearchType | Form of hyperparameter search to perform |
loggingDir | Directory to save log files in |
loggingFile | Name of logging file produced for a run |
numberTrials | Number of random/sobol hyperparameters to generate |
overWriteFiles | Overwrite any saved models or log files that exist |
predictionFunction | Fit-predict function to be applied |
pythonWarning | Should Python warnings be displayed |
randomSearchFunction | Random search function |
randomSearchArgument | Number of folds/percentage of data in validation set |
savedModelName | Name assigned to a run of AutoML |
saveOption | Option for what is to be saved to disk during a run |
scoringFunctionClassification | Scoring functions for classification tasks |
scoringFunctionRegression | Scoring functions for regression tasks |
seed | Random seed to be used |
significantFeatures | Feature significance procedure to be applied to data |
targetLimit | Ignore NN models when above this number of targets |
testingSize | Size of testing set on which final model is tested |
trainTestSplit | Train-test split function to be applied |
w2v | Word2Vec embedding methodology used (NLP only) |
The other sections describe the default behavior of the framework, when the last argument of .automl.fit is the generic null (::).
The argument can be used to change the default behavior. Replace the null with either
- a dictionary
- the path to a JSON file
of non-default parameter values. The parameter names above are the keys of either the dictionary or the JSON object.
The parameters are illustrated below as q dictionary entries; they can also be set in JSON files.
The defaults are defined in automl/code/customization/configuration/default.json, which you can modify.
Alternatively, make one or more custom parameter sets: save versions of default.json in the sibling folder customConfig:
automl
└── code
└── customization
└── configuration
├── default.json
└── customConfig
├── custom1.json
└── custom2.json
Use it (as a symbol, string, or file symbol) as the last argument to .automl.fit.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Multi-classification target
target:100?5
// Feature extraction type
ftype:`normal
// Problem type
ptype:`class
// Custom configuration file
params:"custom2.json"
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Columns to be used for aggregations in FRESH
By default, the aggregation column for any FRESH-based feature extraction is assumed to be the first column in the dataset. In certain circumstances this may not be sufficient, and a more complex aggregation setup may be required, as outlined below.
// Characteristic vector
v:100?50
// FRESH feature table
features:([]timestamp:"p"$v;v:v;100?1f;100?1f;100?1f)
// Target vector
target:count[distinct v]?1f
// Feature extraction type
ftype:`fresh
// Problem type
ptype:`reg
// In this example we want `timestamp`v as aggregation columns
params:enlist[`aggregationColumns]!enlist`timestamp`v
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Cross-validation function and number of folds/percentage of data in validation set
crossValidationFunction
is the name of the cross-validation function to apply as a symbol and crossValidationArgument
is the associated argument – either the number of folds to apply or the percentage of data in the validation set.
By default, the cross-validation procedure being implemented is a 5-fold shuffled cross validation using the function .ml.xv.kfShuff
. You can augment this for different use cases.
For example, you could change crossValidationFunction
to .ml.xv.tsRolls
to suit a more timeseries-specific problem and change crossValidationArgument
to 7 to split the data into more folds than the default configuration.
For simplicity, where possible, use the functions within the .ml.xv
namespace for this task.
// Non-timeseries (normal) feature data
features:([]asc 100?1f;100?1f;100?1f)
// Target vector
target:asc 100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Change cross validation procedure
// Use percentage split, with 20% of data in the validation set
params:`crossValidationFunction`crossValidationArgument!
(`.ml.xv.pcSplit;.2)
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Custom cross-validation function
To add a custom cross-validation function to those provided, follow the guidelines for function definition.
Contact [email protected] with questions on this: it is more complicated than other customizations.
Functions to be applied for feature extraction
By default, the feature-extraction functions applied for any FRESH-based problem are those contained in .ml.fresh.params
. This comprises approximately 60 functions. To augment or apply a subset of these functions see the example below and the instructions.
By default, normal feature extraction simply entails the decomposition of any temporal types into their component parts. You can augment this to add new functionality by supplying a list of functions, each of which must take a simple table as input and return a simple table (a sketch follows the FRESH example below).
By default, feature-extraction steps taken for NLP models include parsing the text data using .nlp.newParser
and applying sentiment analysis, regular-expression searching and named-entity recognition tagging. The text is then vectorized using a Word2Vec
model and concatenated with the created features. Normal feature extraction is then applied to any remaining non-textual columns. As above, you can augment the normal feature extraction.
// Characteristic vector
v:100?50
// Feature table
features:([]tm:"t"$v;asc 100?1f;100?1f;100?1f;100?1f)
// FRESH target vector
target:count[distinct v]?1f
// Feature extraction type
ftype:`fresh
// Problem type
ptype:`reg
// Select functions which only take data as input with no extra parameters
dataFuncs:select from .ml.fresh.params where pnum=0
params:enlist[`functions]!enlist dataFuncs
// Run feature extraction using user defined function table for FRESH
.automl.fit[features;target;ftype;ptype;params]
❗ Do not add data rows
A user-defined function for feature extraction should take a simple table as input and return a simple table with the desired feature-extraction procedures applied.
Changing the number of rows in the dataset will cause errors in the pipeline.
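For normal feature extraction, a minimal sketch of such a user-defined function is shown below. The function name squareFeats is hypothetical (not part of AutoML): it appends a squared copy of every float column while leaving the row count unchanged, satisfying the input/output contract above.
// A minimal sketch of a user-defined feature-extraction function:
// input a simple table, output a simple table with the same row count,
// here appending a squared copy of every float column
// (squareFeats is a hypothetical name, not part of AutoML)
squareFeats:{[tab]
  c:exec c from meta[tab]where t="f";            / float columns only
  tab,'flip(`$string[c],\:"_sq")!tab[c]*tab c}   / append squared copies
// assumed here to be referenced by name via the functions parameter,
// mirroring the other examples on this page
params:enlist[`functions]!enlist`squareFeats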
Grid search function and number of folds/percentage of data in validation set
gridSearchFunction
is the name of the grid-search function to apply as a symbol, while gridSearchArgument
is an argument associated with this function defining either the number of folds to apply or the percentage of data in the validation set.
By default, the grid-search procedure being implemented is a 5-fold shuffled grid search using the function .automl.gs.kfshuff.
You can augment this for different use cases.
For example, when using timeseries data, you could use a method like chain-forward grid search, .automl.gs.tschain
, in the ML Toolkit, paired with three folds.
For simplicity, use the functions within the .automl.gs
namespace for this task.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Change hyperparameter search procedure
// Use roll-forward grid search with 6 folds
params:`gridSearchFunction`gridSearchArgument!
(`.ml.gs.tsRolls;6)
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Custom grid-search function
To add a custom grid search function, follow the guidelines for function definition.
Contact [email protected] with questions on this: it is more complicated than other customizations.
Size of holdout set used to validate the models run
By default the holdout set across all problem types is set to 20%. For problems with a small number of data points, you may wish to increase the number of data points being trained on. The opposite may be true for larger datasets.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Set the holdout set to contain 10% of the dataset
params:enlist[`holdoutSize]!enlist .1
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Type of hyperparameter search to perform
By default, an exhaustive grid search is applied to the best model found for a given dataset. Random or Sobol-random methods are also available within AutoML and can be applied by changing the parameter hyperparameterSearchType.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Change hyperparameter search procedure
// Use random search
params:enlist[`hyperparameterSearchType]!enlist`random
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
// Change hyperparameter search procedure
// Use Sobol-random search
params:enlist[`hyperparameterSearchType]!enlist`sobol
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Directory to store logging files
When .automl.utils.logging
is 1b
, this parameter sets the directory (relative to the current directory) in which the log file is stored.
By default, the log file is saved to the same directory as the reports, models, metadata and images.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Enable logging
.automl.updateLogging[]
q)// Check to ensure logging is enabled
q).automl.utils.logging
1b
q)// Set the logging directory to logDir
q)params:enlist[`loggingDir]!enlist"logDir"
q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]
Name of saved logging file
When .automl.utils.logging
is 1b
, this is the name of the saved log file.
By default, the log file is named logFile_date_time.txt.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Enable logging
.automl.updateLogging[]
q)// Check to ensure logging is enabled
q).automl.utils.logging
1b
q)// Define the name of the logging file
q)params:enlist[`loggingFile]!enlist"logFileNew"
q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]
Number of random/Sobol-random hyperparameters to generate
For the random and Sobol-random hyperparameter search methods, a user-specified number of hyperparameter sets is generated for a given hyperparameter space.
For Sobol, the number of trials must equal a power of 2 (2^n, as in the example below).
The default for both cases is 256.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Random search - set number of hyperparameter sets
params:`hyperparameterSearchType`numberTrials!(`random;10)
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
// Sobol-random search - set number of hyperparameter sets to equal 2^n
q)show n:"j"$xexp[2;9]
512
q)params:`hyperparameterSearchType`numberTrials!(`sobol;n)
q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]
Overwrite any saved models or log files that exist
If a defined savedModelName
or loggingFile
of the same name already exists in the system, setting this parameter to 1b
will allow .automl.fit
to overwrite these files.
By default the value is 0b
and the code will exit with a warning message if the files already exist.
q)// Non-timeseries (normal) feature table
q)features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
q)// Binary-classification target
q)target:100?0b
q)// Feature extraction type
q)ftype:`normal
q)// Problem type
q)ptype:`class
q)// Use a `savedModelName` that already exists
q)params:enlist[`savedModelName]!enlist"test"
q)// AutoML returns an error because the savePath already exists
q).automl.fit[features;target;ftype;ptype;params]
Error: The savePath chosen already exists, this run will be exited
q)// Set overWriteFiles to 1b
q)show params,:enlist[`overWriteFiles]!enlist 1b
savedModelName| "test"
overWriteFiles| 1b
q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]
modelInfo| `startDate`startTime`featureExtractionType`problemType`..
predict | {[config;features]
original_print:utils.printing;
utils.printi..
Fit-predict function to be applied
Ternary fit-predict function applied during cross validation and hyperparameter search. In both cases, models are fitted on a training set and the predicted scores are returned based on the supplied scoring function.
Syntax: myFun[func;hyperParam;data]
Where
- func is a scoring function that takes parameters and data as input and returns an appropriate score
- hyperParam is a dictionary of hyperparameters to be searched
- data is the data split into training and testing sets, of the format ((xtrn;ytrn);(xval;yval))
myFun returns the predicted and true validation values.
By default .automl.utils.fitPredict
is used.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Define updated prediction function
fitPredictUpd:{[func;hyperParam;data]
  numpyArray:.p.import[`numpy]`:array;
  preds:@[.[func[][hyperParam]`:fit;numpyArray data 0]`:predict;data[1]0]`;
  (preds;data[1]1)}
params:enlist[`predictionFunction]!enlist fitPredictUpd
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Display Python warnings
Boolean atom: whether Python warning messages are to be displayed to standard output.
By default this is 0b.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Binary-classification target
target:100?0b
// Feature extraction type
ftype:`normal
// Problem type
ptype:`class
// Set python warnings to display to standard output
params:enlist[`pythonWarning]!enlist 1b
q).automl.fit[features;target;ftype;ptype;params]
Executing node: automlConfig
Executing node: configuration
Executing node: targetDataConfig
Executing node: targetData
Executing node: featureDataConfig
Executing node: featureData
Executing node: dataCheck
Executing node: featureDescription
Executing node: dataPreprocessing
Executing node: featureCreation
Executing node: labelEncode
Executing node: featureSignificance
Executing node: trainTestSplit
Executing node: modelGeneration
Executing node: selectModels
Executing node: runModels
Executing node: optimizeModels
/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_pe...
/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_pe...
...
Random search function and number of folds/percentage of data in validation set
randomSearchFunction
is the name of the random search function to apply as a symbol, while randomSearchArgument
is an argument associated with this function defining either the number of folds to apply or the percentage of data in the validation set.
By default, the random search procedure being implemented (assuming hyperparameterSearchType
is set to `random
) is a 5-fold shuffled random search using the function .automl.rs.kfshuff.
You can augment this for different use cases.
For example, when using timeseries data, you might wish to use a method like chain-forward random search, .automl.rs.tschain
, in the ML Toolkit, paired with three folds.
For simplicity, use the functions within the .automl.rs
namespace for this task.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Change hyperparameter search procedure
// Use percentage split random search with 20% validation set
params:`hyperparameterSearchType`randomSearchFunction`randomSearchArgument!
(`random;`.ml.rs.pcSplit;.2)
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
// Use chain-forward Sobol-random search function with 6-folds
params:`hyperparameterSearchType`randomSearchFunction`randomSearchArgument!
(`sobol;`.ml.rs.tsChain;6)
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Custom random/Sobol-random search function
To add a custom random/Sobol-random search function, follow the guidelines for function definition.
Contact [email protected] with questions on this: it is more complicated than other customizations.
Folder name where all outputs related to a run will be saved
The folder created is saved within /outputs/namedModels/.
By default, outputs are saved to a folder named by the start date/time of the run, in the format /outputs/dateTimeModels/date/run_time.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Define the folder name where outputs are to be saved
params:enlist[`savedModelName]!enlist"exampleModel"
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Option defining what is to be saved to disk during a run
There are three options.

option | effect |
---|---|
0 | Save nothing: the models run, but nothing is persisted to disk |
1 | Save model/metadata only: images and reports are not generated |
2 | Save all: reports, images, metadata and models are saved to disk |

The default is 2 – save everything.
Example:
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Save only the minimal outputs
params:enlist[`saveOption]!enlist 1
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
// No outputs saved
params:enlist[`saveOption]!enlist 0
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Scoring functions used in model validation and optimization
The scoring metrics used to evaluate model performance for classification and regression tasks are defined respectively by the parameters
scoringFunctionClassification
scoringFunctionRegression
The following functions are currently supported within the framework, along with the score ordering that allows the best model to be chosen, as defined in automl/code/customization/scoring/scoring.json.

function | statistical analysis metric | score order |
---|---|---|
.ml.accuracy | accuracy of classification results | desc |
.ml.mae | mean absolute error | asc |
.ml.mape | mean absolute percentage error | desc |
.ml.matthewCorr | Matthews correlation coefficient | desc |
.ml.mse | mean square error | asc |
.ml.rmse | root mean square error | asc |
.ml.rmsle | root mean square logarithmic error | asc |
.ml.r2Score | r2-score | desc |
.ml.smape | symmetric mean absolute percentage error | desc |
.ml.sse | sum squared error | asc |

The default values for these two parameters are

parameter | default |
---|---|
scoringFunctionRegression | .ml.mse |
scoringFunctionClassification | .ml.accuracy |
Example: modifying the regression metric
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Use mean absolute error as scoring function
params:enlist[`scoringFunctionRegression]!enlist`.ml.mae
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
To use a custom scoring metric function, define it in the central process and add it to automl/code/customization/scoring/scoring.json.
The function must be binary, with vector arguments
- predicted labels
- true labels
and must return the score.
Functions within the ML Toolkit which take additional parameters, such as .ml.f1Score, can be accessed in this way by defining them as a projection, as sketched below.
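A minimal sketch of that approach, assuming the ML Toolkit signature .ml.f1Score[pred;true;posClass] (f1Binary is a hypothetical name, not part of AutoML):
// Hypothetical example: fix the positive-class label of .ml.f1Score as a
// projection, leaving a binary function of predicted and true labels
f1Binary:.ml.f1Score[;;1]
// after adding a matching entry (with its score order) to
// automl/code/customization/scoring/scoring.json, reference it by name
params:enlist[`scoringFunctionClassification]!enlist`f1Binary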
The seed used to ensure model reruns are consistent
By default, each run of the framework is completed with a ‘random’ seed derived from the time of the run. The seed can be set to a user-specified value to ensure results are consistent across runs, thus allowing the impact of modifications to the pipeline to be accurately monitored.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// User-defined seed
params:enlist[`seed]!enlist 42
// Run AutoML - can run twice to show consistency
.automl.fit[features;target;ftype;ptype;params]
💡 Reproducing results
For full reproducibility between q processes of the NLP word2vec implementation, set the PYTHONHASHSEED environment variable upon initializing q.
Linux/macOS:
PYTHONHASHSEED=0 q
Windows:
set PYTHONHASHSEED=0
Feature significance function to be applied to data to reduce feature set
By default, the system will apply the feature-significance tests within AutoML:
.automl.featureSignificance.significance
The function uses the Benjamini-Hochberg-Yekutieli (BHY) procedure to identify significant features within the dataset. If no significant columns are returned, the top 25th percentile of features will be selected.
Users can alter AutoML to apply different significance tests as shown below.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Define the function to be applied for feature significance tests
newSigFeats:{.ml.fresh.significantFeatures[x;y;.ml.fresh.kSigFeat 2]}
// Pass in new function as a symbol
params:enlist[`significantFeatures]!enlist`newSigFeats
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
An alternative function must be binary, with arguments
- simple feature table
- target vector
and return a list of table columns deemed significant.
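For instance, a minimal sketch of an alternative significance function (corrSig and the 0.2 threshold are illustrative only, not part of AutoML) could keep the columns most correlated with the target:
// Hypothetical example: keep columns whose absolute correlation with the
// target exceeds an arbitrary threshold of 0.2
corrSig:{[tab;tgt]c where 0.2<abs tab[c:cols tab]cor\:tgt}
params:enlist[`significantFeatures]!enlist`corrSig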
Number of targets above which long-running models are removed
If the number of targets in the dataset exceeds this, the following models will be removed from the processing stage: keras
, svm
, neuralNetwork
The default value is 10,000.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Lower the target limit
params:enlist[`targetLimit]!enlist 1000
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Size of testing set on which final model is tested
By default the testing set across all problem types is set to 20%. For problems with a small number of data points, you may wish to increase the number of data points being trained on. The opposite may be true for larger datasets.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg
// Set the testing set to contain 30% of the dataset
params:enlist[`testingSize]!enlist .3
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
Function used to split the data into training and testing sets
Default functions for splitting the data into training and testing sets:
problem type | function | description |
---|---|---|
Normal | .ml.trainTestSplit | Shuffle the dataset and split into training and testing sets with a defined percentage in each |
FRESH | .automl.ttsNonShuff | Without shuffling, split the dataset into training and testing sets with a defined percentage in each, to ensure no time leakage |
NLP | .ml.trainTestSplit | Shuffle the dataset and split into training and testing sets with a defined percentage in each |
For specific use cases this may not be sufficient. For example, if you wish to split the data such that an equal distribution of target classes occurs in the training and testing sets, this could be implemented as follows.
// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Multi-classification target
target:100?5
// Feature extraction type
ftype:`normal
// Problem type
ptype:`class
// Create new TTS function
ttsStrat:{[x;y;sz]
`xtrain`ytrain`xtest`ytest!
raze(x;y)@\:/:r@'shuffle each r:(,'/){
x@(0,floor n*1-y)_neg[n]?n:count x
}[;sz]each value n@'shuffle each n:group y }
// Update parameters
params:enlist[`trainTestSplit]!enlist`ttsStrat
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]
An alternative function for this must take arguments
- simple table
- target vector
- size-splitting criterion (number of folds, or percentage of data used to validate the model)
and return a dictionary with keys `xtrain`ytrain`xtest`ytest, where the x components are tables containing the split data and the y components are the associated target vectors.
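As another minimal sketch conforming to this interface (ttsSeq is a hypothetical name, and sz is assumed to be a percentage), a plain sequential split that holds out the final portion of rows for testing:
// Hypothetical example: sequential split, holding out the final sz fraction
// of rows as the testing set
ttsSeq:{[tab;tgt;sz]
  i:floor count[tab]*1-sz;
  `xtrain`ytrain`xtest`ytest!(i#tab;i#tgt;i _ tab;i _ tgt)}
params:enlist[`trainTestSplit]!enlist`ttsSeq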
Word2Vec method used for NLP models
Methods:

value | method |
---|---|
0 | Continuous-Bag-of-Words (default) |
1 | skip-gram |
🌐 A crash course in word embedding
q)// NLP feature table
q)3#features
comment ..
-----------------------------------------------------------------------------..
"If you like plot turns, this is your movie. It is impossible at any moment t..
"It's a real challenge to make a movie about a baby being devoured by wild ca..
"What a good film! Made Men is a great action movie with lots of twists and t..
q)// Binary-classification target
q)target:count[features]?0b
q)// Feature extraction type
q)ftype:`nlp
q)// Problem type
q)ptype:`class
q)// Apply skip-gram (1)
q)params:enlist[`w2v]!enlist 1
q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]