While AutoML provides a solid foundation for generating machine-learning models, you may wish to modify or extend the capabilities of the framework to suit a specific use-case. There are two ways:
- JSON files that allow you to add custom scoring functions, add or remove models, update hyperparameters and change general configuration of an autoML run
- Within a q session you can update general configuration by passing a dictionary containing suitable key-value pairs as input.
For ease of use, v0.3.0
of the framework moves all configuration definitions to JSON file format insead of the earlier q configuration.
There are at present five files in this format.
Names in the table below are paths relative to the code/customization
directory of the repository.
folder | file | file description |
---|---|---|
scoring | scoringFunctions.json | Definitions of all the scoring functions which can be used in optimization and if the optimal model requires ascending/descending data. |
configuration | default.json | Default values for all the modifiable parameters outlined within Advanced parameter modifications and all information needed when running AutoML from command line only. |
models/modelConfig | models.json | Definition of all models supported for classification and regression tasks. |
hyperParameters | gsHyperparameters.json | Definition of hyperparameter sets used by each model when using an exhaustive search method. |
hyperParameters | rsHyperparameters.json | Definition of hyperparameter sets used by each model when using a random/pseudo-random search method. |
The following is a guide to a representative subset of supported changes to each file.
The scoring functions JSON file, scoringFunctions.json
contains a set of scoring functions defined as follows
".ml.accuracy":{
"orderFunc":"desc"
},
".ml.mse":{
"orderFunc":"asc"
},
Above, a mapping between function .ml.accuracy
and the expected ordering of results that finds the best model: in this case descending, as the best model is the one returning the largest result. For function .ml.mse
the best model is found when the mean squared error is minimized; hence the ordering function is asc
.
You are unlikely to need to change the currently-defined functions. However to use a custom model optimization function, it must be defined in this file.
The default configuration JSON file default.json
is used in two contexts:
- As the source of default parameters for
.automl.fit
- As the reference file for generating custom configuration files, which can be used in two settings:
- For running the entirety of the command-line interface logic, including instructions for data retrieval.
- To leave the default configuration file untouched and use custom configurations for domain-specific problems in a multi-user environment.
Major sections of the JSON file:
section | description |
---|---|
problemDetails |
Required when using the command-line interface to run the entirety of the .automl.fit function: information about the problem being solved required to generate a full run. |
retrievalMethods |
Required when using the command-line interface to run the entirety of the .automl.fit function: instructions for retrieving the target vector and feature data using various methods. |
problemParameters |
Defines default parameters to be used when a configuration file of this type is loaded. Problem type-specific parameters and parameters generally applicable are separated to avoid confusion. |
Example:
"problemDetails":{
"featureExtractionType":"",
"problemType" :"",
"modelName" :"",
"dataRetrievalMethod" :{
"featureData":"",
"targetData" :""
}
Parameters:
key | value | options |
---|---|---|
featureExtractionType | method for generating features | normal | fresh | nlp |
problemType | form of problem | class | reg |
modelName | unique name for model (optional) | |
dataRetrievalMethod | how to retrieve feature and target data | ipc | csv | binary |
⚠️ Command line onlyThe parameters above (apart from
modelName
) can be invoked only when running the entirety of the AutoML framework from the command line. These entries are ignored when running from q directly, or updating the default configuration on the command line.
Structure:
"ipc":{
"featureData":{
"port" :"",
"select":""
},
"targetData":{
"port" :"",
"select":"",
"targetColumn":""
}
},
"csv":{
"featureData":{
"schema" :"",
"separator":"",
"directory":"",
"fileName" :""
},
"targetData":{
"schema" :"",
"separator" :"",
"directory" :"",
"fileName" :"",
"targetColumn":""
}
},
"binary":{
"featureData":{
"directory":"",
"fileName" :""
},
"targetData":{
"directory" :"",
"fileName" :"",
"targetColumn":""
}
}
ipc
: Retrieve feature and target data via localhost IPC
key | type | value | example |
---|---|---|---|
port |
long | localhost port from which to retrieve the dataset | 5000 |
select |
string | q expression to evaluate on the port to retrieve data | select from tab where x<100 |
targetColumn |
string | when feature and target datasets are from the same port and select expression, retrieve the data once and select this target column |
target |
csv
: Retrieve feature and target data from a CSV file
key | type | value | example/s |
---|---|---|---|
schema |
string | schema for CSV data from disk, as per Tok | "SPJJFFI" |
separator |
char | CSV field separator | "," | "/\t" |
directory |
string | filepath to the CSV’s parent directory, relative to the directory from which AutoML was loaded | "testDirectory/test1" |
fileName |
string | name of the CSV | "testFile.csv" |
targetColumn |
string | when feature and target datasets are from the same port and select expression, retrieve the data once and select this target column |
target |
binary
: Retrieve feature and target data from a kdb+ binary file
key | type | value | example |
---|---|---|---|
directory |
string | filepath to the binary’s parent directory, relative to the directory from which AutoML was loaded | "testDirectory/test1" |
fileName |
string | name of the kdb+ binary file | "testFile.csv" |
targetColumn |
string | when feature and target datasets are from the same port and select expression, retrieve the data once and select this target column |
target |
The problemParameters
section defines default parameters for AutoML, as in the Advanced parameter modifications section.
section | parameters | examples |
---|---|---|
general | applicable for all use cases | seed testingSize |
normal | configurable for the application of normal feature extraction/use cases | trainTestSplit functions |
fresh | configurable for the application of FRESH feature extraction/use cases | aggregationColumns |
nlp | configurable for the application of NLP feature extraction/use cases | w2v |
The JSON configurations are defined in the following form, where value
is the input supplied to AutoML and type
how it is represented in q.
"aggregationColumns":{
"value":"{first cols x}",
"type" :"lambda"
},
"trainTestSplit":{
"value":".automl.utils.ttsNonShuff",
"type" :"symbol"
},
"targetLimit":{
"value":10000,
"type" :"long"
}
The models.json
file specifies the machine-learning models applied when running AutoML. Any model added to the framework must be specified, as per the FAQ, in this file, which has two sections:
- classification
- regression
Example:
"LinearSVC":{
"library":"sklearn",
"module":"svm",
"seed":true,
"type":"binary",
"apply":true
},
"BinaryKeras":{
"library":"keras",
"module":"binary",
"seed":true,
"type":"binary",
"apply":true
},
"Torch":{
"library":"torch",
"module":"NN",
"seed":true,
"type":"binary",
"apply":true
}
The following table outlines the expected inputs for each of the models defined above:
In the example above LinearSVC
, BinaryKeras
and Torch
are the names associated with the models when printing to standard out. In the case of Keras/Torch, this has no physical representation, however when using sklearn this defines the model name to be retrieved. For example, the model defined here is sklearn.svm.LinearSVC
.
Is the library from which the model is generated sklearn
, or keras
, or torch
, or theano
? Defines the logic used for model retrieval.
For sklearn
this defines the module from which the model is retrieved. For keras
, theano
, and torch
however this defines the name of the model being retrieved i.e. in the case of BinaryKeras
above the model to be retrieved must be defined as .automl.keras.binary.x
where x
defines fit
, predict
, or model
definitions within the library.
Is the model to be seeded for reproducibility or not? Models such as simple linear regressors are deterministic and as such do not need to be seeded
Set to binary
or multi
for a classification problem; otherwise reg
for a regression model.
Is this model to be applied or not in a given run? If set to false
the model will be omitted.
This defines the hyperparameters used when running the AutoML such that model optimization is completed using an exhaustive grid search. Some examples:
"AdaBoostRegressor":{
"Parameters":{
"n_estimators":[10,20,30,50,100,250],
"learning_rate":[0.1,0.25,0.5,0.75,0.9,1.0]
},
"meta":{
"typeConvert":["int","float"]
}
},
"RandomForestRegressor":{
"Parameters":{
"n_estimators":[10,20,50,100,250],
"criterion":["mse","mae"],
"min_samples_leaf":[1,2,3,4]
},
"meta":{
"typeConvert":["int","symbol","int"]
}
}
Name defined within models.json
to which the defined hyperparameters are to be applied
Hyperparameters to be applied when running the model optimization: should be an exhaustive list of parameters to be searched
Type conversion of each of the parameters to be searched. For example, for the AdaBoostRegressor
model, the parameter n_estimators
must be cast to an integer, while the learning_rate
is a float.
This defines the hyperparameter search space when applying a random or sobol-sequence-based search during optimization.
Examples:
"KNeighborsRegressor":{
"Parameters":{
"n_neighbors":[2,100],
"weights":["uniform","distance"]
},
"meta":{
"randomType":["uniform","symbol"],
"typeConvert":["long","symbol"]
}
},
"Lasso":{
"Parameters":{
"alpha":[0.1,1.0],
"normalize":[0,1],
"max_iter":[100,1000],
"tol":[0.0001,0.1]
},
"meta":{
"randomType":["uniform","boolean","uniform","uniform"],
"typeConvert":["float","boolean","long","float"]
}
}
Name defined within models.json
to which the defined hyperparameters are to be applied
Specific hyperparameters or range over which hyperparameters to be applied are defined.
Defines the type of randomization to be used within the random search. This can be one of boolean
, uniform
, loguniform
or symbol
, as defined for the Machine Learning Toolkit.
Type conversion of each of the parameters to be searched. For example in the case of the Lass
model, the parameter alpha
must be cast to a float, while the normalize
parameter is a boolean value.
Instead of modifying the above JSON files, you may prefer to modify the system behavior on explicit invocation of the function .automl.fit
.
When calling this function the final argument params
can be configured to take as input any of the parameters in problemParameters
.
Example: a dictionary to set the random seed to a value of 75, testing size to 25 percent of the total dataset, and the cross validation function to .ml.xv.kfSplit
.
q)paramKeys:`seed`testingSize`crossValidationFunction
q)paramVals:(75;0.25;.ml.xv.kfShuff)
q)paramDict:paramKeys!paramVals
q).automl.fit[([]100?1f;100?1f);100?1f;`normal;`reg;paramDict]