Configuring AutoML

While AutoML provides a solid foundation for generating machine-learning models, you may wish to modify or extend the capabilities of the framework to suit a specific use-case. There are two ways:

JSON files that allow you to add custom scoring functions, add or remove models, update hyperparameters and change general configuration of an autoML run
Within a q session you can update general configuration by passing a dictionary containing suitable key-value pairs as input.

JSON configuration files

For ease of use, v0.3.0 of the framework moves all configuration definitions to JSON file format insead of the earlier q configuration.

There are at present five files in this format.

Names in the table below are paths relative to the code/customization directory of the repository.

folder	file	file description
scoring	scoringFunctions.json	Definitions of all the scoring functions which can be used in optimization and if the optimal model requires ascending/descending data.
configuration	default.json	Default values for all the modifiable parameters outlined within Advanced parameter modifications and all information needed when running AutoML from command line only.
models/modelConfig	models.json	Definition of all models supported for classification and regression tasks.
hyperParameters	gsHyperparameters.json	Definition of hyperparameter sets used by each model when using an exhaustive search method.
hyperParameters	rsHyperparameters.json	Definition of hyperparameter sets used by each model when using a random/pseudo-random search method.

The following is a guide to a representative subset of supported changes to each file.

Scoring functions

The scoring functions JSON file, scoringFunctions.json contains a set of scoring functions defined as follows

".ml.accuracy":{
  "orderFunc":"desc"
  },
".ml.mse":{
  "orderFunc":"asc"
  },

Above, a mapping between function .ml.accuracy and the expected ordering of results that finds the best model: in this case descending, as the best model is the one returning the largest result. For function .ml.mse the best model is found when the mean squared error is minimized; hence the ordering function is asc.

You are unlikely to need to change the currently-defined functions. However to use a custom model optimization function, it must be defined in this file.

Default configuration

The default configuration JSON file default.json is used in two contexts:

As the source of default parameters for .automl.fit
As the reference file for generating custom configuration files, which can be used in two settings:
1. For running the entirety of the command-line interface logic, including instructions for data retrieval.
2. To leave the default configuration file untouched and use custom configurations for domain-specific problems in a multi-user environment.

Major sections of the JSON file:

section	description
`problemDetails`	Required when using the command-line interface to run the entirety of the `.automl.fit` function: information about the problem being solved required to generate a full run.
`retrievalMethods`	Required when using the command-line interface to run the entirety of the `.automl.fit` function: instructions for retrieving the target vector and feature data using various methods.
`problemParameters`	Defines default parameters to be used when a configuration file of this type is loaded. Problem type-specific parameters and parameters generally applicable are separated to avoid confusion.

`problemDetails`

Example:

"problemDetails":{
  "featureExtractionType":"",
  "problemType"          :"",
  "modelName"            :"",
  "dataRetrievalMethod"  :{
    "featureData":"",
    "targetData" :""
  }

Parameters:

key	value	options
featureExtractionType	method for generating features	normal \| fresh \| nlp
problemType	form of problem	class \| reg
modelName	unique name for model (optional)
dataRetrievalMethod	how to retrieve feature and target data	ipc \| csv \| binary

⚠️ Command line only

The parameters above (apart from modelName) can be invoked only when running the entirety of the AutoML framework from the command line. These entries are ignored when running from q directly, or updating the default configuration on the command line.

`retrievalMethods`

Structure:

"ipc":{
  "featureData":{
    "port"  :"",
    "select":""
  },
  "targetData":{
    "port"  :"",
    "select":"",
    "targetColumn":""
  }
},
"csv":{
  "featureData":{
    "schema"   :"",
    "separator":"",
    "directory":"",
    "fileName" :""
  },
  "targetData":{
    "schema"      :"",
    "separator"   :"",
    "directory"   :"",
    "fileName"    :"",
    "targetColumn":""
  }
},
"binary":{
  "featureData":{
    "directory":"",
    "fileName" :""
  },
  "targetData":{
    "directory"   :"",
    "fileName"    :"",
    "targetColumn":""
  }
}

ipc: Retrieve feature and target data via localhost IPC

key	type	value	example
`port`	long	localhost port from which to retrieve the dataset	5000
`select`	string	q expression to evaluate on the port to retrieve data	`select from tab where x<100`
`targetColumn`	string	when feature and target datasets are from the same port and `select` expression, retrieve the data once and select this target `column`	`target`

csv: Retrieve feature and target data from a CSV file

key	type	value	example/s
`schema`	string	schema for CSV data from disk, as per Tok	`"SPJJFFI"`
`separator`	char	CSV field separator	`","` \| `"/\t"`
`directory`	string	filepath to the CSV’s parent directory, relative to the directory from which AutoML was loaded	`"testDirectory/test1"`
`fileName`	string	name of the CSV	`"testFile.csv"`
`targetColumn`	string	when feature and target datasets are from the same port and `select` expression, retrieve the data once and select this target column	`target`

binary: Retrieve feature and target data from a kdb+ binary file

key	type	value	example
`directory`	string	filepath to the binary’s parent directory, relative to the directory from which AutoML was loaded	`"testDirectory/test1"`
`fileName`	string	name of the kdb+ binary file	`"testFile.csv"`
`targetColumn`	string	when feature and target datasets are from the same port and `select` expression, retrieve the data once and select this target column	`target`

`problemParameters`

The problemParameters section defines default parameters for AutoML, as in the Advanced parameter modifications section.

section	parameters	examples
general	applicable for all use cases	seed testingSize
normal	configurable for the application of normal feature extraction/use cases	trainTestSplit functions
fresh	configurable for the application of FRESH feature extraction/use cases	aggregationColumns
nlp	configurable for the application of NLP feature extraction/use cases	w2v

The JSON configurations are defined in the following form, where value is the input supplied to AutoML and type how it is represented in q.

"aggregationColumns":{
  "value":"{first cols x}",
  "type" :"lambda"
},
"trainTestSplit":{
  "value":".automl.utils.ttsNonShuff",
  "type" :"symbol"
},
"targetLimit":{
  "value":10000,
  "type" :"long"
}

Models

The models.json file specifies the machine-learning models applied when running AutoML. Any model added to the framework must be specified, as per the FAQ, in this file, which has two sections:

classification
regression

Example:

"LinearSVC":{
  "library":"sklearn",
  "module":"svm",
  "seed":true,
  "type":"binary",
  "apply":true
},
"BinaryKeras":{
  "library":"keras",
  "module":"binary",
  "seed":true,
  "type":"binary",
  "apply":true
},
"Torch":{
  "library":"torch",
  "module":"NN",
  "seed":true,
  "type":"binary",
  "apply":true
}

The following table outlines the expected inputs for each of the models defined above:

Model name

In the example above LinearSVC, BinaryKeras and Torch are the names associated with the models when printing to standard out. In the case of Keras/Torch, this has no physical representation, however when using sklearn this defines the model name to be retrieved. For example, the model defined here is sklearn.svm.LinearSVC.

`library`

Is the library from which the model is generated sklearn, or keras, or torch, or theano? Defines the logic used for model retrieval.

`module`

For sklearn this defines the module from which the model is retrieved. For keras, theano, and torch however this defines the name of the model being retrieved i.e. in the case of BinaryKeras above the model to be retrieved must be defined as .automl.keras.binary.x where x defines fit, predict, or model definitions within the library.

`seed`

Is the model to be seeded for reproducibility or not? Models such as simple linear regressors are deterministic and as such do not need to be seeded

`type`

Set to binary or multi for a classification problem; otherwise reg for a regression model.

`apply`

Is this model to be applied or not in a given run? If set to false the model will be omitted.

Grid search parameters

This defines the hyperparameters used when running the AutoML such that model optimization is completed using an exhaustive grid search. Some examples:

"AdaBoostRegressor":{
  "Parameters":{
    "n_estimators":[10,20,30,50,100,250],
    "learning_rate":[0.1,0.25,0.5,0.75,0.9,1.0]
  },
  "meta":{
    "typeConvert":["int","float"]
  }
},
"RandomForestRegressor":{
  "Parameters":{
    "n_estimators":[10,20,50,100,250],
    "criterion":["mse","mae"],
    "min_samples_leaf":[1,2,3,4]
  },
  "meta":{
   "typeConvert":["int","symbol","int"]
  }
}

Model name

Name defined within models.json to which the defined hyperparameters are to be applied

`Parameters`

Hyperparameters to be applied when running the model optimization: should be an exhaustive list of parameters to be searched

`meta` -> `typeConvert`

Type conversion of each of the parameters to be searched. For example, for the AdaBoostRegressor model, the parameter n_estimators must be cast to an integer, while the learning_rate is a float.

Random search parameters

This defines the hyperparameter search space when applying a random or sobol-sequence-based search during optimization.

Examples:

"KNeighborsRegressor":{
  "Parameters":{
    "n_neighbors":[2,100],
    "weights":["uniform","distance"]
  },
  "meta":{
    "randomType":["uniform","symbol"],
    "typeConvert":["long","symbol"]
  }
},
"Lasso":{
  "Parameters":{
    "alpha":[0.1,1.0],
    "normalize":[0,1],
    "max_iter":[100,1000],
    "tol":[0.0001,0.1]
  },
  "meta":{
    "randomType":["uniform","boolean","uniform","uniform"],
    "typeConvert":["float","boolean","long","float"]
  }
}

Model name

Name defined within models.json to which the defined hyperparameters are to be applied

`Parameters`

Specific hyperparameters or range over which hyperparameters to be applied are defined.

`meta` -> `randomType`

Defines the type of randomization to be used within the random search. This can be one of boolean, uniform, loguniform or symbol, as defined for the Machine Learning Toolkit.

`meta` -> `typeConvert`

Type conversion of each of the parameters to be searched. For example in the case of the Lass model, the parameter alpha must be cast to a float, while the normalize parameter is a boolean value.

In-process configuration

Instead of modifying the above JSON files, you may prefer to modify the system behavior on explicit invocation of the function .automl.fit. When calling this function the final argument params can be configured to take as input any of the parameters in problemParameters.

Example: a dictionary to set the random seed to a value of 75, testing size to 25 percent of the total dataset, and the cross validation function to .ml.xv.kfSplit.

q)paramKeys:`seed`testingSize`crossValidationFunction
q)paramVals:(75;0.25;.ml.xv.kfShuff)
q)paramDict:paramKeys!paramVals
q).automl.fit[([]100?1f;100?1f);100?1f;`normal;`reg;paramDict]

👉 Advanced options
👉 .automl.fit

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

config.md

config.md

Configuring AutoML

JSON configuration files

Scoring functions

Default configuration

`problemDetails`

`retrievalMethods`

`problemParameters`

Models

Model name

`library`

`module`

`seed`

`type`

`apply`

Grid search parameters

Model name

`Parameters`

`meta` -> `typeConvert`

Random search parameters

Model name

`Parameters`

`meta` -> `randomType`

`meta` -> `typeConvert`

In-process configuration

Files

config.md

Latest commit

History

config.md

File metadata and controls

Configuring AutoML

JSON configuration files

Scoring functions

Default configuration

problemDetails

retrievalMethods

problemParameters

Models

Model name

library

module

seed

type

apply

Grid search parameters

Model name

Parameters

meta -> typeConvert

Random search parameters

Model name

Parameters

meta -> randomType

meta -> typeConvert

In-process configuration

`problemDetails`

`retrievalMethods`

`problemParameters`

`library`

`module`

`seed`

`type`

`apply`

`Parameters`

`meta` -> `typeConvert`

`Parameters`

`meta` -> `randomType`

`meta` -> `typeConvert`