Production Concern | Key Questions |
---|---|
Data distribution changes | Why are there sudden changes in the values of my features? |
Model ownership in production | Who owns the model in production? The DevOps team? Engineers? Data Scientists? |
Training-serving skew | Why is the model giving poor results in production despite our rigorous testing and validation attempts during development? |
Model/concept drift | Why was my model performing well in production and suddenly the performance dipped over time? |
Black box models | How can I interpret and explain my model’s predictions in line with the business objective and to relevant stakeholders? |
Concerted adversaries | How can I ensure the security of my model? Is my model being attacked? |
Model readiness | How will I compare results from newer version(s) of my model against the in-production version(s)? |
Pipeline health issues | Why does my training pipeline fail when executed? Why does a retraining job take so long to run? |
Underperforming system | Why is the latency of my predictive service very high? Why am I getting vastly varying latencies for my different models? |
Cases of extreme events (outliers) | How will I be able to track the effect and performance of my model in extreme and unplanned situations? |
Data quality issues | How can I ensure the production data is being processed in the same way as the training data was? |
So we have to:
- monitor model performance in production: see how accurate your model's predictions are and whether performance decays over time, which signals that you should re-train it.
- monitor model input/output distribution: see whether the distribution of the input data and features that go into the model has changed, and whether the predicted class distribution has changed over time. Both can be connected to data and concept drift.
- monitor model training and re-training: see learning curves, the distribution of the trained model's predictions, or the confusion matrix during training and re-training.
- monitor model evaluation and testing: log metrics, charts, predictions, and other metadata for your automated evaluation or testing pipelines.
- monitor hardware metrics: see how much CPU/GPU or memory your models use during training and inference.
- monitor CI/CD pipelines for ML: see the evaluations from your CI/CD pipeline jobs and compare them visually. In ML, the metrics often only tell you so much, and someone needs to actually see the results.
Ref:
- A Comprehensive Guide on How to Monitor Your Models in Production
- Best Tools to Do ML Model Monitoring
Data drift, or covariate shift, refers to the phenomenon where the distribution of data inputs that an ML model was trained on differs from the distribution of the data inputs that the model is applied to. This can result in the model becoming less accurate or effective at making predictions or decisions.
Ref: Data Drift vs. Concept Drift: What Is the Difference?
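As a quick illustration of covariate-shift detection (not part of the project code), a two-sample Kolmogorov-Smirnov test can compare a feature's training-time distribution against its production distribution; the feature values and the 0.05 threshold below are made up for the example.

```python
# Illustrative covariate-shift check (not project code): compare a feature's
# training-time distribution with its production distribution using a
# two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=10.0, scale=2.0, size=5_000)  # feature values at training time
current = rng.normal(loc=11.5, scale=2.0, size=5_000)    # feature values in production (shifted)

stat, p_value = ks_2samp(reference, current)
if p_value < 0.05:  # common, but arbitrary, significance threshold
    print(f"Data drift suspected (KS statistic={stat:.3f}, p-value={p_value:.4f})")
else:
    print("No significant drift detected")
```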
Main:
- Service health: monitored with a few standard metrics; this is also a must, since we have to make sure the service works.
- Model health: check how good the provided service is. Evaluation metrics depend on the task: for example, a search/recommendation system requires ranking metrics; regression requires MAE or MAPE; classification requires log loss, precision, or recall.
- Data quality and integrity: check the quality of the data received and the outputs generated, with metrics such as the share of missing values, value counts, and the value range of each column (see the pandas sketch after this list).
- Data and concept drift: check the distributions of inputs and outputs, which may be affected by the real-world environment.
Others:
- Performance by segments: check quality metrics per segment when the data contains diverse objects/categories.
- Bias and fairness
- Outliers
- Explainability
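For the data quality and integrity checks above, here is a minimal pandas sketch; the column names and valid ranges are hypothetical and only illustrate the kind of metrics (share of missing values, value range, row count) that would be tracked.

```python
# Hypothetical data-quality checks in plain pandas: share of missing values,
# value-range violations, and row count for an incoming batch.
import pandas as pd

batch = pd.DataFrame({
    "trip_distance": [1.2, 3.4, None, 250.0],
    "passenger_count": [1, 2, 0, None],
})

quality_metrics = {
    "row_count": len(batch),
    "share_missing_trip_distance": batch["trip_distance"].isna().mean(),
    "share_missing_passenger_count": batch["passenger_count"].isna().mean(),
    # value-range check with an assumed plausible range of [0, 100] miles
    "share_out_of_range_distance": ((batch["trip_distance"] < 0)
                                    | (batch["trip_distance"] > 100)).mean(),
}
print(quality_metrics)
```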
Batch Monitoring:
Online (non-batch) Monitoring:
- Use the service: implement the service as a REST API or a batch pipeline.
- Obtain logs: generate logs as a local log file while the service is being used.
- Calculate metrics based on the logs (see the sketch after this list).
- Load metrics into a database: save the calculated metrics into a PostgreSQL database.
- Build a dashboard: use Grafana to build a dashboard with the data from the PostgreSQL database.
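A rough sketch of the "obtain logs" and "calculate metrics" steps, assuming the service appends one JSON line per prediction to a local file; the log path and the field names (such as `prediction`) are assumptions, not the actual project code.

```python
# Sketch of "obtain logs" + "calculate metrics": the service appends one JSON
# line per prediction to a local file, and a monitoring job aggregates it.
import json
from pathlib import Path

import pandas as pd

LOG_FILE = Path("logs/predictions.jsonl")  # hypothetical log location

records = [json.loads(line) for line in LOG_FILE.read_text().splitlines() if line.strip()]
df = pd.DataFrame(records)  # one row per logged prediction: features + "prediction"

metrics = {
    "n_predictions": len(df),
    "prediction_mean": df["prediction"].mean(),
    "share_missing_feature_values": df.drop(columns=["prediction"]).isna().mean().mean(),
}
print(metrics)  # in the next step these values are loaded into PostgreSQL
```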
Step 1: Build a virtual environment
mkdir taxi_monitoring
cd taxi_monitoring
conda create -n py11 python=3.11
conda activate py11
Step 2: Install the requirements from requirement.txt
pip install -r requirement.txt
Step 3: Set up Docker Compose
- Install Docker Compose with `conda install docker-compose`.
- Write a docker-compose.yml and a grafana_datasources.yaml (under ./config), then build and start the services with `docker-compose up --build`.
- Go to Grafana at `localhost:3000`, log in with username `admin` and password `admin`, then change the password to your own.
Step 4: Go to Adminer at `localhost:8080` and log in with:
System: PostgreSQL
Server: db
Username: postgres
Password: example
Database: test
Make the directories for the project and prepare the environment:
conda activate py11
docker-compose down
mkdir data
mkdir models
Then write the code in baseline_model_nyc_taxi_data.ipynb. See that notebook for details on data drift between the reference data (training data) and the current data (validation data), as well as other metrics; a minimal sketch of the drift report follows.
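A sketch of the kind of drift report the notebook builds, assuming Evidently's 0.4-style `Report` API; the file paths and feature columns below are placeholders, not necessarily the exact ones used in the notebook.

```python
# Sketch of a data drift report, assuming Evidently's 0.4-style API;
# file paths and feature columns are placeholders.
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference_data = pd.read_parquet("data/green_tripdata_2022-01.parquet")  # training data
current_data = pd.read_parquet("data/green_tripdata_2022-02.parquet")    # validation data

column_mapping = ColumnMapping(
    numerical_features=["trip_distance", "fare_amount"],        # illustrative columns
    categorical_features=["PULocationID", "DOLocationID"],
    target=None,
)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_data, current_data=current_data,
           column_mapping=column_mapping)
report.save_html("data_drift_report.html")  # open in a browser to inspect the results
```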
Step 1: Write the code for the dummy metrics calculation in dummy_metrics_calculation.py; a sketch follows.
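A minimal sketch of what such a dummy metrics job can look like, assuming psycopg 3 and the database credentials from the Adminer step above; the table name and columns are illustrative.

```python
# Sketch of a dummy metrics job: insert a random value into PostgreSQL every
# few seconds so the database -> Grafana path can be tested end to end.
import datetime
import random
import time

import psycopg  # psycopg 3

CREATE_TABLE = """
DROP TABLE IF EXISTS dummy_metrics;
CREATE TABLE dummy_metrics (
    timestamp TIMESTAMP,
    value FLOAT
);
"""

with psycopg.connect("host=localhost port=5432 dbname=test user=postgres password=example",
                     autocommit=True) as conn:
    conn.execute(CREATE_TABLE)
    for _ in range(10):
        conn.execute(
            "INSERT INTO dummy_metrics (timestamp, value) VALUES (%s, %s)",
            (datetime.datetime.now(), random.random()),
        )
        time.sleep(3)  # mimic a service emitting metrics over time
```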
Step 2: Build the services with `docker-compose up --build`.
Step 3: Check the SQL tables in the browser at `localhost:8080`.
Step 4: Connect to PostgreSQL (System: PostgreSQL); the login info is in grafana_datasources.yaml.
Here is what can be seen from SQL.
Step 5: Create a new dashboard and see the results as follows
Step 1: Compared with dummy_metrics_calculation.py, we load the reference data, the model, and the current data to compare, plus an Evidently report for the metrics calculation; we also change calculate_metrics_postgresql to insert the necessary metrics into PostgreSQL. All changes can be seen in evidently_report_example.py; a sketch of the metric extraction follows below.
Run `python evidently_metrics_calculation.py` and log in at `localhost:8080` to see if it works.
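For reference, here is one possible shape of the changed calculate_metrics_postgresql, assuming Evidently's 0.4-style metrics (ColumnDriftMetric, DatasetDriftMetric, DatasetMissingValuesMetric); the function signature, table name, and columns are assumptions and may differ from the actual script.

```python
# Possible shape of calculate_metrics_postgresql, assuming Evidently's
# 0.4-style metrics; signature, table, and columns are assumptions.
import datetime

from evidently.report import Report
from evidently.metrics import (
    ColumnDriftMetric,
    DatasetDriftMetric,
    DatasetMissingValuesMetric,
)


def calculate_metrics_postgresql(cursor, reference_data, current_data, column_mapping):
    report = Report(metrics=[
        ColumnDriftMetric(column_name="prediction"),
        DatasetDriftMetric(),
        DatasetMissingValuesMetric(),
    ])
    report.run(reference_data=reference_data, current_data=current_data,
               column_mapping=column_mapping)
    result = report.as_dict()

    # the result order follows the metrics list passed to Report above
    prediction_drift = result["metrics"][0]["result"]["drift_score"]
    num_drifted_columns = result["metrics"][1]["result"]["number_of_drifted_columns"]
    share_missing_values = result["metrics"][2]["result"]["current"]["share_of_missing_values"]

    cursor.execute(
        "INSERT INTO evidently_metrics "
        "(timestamp, prediction_drift, num_drifted_columns, share_missing_values) "
        "VALUES (%s, %s, %s, %s)",
        (datetime.datetime.now(), prediction_drift, num_drifted_columns, share_missing_values),
    )
```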
Step 2: Build the Prefect pipeline (add @task/@flow decorators), for example:
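A minimal sketch of that wrapping, assuming Prefect 2.x; the function names and bodies are placeholders rather than the actual script.

```python
# Sketch of the Prefect wrapping, assuming Prefect 2.x: existing functions get
# @task, and the orchestrating loop becomes a @flow. Names/bodies are placeholders.
from prefect import flow, task


@task
def prep_db():
    """Create the PostgreSQL metrics table if it does not exist."""
    ...


@task
def calculate_metrics_postgresql(day: int):
    """Run the Evidently report for one daily batch and insert the metrics."""
    ...


@flow
def batch_monitoring_backfill(days: int = 27):
    prep_db()
    for day in range(days):
        calculate_metrics_postgresql(day)


if __name__ == "__main__":
    batch_monitoring_backfill()
```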
Step 3: Start the Prefect server and run the script, with `prefect server start` and `python evidently_metrics_calculation.py` in different terminals. The results should be:
Step 4: Access Grafana at `localhost:3000` and create the new dashboard.
Step 1: Create a grafana_dashboards.yaml file under ./config
Step 2: Add the following volume mounts to the docker-compose file so that Grafana picks up the dashboard configuration.
- ./config/grafana_dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml:ro
- ./dashboards:/opt/grafana/dashboards
Step 3: Create a new dashboards directory and a data_drift.json for Grafana; copy the JSON data from the dashboard generated in the last step.
Step 4: Restart the services with `docker-compose down` and `docker-compose up`, then go to Grafana to check the dashboards we saved.
grafana_1 | Error: ✗ Datasource provisioning error: read /etc/grafana/provisioning/datasources/datasource.yaml: is a directory
Solution: run `docker-compose rm`, then restart with `docker-compose up --build`.
OSError: [Errno 30] Read-only file system: './data'
This happens because the relative path cannot be resolved.
Solutions:
- Change the path './data' to an absolute path
- Or change the Jupyter Notebook config file and set the default notebook directory to your project path:
jupyter notebook --generate-config
cd /Users/amberm (YOUR_USER_NAME)/.jupyter
vim jupyter_notebook_config.py
## The directory to use for notebooks and kernels
# Default: ''
c.NotebookApp.notebook_dir = 'YOUR_PRO_PATH'
- The evidently package needs to be installed
Solution:
conda install -c conda-forge evidently