Production Concern | Key Questions |
---|---|
Data distribution changes | Why are there sudden changes in the values of my features? |
Model ownership in production | Who owns the model in production? The DevOps team? Engineers? Data Scientists? |
Training-serving skew | Why is the model giving poor results in production despite our rigorous testing and validation attempts during development? |
Model/concept drift | Why was my model performing well in production and suddenly the performance dipped over time? |
Black box models | How can I interpret and explain my model’s predictions in line with the business objective and to relevant stakeholders? |
Concerted adversaries | How can I ensure the security of my model? Is my model being attacked? |
Model readiness | How will I compare results from newer version(s) of my model against the in-production version(s)? |
Pipeline health issues | Why does my training pipeline fail when executed? Why does a retraining job take so long to run? |
Underperforming system | Why is the latency of my predictive service very high? Why am I getting vastly varying latencies for my different models? |
Cases of extreme events (outliers) | How will I be able to track the effect and performance of my model in extreme and unplanned situations? |
Data quality issues | How can I ensure the production data is being processed in the same way as the training data was? |
So we have to:
- monitor model performance in production: see how accurate your model's predictions are and whether performance decays over time, which signals that you should re-train it.
- monitor model input/output distribution: see whether the distribution of the input data and features that go into the model has changed, and whether the predicted class distribution has changed over time. Both can be connected to data and concept drift.
- monitor model training and re-training: see learning curves, the distribution of the trained model's predictions, or the confusion matrix during training and re-training.
- monitor model evaluation and testing: log metrics, charts, predictions, and other metadata for your automated evaluation or testing pipelines.
- monitor hardware metrics: see how much CPU/GPU or memory your models use during training and inference.
- monitor CI/CD pipelines for ML: see the evaluations from your CI/CD pipeline jobs and compare them visually. In ML, the metrics often only tell you so much, and someone needs to actually see the results.
Ref:
- A Comprehensive Guide on How to Monitor Your Models in Production
- Best Tools to Do ML Model Monitoring
Data drift, or covariate shift, refers to the phenomenon where the distribution of data inputs that an ML model was trained on differs from the distribution of the data inputs that the model is applied to. This can result in the model becoming less accurate or effective at making predictions or decisions.
Ref: Data Drift vs. Concept Drift: What Is the Difference?
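As a quick illustration of covariate-shift detection (not part of the project code), a two-sample Kolmogorov-Smirnov test can compare a feature's training-time distribution against its production distribution; the feature values and the 0.05 threshold below are made up for the example.

```python
# Illustrative covariate-shift check (not project code): compare a feature's
# training-time distribution with its production distribution using a
# two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=10.0, scale=2.0, size=5_000)  # feature values at training time
current = rng.normal(loc=11.5, scale=2.0, size=5_000)    # feature values in production (shifted)

stat, p_value = ks_2samp(reference, current)
if p_value < 0.05:  # common, but arbitrary, significance threshold
    print(f"Data drift suspected (KS statistic={stat:.3f}, p-value={p_value:.4f})")
else:
    print("No significant drift detected")
```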
Main:
- Service health: monitored with a few standard metrics; this is also a must, since we have to make sure the service works.
- Model health: check how good the provided service is. Evaluation metrics depend on the task: for example, a search/recommendation system requires ranking metrics; regression requires MAE or MAPE; classification requires log loss, precision, or recall.
- Data quality and integrity: check the quality of the data received and the outputs generated, with metrics such as the share of missing values, value counts, and the value range of each column (see the pandas sketch after this list).
- Data and concept drift: check the distributions of inputs and outputs, which may be affected by the real-world environment.
Others:
- Performance by segments: check quality metrics per segment when the data contains diverse objects/categories.
- Bias and fairness
- Outliers
- Explainability
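For the data quality and integrity checks above, here is a minimal pandas sketch; the column names and valid ranges are hypothetical and only illustrate the kind of metrics (share of missing values, value range, row count) that would be tracked.

```python
# Hypothetical data-quality checks in plain pandas: share of missing values,
# value-range violations, and row count for an incoming batch.
import pandas as pd

batch = pd.DataFrame({
    "trip_distance": [1.2, 3.4, None, 250.0],
    "passenger_count": [1, 2, 0, None],
})

quality_metrics = {
    "row_count": len(batch),
    "share_missing_trip_distance": batch["trip_distance"].isna().mean(),
    "share_missing_passenger_count": batch["passenger_count"].isna().mean(),
    # value-range check with an assumed plausible range of [0, 100] miles
    "share_out_of_range_distance": ((batch["trip_distance"] < 0)
                                    | (batch["trip_distance"] > 100)).mean(),
}
print(quality_metrics)
```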
Batch Monitoring:
Online (non-batch) Monitoring:
- Use the service: implement the service as a REST API or a batch pipeline.
- Obtain logs: generate logs as a local log file while the service is being used.
- Calculate metrics based on the logs (see the sketch after this list).
- Load metrics into a database: save the calculated metrics into a PostgreSQL database.
- Build a dashboard: use Grafana to build a dashboard with the data from the PostgreSQL database.
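A rough sketch of the "obtain logs" and "calculate metrics" steps, assuming the service appends one JSON line per prediction to a local file; the log path and the field names (such as `prediction`) are assumptions, not the actual project code.

```python
# Sketch of "obtain logs" + "calculate metrics": the service appends one JSON
# line per prediction to a local file, and a monitoring job aggregates it.
import json
from pathlib import Path

import pandas as pd

LOG_FILE = Path("logs/predictions.jsonl")  # hypothetical log location

records = [json.loads(line) for line in LOG_FILE.read_text().splitlines() if line.strip()]
df = pd.DataFrame(records)  # one row per logged prediction: features + "prediction"

metrics = {
    "n_predictions": len(df),
    "prediction_mean": df["prediction"].mean(),
    "share_missing_feature_values": df.drop(columns=["prediction"]).isna().mean().mean(),
}
print(metrics)  # in the next step these values are loaded into PostgreSQL
```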
Step 1: Build a virtual environment
mkdir taxi_monitoring
cd taxi_monitoring
conda create -n py11 python=3.11
conda activate py11
Step 2: Install the requirements from requirement.txt
pip install -r requirement.txt
Step 3: Set up Docker Compose
- Install Docker Compose with `conda install docker-compose`.
- Write a docker-compose.yml and a grafana_datasources.yaml (under ./config), then build and start the services with `docker-compose up --build`.
- Go to Grafana at `localhost:3000`, log in with username `admin` and password `admin`, then change the password to your own.
Step 4: Go to Adminer at `localhost:8080` and log in with:
System: PostgreSQL
Server: db
Username: postgres
Password: example
Database: test
Make the directories for the project and prepare the environment:
conda activate py11
docker-compose down
mkdir data
mkdir models
Then write the code in baseline_model_nyc_taxi_data.ipynb. See that notebook for details on data drift between the reference data (training data) and the current data (validation data), as well as other metrics; a minimal sketch of the drift report follows.
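A sketch of the kind of drift report the notebook builds, assuming Evidently's 0.4-style `Report` API; the file paths and feature columns below are placeholders, not necessarily the exact ones used in the notebook.

```python
# Sketch of a data drift report, assuming Evidently's 0.4-style API;
# file paths and feature columns are placeholders.
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference_data = pd.read_parquet("data/green_tripdata_2022-01.parquet")  # training data
current_data = pd.read_parquet("data/green_tripdata_2022-02.parquet")    # validation data

column_mapping = ColumnMapping(
    numerical_features=["trip_distance", "fare_amount"],        # illustrative columns
    categorical_features=["PULocationID", "DOLocationID"],
    target=None,
)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_data, current_data=current_data,
           column_mapping=column_mapping)
report.save_html("data_drift_report.html")  # open in a browser to inspect the results
```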
Step 1: Write the code for the dummy metrics calculation in dummy_metrics_calculation.py; a sketch follows.
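A minimal sketch of what such a dummy metrics job can look like, assuming psycopg 3 and the database credentials from the Adminer step above; the table name and columns are illustrative.

```python
# Sketch of a dummy metrics job: insert a random value into PostgreSQL every
# few seconds so the database -> Grafana path can be tested end to end.
import datetime
import random
import time

import psycopg  # psycopg 3

CREATE_TABLE = """
DROP TABLE IF EXISTS dummy_metrics;
CREATE TABLE dummy_metrics (
    timestamp TIMESTAMP,
    value FLOAT
);
"""

with psycopg.connect("host=localhost port=5432 dbname=test user=postgres password=example",
                     autocommit=True) as conn:
    conn.execute(CREATE_TABLE)
    for _ in range(10):
        conn.execute(
            "INSERT INTO dummy_metrics (timestamp, value) VALUES (%s, %s)",
            (datetime.datetime.now(), random.random()),
        )
        time.sleep(3)  # mimic a service emitting metrics over time
```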
Step 2: Build the services with `docker-compose up --build`.
Step 3: Check the SQL tables in the browser at `localhost:8080`.
Step 4: Connect to PostgreSQL (System: PostgreSQL); the login info is in grafana_datasources.yaml.
Here is what can be seen from SQL.
Step 5: Create a new dashboard and see the results as follows
Step 1: Compared with dummy_metrics_calculation.py, we load the reference data, the model, and the current data to compare, plus an Evidently report for the metrics calculation; we also change calculate_metrics_postgresql to insert the necessary metrics into PostgreSQL. All changes can be seen in evidently_report_example.py; a sketch of the metric extraction follows below.
Run `python evidently_metrics_calculation.py` and log in at `localhost:8080` to see if it works.
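For reference, here is one possible shape of the changed calculate_metrics_postgresql, assuming Evidently's 0.4-style metrics (ColumnDriftMetric, DatasetDriftMetric, DatasetMissingValuesMetric); the function signature, table name, and columns are assumptions and may differ from the actual script.

```python
# Possible shape of calculate_metrics_postgresql, assuming Evidently's
# 0.4-style metrics; signature, table, and columns are assumptions.
import datetime

from evidently.report import Report
from evidently.metrics import (
    ColumnDriftMetric,
    DatasetDriftMetric,
    DatasetMissingValuesMetric,
)


def calculate_metrics_postgresql(cursor, reference_data, current_data, column_mapping):
    report = Report(metrics=[
        ColumnDriftMetric(column_name="prediction"),
        DatasetDriftMetric(),
        DatasetMissingValuesMetric(),
    ])
    report.run(reference_data=reference_data, current_data=current_data,
               column_mapping=column_mapping)
    result = report.as_dict()

    # the result order follows the metrics list passed to Report above
    prediction_drift = result["metrics"][0]["result"]["drift_score"]
    num_drifted_columns = result["metrics"][1]["result"]["number_of_drifted_columns"]
    share_missing_values = result["metrics"][2]["result"]["current"]["share_of_missing_values"]

    cursor.execute(
        "INSERT INTO evidently_metrics "
        "(timestamp, prediction_drift, num_drifted_columns, share_missing_values) "
        "VALUES (%s, %s, %s, %s)",
        (datetime.datetime.now(), prediction_drift, num_drifted_columns, share_missing_values),
    )
```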
Step 2: Build the Prefect pipeline (add @task/@flow decorators), for example:
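A minimal sketch of that wrapping, assuming Prefect 2.x; the function names and bodies are placeholders rather than the actual script.

```python
# Sketch of the Prefect wrapping, assuming Prefect 2.x: existing functions get
# @task, and the orchestrating loop becomes a @flow. Names/bodies are placeholders.
from prefect import flow, task


@task
def prep_db():
    """Create the PostgreSQL metrics table if it does not exist."""
    ...


@task
def calculate_metrics_postgresql(day: int):
    """Run the Evidently report for one daily batch and insert the metrics."""
    ...


@flow
def batch_monitoring_backfill(days: int = 27):
    prep_db()
    for day in range(days):
        calculate_metrics_postgresql(day)


if __name__ == "__main__":
    batch_monitoring_backfill()
```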
Step 3: Start the Prefect server and run the script, with `prefect server start` and `python evidently_metrics_calculation.py` in different terminals. The results should be:
Step 4: Access Grafana at `localhost:3000` and create the new dashboard.
Step 1: Create a grafana_dashboards.yaml file under ./config
Step 2: Add the following volume mounts to the docker-compose file so that Grafana picks up the dashboard configuration.
- ./config/grafana_dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml:ro
- ./dashboards:/opt/grafana/dashboards
Step 3: Create a new dashboards directory and a data_drift.json for Grafana; copy the JSON data from the dashboard generated in the last step.
Step 4: Restart the services with `docker-compose down` and `docker-compose up`, then go to Grafana to check the dashboards we saved.
grafana_1 | Error: ✗ Datasource provisioning error: read /etc/grafana/provisioning/datasources/datasource.yaml: is a directory
Solution: run `docker-compose rm`, then restart with `docker-compose up --build`.
OSError: [Errno 30] Read-only file system: './data'
This happens because the relative path cannot be resolved.
Solutions:
- Change the path './data' to an absolute path
- Or change the Jupyter Notebook config file and set the default notebook directory to your project path:
jupyter notebook --generate-config
cd /Users/amberm (YOUR_USER_NAME)/.jupyter
vim jupyter_notebook_config.py
## The directory to use for notebooks and kernels
# Default: ''
c.NotebookApp.notebook_dir = 'YOUR_PRO_PATH'
- The evidently package needs to be installed
Solution:
conda install -c conda-forge evidently