This repository contains the Exax import script for the Backblaze dataset. The script imports the data, types the columns, and fixes some problems with model naming.
It also contains a script to compute Backblaze's AFR (Annual Failure Rate) metric.
This script is mentioned in the PyData Global 2021 presentation "Computations as Assets - a New Approach to Reproducibility and Transparency" by Anders Berkeman, Carl Drougge, and Sofia Hörberg.
git clone https://github.com/exaxorg/import_backblaze
cd import_backblaze
You might want to have a look at the file accelerator.conf, and set slices to the number of CPUs you want to use.
cd data
# backblaze data
wget https://f001.backblazeb2.com/file/Backblaze-Hard-Drive-Data/data_Q1_2021.zip
...
You'll find all the data files at Backblaze. At least one file is needed to run the script. (Pick "data_Q1_2021.zip" to run the AFR calculations below!)
cd import_backblaze
python3 -m venv venv
source venv/bin/activate
pip install accelerator
ax server
cd import_backblaze
source venv/bin/activate
# Run this once to import the data
ax run import
# While waiting, you can press CTRL-T for a more verbose progress indication.
# Test by calculating the AFR values
ax run afr
On a three-year-old Lenovo T490 laptop, importing the latest zip-file takes less than 6 minutes, and calculating the AFR takes 5 seconds.
Here's a selected and sorted pick from the output:
model #drives #days #fails AFR
HGST_HMS5C4040ALE640 3168 281692 5 0.65%
HGST_HMS5C4040BLE640 12748 1146496 10 0.32%
HGST_HUH728080ALE600 1081 97027 4 1.50%
HGST_HUH721212ALE600 2605 233948 5 0.78%
HGST_HUH721212ALE604 5691 308793 6 0.71%
HGST_HUH721212ALN604 10834 974310 9 0.34%
ST4000DM000 18941 1701967 59 1.27%
ST6000DX000 886 79740 0 0.00%
ST8000DM002 9770 878106 26 1.08%
ST8000NM0055 14450 1297674 31 0.87%
ST10000NM0086 1206 108057 6 2.03%
ST12000NM0007 23036 1732307 66 1.39%
ST12000NM0008 20132 1764318 41 0.85%
ST12000NM001G 9044 704446 12 0.62%
ST14000NM001G 5990 538401 13 0.88%
ST14000NM0138 1684 135157 9 2.43%
ST16000NM001G 2460 54177 1 0.67%
TOSHIBA_MD04ABA400V 99 8910 0 0.00%
TOSHIBA_MG07ACA14TA 27372 2165421 34 0.57%
TOSHIBA_MG07ACA14TEY 406 33831 1 1.08%
TOSHIBA_MG08ACA16TEY 1014 91260 0 0.00%
WDC_WUH721414ALE6L4 8410 640767 10 0.57%
WDC_WUH721816ALE6L0 520 4680 0 0.00%
...
The AFR, drive-day, and failure values are the same as those published by Backblaze, but there are differences in the drive-count column.
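The AFR column follows directly from the other two: failures divided by drive-years (drive days over 365), expressed as a percentage. Here is a minimal standalone sketch of that arithmetic (not the repository's afr method itself), checked against a row from the table above:

```python
def afr(drive_days, failures):
    """Annualized failure rate in percent: failures per drive-year."""
    drive_years = drive_days / 365
    return 100 * failures / drive_years

# ST4000DM000 from the table: 1701967 drive days, 59 failures
print('%.2f%%' % afr(1701967, 59))  # → 1.27%
```

Plugging in any other row of the table reproduces its AFR column to the printed precision.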
Exax runs a simple web server. Set a port in accelerator.conf like this

board listen: localhost:2020

then restart the server and point a browser to http://localhost:2020. (Select another port or socket in the accelerator.conf file if this one is already in use.)
The Backblaze dataset is of high quality. All files in the collection use the same file format, column names, and header. Exax can import all data directly from the zip-archives, which simplifies things a lot. (Each zip-archive contains two hidden directories generated by OSX, but they can be filtered out using an option to the import function.)
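For illustration, here is how such OSX debris could be skipped when listing archive members. This is a standalone sketch using Python's zipfile module, not the Accelerator's actual import option; the function name is made up for the example:

```python
import zipfile

def csv_members(zip_path):
    """Yield the CSV data files in an archive, skipping the hidden
    __MACOSX/ directories and ._ resource-fork files that OSX adds
    when creating a zip."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.startswith('__MACOSX/'):
                continue  # OSX metadata directory
            if name.rsplit('/', 1)[-1].startswith('._'):
                continue  # OSX resource fork
            if name.endswith('.csv'):
                yield name
```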
There are a few anomalies in the data. The capacity_bytes column cannot always be trusted. Sometimes it contains huge numbers, sometimes it is negative (!). The command below lists all columns in the import_type dataset along with minimum and maximum values:
$ ax ds -c :import_type: | head -11
import-3277/default
Parent: import-2891
Method: modelcleaner
Previous: import-3276
Columns:
capacity_bytes int64 [-9116022715867848704, 600332565813390450]
cleanmodel ascii
date date [ 2013-04-10, 2020-12-31]
failure bool [ False, True]
model ascii
serial_number ascii
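One plausible way to work around the bogus readings (this is a hedged sketch, not what the import script does) is to take, per model, the most common positive capacity and ignore the rest:

```python
from collections import Counter

def plausible_capacity(values):
    """Return the most frequently reported positive capacity,
    ignoring negative and otherwise bogus readings; None if no
    positive reading exists."""
    counts = Counter(v for v in values if v > 0)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Two sane readings for a 4 TB drive, one corrupt one:
readings = [4000787030016, 4000787030016, -9116022715867848704]
print(plausible_capacity(readings))  # → 4000787030016
```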
Another thing is the model column. We've seen two issues: the model WDC WUH721414ALE6L4 appears with both one and two spaces in the string, and there is also a model named 00MD00, which is clearly incorrect.
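The whitespace issue can be fixed by normalizing the model strings; a minimal sketch of such a cleanup (one way the underscore-joined names in the output above could be produced; the repository's modelcleaner method may differ):

```python
def clean_model(model):
    """Collapse runs of whitespace and join with underscores, so
    'WDC WUH721414ALE6L4' and 'WDC  WUH721414ALE6L4' (two spaces)
    map to the same name."""
    return '_'.join(model.split())

print(clean_model('WDC  WUH721414ALE6L4'))  # → WDC_WUH721414ALE6L4
```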
The scripts are built on the Accelerator. This means that
- processing is carried out in parallel, where possible,
- the project is completely traceable and reproducible, and
- everything is written in Python.
The Accelerator is an open source (Apache V2) project. See https://exax.org for more information.