Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md
toy-data.zip		toy-data.zip

README.md

Toy-data Tutorial

Where you learn how to train SVM classifier and identify files by their content.

Follow this guide to generate the model files expected by ml-text library.

Prerequisites

OSX setup:

JDK 1.8
python
bash
numpy pip install numpy
libsvm cli for macports: sudo port install libsvm for homebrew: brew install libsvm

Build feature extractor:

cd tools
./gradlew :installDist

Step 1. Collect data

Create an empty directory and extract toy-data.zip there. You should get this folder tree:

toy-data\
  csv\
  json\
  predict\
  yaml\

The names of the sub-folders inside toy-data will be used as classification labels.

Step 2. Design features

The features that we will make are based on keywords that identify each file format.

In your directory, create a file named keywords.yml. This is a list of strings, in YAML format; each string is a single keyword, in quotes.

Paste these keywords to keywords.yml:

- ','
- ":\n"
- '},'

Step 3. Extract features and train the model

Still in your folder, find the path of tools/scripts, and run

tools/scripts/build_model.sh toy-data toy-model keywords.yml

When done, the output should be something like

C=2.0 g=0.25
Accuracy = 80% (8/10) (classification)

Look inside toy-model folder. The files there are what ml-text library expects:

ml.keywords - the copy of our keywords.yml
ml.labels - indexed list of labels; you can edit the names here, but keep the same order
ml.scale - scaling parameters per each feature
ml.model - the trained model, in libsvm format

If you are interested, the raw numeric features extracted from train dataset are in tmp.toy-model/*.csv files. These files then get merged and scaled into tmp.toy-model/train.scale file, from which SVM classifier is trained.

Step 4. Identify files

Now you can identify files with a command

tools/scripts/predict.sh toy-model toy-data/predict/test1.wat

This should give you something like

labels 1 0 2
1 0.685109 0.140864 0.174027
1 json

Which is JSON format with confidence 68.5%

Step 5. Play around

Load more files to toy-data, add more file formats, change keywords, and repeat model generation step. See how the keywords you use affect accuracy of prediction.

When you done training your model on your dataset, put the generated model files into your project's resources folder, where ml-text library will find them.

Well done.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

toy-data-tutorial

toy-data-tutorial

README.md

Toy-data Tutorial

Prerequisites

Step 1. Collect data

Step 2. Design features

Step 3. Extract features and train the model

Step 4. Identify files

Step 5. Play around

Files

toy-data-tutorial

Directory actions

More options

Directory actions

More options

Latest commit

History

toy-data-tutorial

Folders and files

parent directory

README.md

Toy-data Tutorial

Prerequisites

Step 1. Collect data

Step 2. Design features

Step 3. Extract features and train the model

Step 4. Identify files

Step 5. Play around