
Poor performance in Sequential Tagging #4


Closed

bratao opened this issue Feb 11, 2016 · 4 comments


bratao commented Feb 11, 2016

Hello!

Thanks again for this awesome project!

From my understanding, the performance for sequential text tagging should be equal to or better than CRFSharp or CRF++, right?
My task is to semantically tag a single large, continuous text (hundreds of pages).

In CRF++ I get a token error ratio of about 0.5%. In RNNSharp I can't get it better than 40%, a gigantic difference. I tried LSTM and BPTT, with CRF turned on and off. No luck.

Is this expected for my use case, or am I doing something wrong?
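For reference, the token error ratio quoted above is just the fraction of tokens whose predicted tag differs from the gold tag. A minimal sketch of the computation (the helper below is hypothetical, not part of RNNSharp or CRF++):

```python
def token_error_ratio(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag differs from the gold tag."""
    assert len(gold_tags) == len(predicted_tags)
    errors = sum(g != p for g, p in zip(gold_tags, predicted_tags))
    return errors / len(gold_tags)

# 1 mismatch out of 4 tokens -> 0.25, i.e. a 25% token error ratio
print(token_error_ratio(["B-PER", "I-PER", "O", "O"],
                        ["B-PER", "O",     "O", "O"]))
```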

@zhongkaifu (Owner)

Hi Bratao,

What feature set did you use? I suggest using both lexical features and dense features, such as word embeddings. Could you please share your configuration file and command line with me? I can look into it.

CRFSharp and CRF++ should have similar performance. In addition, both of them are able to generate huge feature sets from unigram and bigram feature templates, so their number of features can be much larger than what RNNSharp has.

Did you try your data set with CRFSharp or CRF++?
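For reference, combining lexical (template) features with word-embedding features is done in the feature configuration file. The sketch below follows the key names in the RNNSharp README of that era; treat the exact keys and values as illustrative assumptions, since they may differ by version:

```
#Lexical feature templates (CRF++-style template file)
TFEATURE_FILENAME:tfeatures
#Token-offset window over which template features are extracted
TFEATURE_CONTEXT:-1,0,1
#Pre-trained word embedding file (dense features)
WORDEMBEDDING_FILENAME:vector.bin
#Token-offset window over which embedding vectors are concatenated
WORDEMBEDDING_CONTEXT:-1,0,1
#Corpus column the embedding lookup is keyed on (0 = the surface word)
WORDEMBEDDING_COLUMN:0
```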

@zhongkaifu (Owner)

#1: How did you generate vector.bin for the word embedding features?

#2: I saw you are using U02:%x[0,1] in your template. How many columns does your training corpus "bruno-data.txt" have? Can you share a few example lines with me?

#3: I suggest trying these parameters first.
.\Bin\RNNSharpConsole.exe -mode train -trainfile bruno-data.txt -modelfile .\bruno-model.bin -validfile bruno-valid.txt -ftrfile .\config_bruno.txt -tagfile .\bruno-tags.txt -modeltype 0 -layersize 200 -alpha 0.1 -crf 0 -maxiter 20 -savestep 200K -dir 1
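For context on #2: in a CRF++-style template, %x[0,1] addresses the current row (offset 0) and column index 1, so the corpus needs at least two feature columns ahead of the tag column. A hypothetical CoNLL-style snippet with word, part-of-speech, and tag columns (the actual columns in bruno-data.txt are not shown in this thread):

```
Bill       NNP   B-PER
Gates      NNP   I-PER
founded    VBD   O
Microsoft  NNP   B-ORG
.          .     O
```

An empty line would separate one sequence from the next.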

@zhongkaifu (Owner)

I think a 1M file is not enough for word embedding training. You can try "Txt2VecConsole.exe -mode distance..." to verify the quality of vector.bin.

Since word embedding training is completely unsupervised, you can use a much bigger corpus to train it.
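For intuition, a "distance" check of this kind ranks words by cosine similarity in the embedding space; badly trained embeddings show up as neighbours with no semantic relation to the query word. A minimal sketch of the underlying computation, assuming the vectors have already been loaded into a dict (Txt2Vec's binary layout is not covered here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(word, embeddings, k=5):
    """Top-k neighbours of `word`; `embeddings` maps word -> list of floats."""
    query = embeddings[word]
    scored = [(w, cosine(query, v)) for w, v in embeddings.items() if w != word]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```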


bratao commented Feb 12, 2016

@zhongkaifu, oh, thank you for clarifying this. I will try with a bigger corpus!

My understanding was that word embeddings were just an extra set of features, and that without them the results would be comparable to CRFSharp.

Thank you again so much for the help! I will report back if I have any success!

bratao closed this as completed Feb 24, 2016
zhongkaifu added a commit that referenced this issue Feb 5, 2017
…hidden layer is more than 1

#2. Improve training part of bi-directional RNN. We don't re-run forward before updating weights
#3. Fix bugs in Dropout layer
#4. Change hidden layer settings in configuration file.
#5. Refactoring code
airlsyn pushed a commit to airlsyn/RNNSharp that referenced this issue Feb 9, 2017
…lled when running validation

zhongkaifu#2. Support model vector quantization reduce model size to 1/4 original
zhongkaifu#3. Refactoring code and speed up training
zhongkaifu#4. Fixing feature extracting bug
zhongkaifu added a commit that referenced this issue May 3, 2017
#2. Improve training performance by ~300%
#3. Fix learning rate update bug
#4. Apply SIMD instruction to update error in layers
#5. Code refactoring