
Poor performance in Sequential Tagging #4


Closed

bratao opened this issue Feb 11, 2016 · 4 comments


bratao commented Feb 11, 2016

Hello!

Thanks again for this awesome project!

From my understanding, the performance for sequential text tagging should be equal to or better than CRFSharp or CRF++, right?
My task is to semantically tag a single large, continuous text (hundreds of pages).

In CRF++ I get a token error ratio of about 0.5%. In RNNSharp I can't get it better than 40%, a gigantic difference. I tried LSTM and BPTT, with CRF turned on and off. No luck.

Is this expected for my use case, or am I doing something wrong?
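For reference, the token error ratio quoted above is just the fraction of tokens whose predicted tag differs from the gold tag. A minimal sketch of the computation (the helper below is hypothetical, not part of RNNSharp or CRF++):

```python
def token_error_ratio(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag differs from the gold tag."""
    assert len(gold_tags) == len(predicted_tags)
    errors = sum(g != p for g, p in zip(gold_tags, predicted_tags))
    return errors / len(gold_tags)

# 1 mismatch out of 4 tokens -> 0.25, i.e. a 25% token error ratio
print(token_error_ratio(["B-PER", "I-PER", "O", "O"],
                        ["B-PER", "O",     "O", "O"]))
```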

@zhongkaifu (Owner)

Hi Bratao,

What feature set did you use? I suggest using both lexical features and dense features, such as word embeddings. Could you please share your configuration file and command line with me? I can look into it.

CRFSharp and CRF++ should have similar performance. In addition, both of them are able to generate huge feature sets from unigram and bigram feature templates, so their number of features can be much larger than what RNNSharp has.

Did you try your data set with CRFSharp or CRF++?
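For reference, combining lexical (template) features with word-embedding features is done in the feature configuration file. The sketch below follows the key names in the RNNSharp README of that era; treat the exact keys and values as illustrative assumptions, since they may differ by version:

```
#Lexical feature templates (CRF++-style template file)
TFEATURE_FILENAME:tfeatures
#Token-offset window over which template features are extracted
TFEATURE_CONTEXT:-1,0,1
#Pre-trained word embedding file (dense features)
WORDEMBEDDING_FILENAME:vector.bin
#Token-offset window over which embedding vectors are concatenated
WORDEMBEDDING_CONTEXT:-1,0,1
#Corpus column the embedding lookup is keyed on (0 = the surface word)
WORDEMBEDDING_COLUMN:0
```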

@zhongkaifu (Owner)

#1: How did you generate vector.bin for the word embedding features?

#2: I saw you are using U02:%x[0,1] in your template. How many columns does your training corpus "bruno-data.txt" have? Can you share a few example lines with me?

#3: I suggest trying these parameters first.
.\Bin\RNNSharpConsole.exe -mode train -trainfile bruno-data.txt -modelfile .\bruno-model.bin -validfile bruno-valid.txt -ftrfile .\config_bruno.txt -tagfile .\bruno-tags.txt -modeltype 0 -layersize 200 -alpha 0.1 -crf 0 -maxiter 20 -savestep 200K -dir 1
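For context on #2: in a CRF++-style template, %x[0,1] addresses the current row (offset 0) and column index 1, so the corpus needs at least two feature columns ahead of the tag column. A hypothetical CoNLL-style snippet with word, part-of-speech, and tag columns (the actual columns in bruno-data.txt are not shown in this thread):

```
Bill       NNP   B-PER
Gates      NNP   I-PER
founded    VBD   O
Microsoft  NNP   B-ORG
.          .     O
```

An empty line would separate one sequence from the next.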

@zhongkaifu (Owner)

I think a 1M file is not enough for word embedding training. You can try "Txt2VecConsole.exe -mode distance..." to verify the quality of vector.bin.

Since word embedding training is completely unsupervised, you can use a much bigger corpus to train it.
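For intuition, a "distance" check of this kind ranks words by cosine similarity in the embedding space; badly trained embeddings show up as neighbours with no semantic relation to the query word. A minimal sketch of the underlying computation, assuming the vectors have already been loaded into a dict (Txt2Vec's binary layout is not covered here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(word, embeddings, k=5):
    """Top-k neighbours of `word`; `embeddings` maps word -> list of floats."""
    query = embeddings[word]
    scored = [(w, cosine(query, v)) for w, v in embeddings.items() if w != word]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```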


bratao commented Feb 12, 2016

@zhongkaifu, oh, thank you for clarifying this. I will try with a bigger corpus!

My understanding was that word embeddings were just an extra set of features, and that without them the results would be comparable to CRFSharp.

Thank you again so much for the help! I will report back if I have any success!

bratao closed this as completed Feb 24, 2016
zhongkaifu added a commit that referenced this issue Feb 5, 2017
…hidden layer is more than 1

#2. Improve training part of bi-directional RNN. We don't re-run forward before updating weights
#3. Fix bugs in Dropout layer
#4. Change hidden layer settings in configuration file.
#5. Refactoring code
airlsyn pushed a commit to airlsyn/RNNSharp that referenced this issue Feb 9, 2017
…lled when running validation

zhongkaifu#2. Support model vector quantization reduce model size to 1/4 original
zhongkaifu#3. Refactoring code and speed up training
zhongkaifu#4. Fixing feature extracting bug
zhongkaifu added a commit that referenced this issue May 3, 2017
#2. Improve training performance by ~300%
#3. Fix learning rate update bug
#4. Apply SIMD instruction to update error in layers
#5. Code refactoring