Poor performance in Sequential Tagging #4
Comments
Hi Bratao, which feature set did you use? I suggest using both lexical features and dense features, such as word embeddings. Could you please share your configuration file and command line with me? I can look into it. CRFSharp and CRF++ should have similar performance. In addition, both of them can generate a huge number of features from unigram and bigram feature templates, so their feature sets can be much larger than what RNNSharp has. Did you try your data set with CRFSharp or CRF++?
#1: How did you generate vector.bin for the word embedding features? #2: I saw you are using U02:%x[0,1] in the template. How many columns are there in your training corpus "bruno-data.txt"? Can you share a few lines with me as an example? #3: I suggest trying these parameters first.
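To make the column question concrete, here is a minimal Python sketch of how a CRF++-style macro such as `%x[0,1]` is resolved: the first index is a row offset relative to the current token, and the second is an absolute column index. It assumes RNNSharp templates follow the same CRF++ convention. The `expand` helper, the `_PAD_` placeholder, and the sample rows are hypothetical, for illustration only.

```python
import re

# Matches CRF++-style macros such as %x[0,1] or %x[-1,0].
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand(template, rows, pos):
    """Expand one template line at token index `pos`.

    `rows` is a list of token rows; each row is a list of column strings.
    """
    def lookup(match):
        row = pos + int(match.group(1))  # offset relative to the current token
        col = int(match.group(2))        # absolute column index
        if row < 0 or row >= len(rows):
            return "_PAD_"               # illustrative placeholder for out-of-range rows
        return rows[row][col]            # raises IndexError if the column does not exist
    return MACRO.sub(lookup, template)

# Hypothetical feature columns: the word, then a POS-like feature.
rows = [["The", "DT"], ["cat", "NN"], ["sat", "VBD"]]
print(expand("U02:%x[0,1]", rows, 1))           # uses column 1 of the current token
print(expand("U05:%x[-1,0]/%x[0,0]", rows, 0))  # previous word / current word
```

If the corpus has only a word column plus the tag, column 1 would be the tag column itself (in CRF++ the last column is the answer tag), so `%x[0,1]` would not point at a real feature. That is why the number of columns matters here.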
I think a 1M file is not enough for word embedding training. You can try "Txt2VecConsole.exe -mode distance..." to verify the quality of vector.bin. Since word embedding training is totally unsupervised, you can use a bigger corpus to train it.
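As a rough picture of what such a quality check looks for, the Python sketch below ranks the nearest neighbours of a word by cosine similarity. It does not read vector.bin or reproduce Txt2VecConsole's output; the toy vectors and the `nearest` helper are assumptions made up for illustration. With good embeddings, related words should rank close to each other.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(word, vectors, k=3):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

# Toy 3-dimensional embeddings (made-up values, not from any real model).
vectors = {
    "paris": [0.9, 0.1, 0.0],
    "london": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.9, 0.3],
    "banana": [0.1, 0.8, 0.4],
}
print(nearest("paris", vectors, k=1))
```

If the neighbours of common words in your own embeddings look unrelated, that suggests the training corpus was too small.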
@zhongkaifu, oh, thank you for clarifying this. I will try with a bigger corpus! My understanding was that word embeddings were just an extra set of features, and that without them the results would be comparable to CRFSharp. Thank you again so much for the help! I will report back if I get any success!
Hello !
Thanks again for this awesome project !
From my understanding, the performance for sequential text tagging should be equal to or better than CRFSharp or CRF++, right?
My problem is to semantically tag a single large, continuous text (hundreds of pages).
In CRF++ I get a token error ratio of about 0.5%. In RNNSharp I can't get it below 40%, a gigantic difference. I tried LSTM and BPTT, with CRF on or off. No luck.
Is this expected for my use case, or am I doing something wrong?
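For reference when comparing the two tools, the token error ratio is assumed here to mean the fraction of tokens whose predicted tag differs from the gold tag. A minimal Python sketch under that assumption follows; the `token_error_ratio` function and the sample tags are hypothetical, not part of RNNSharp or CRF++.

```python
def token_error_ratio(gold, predicted):
    """Fraction of tokens whose predicted tag differs from the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    errors = sum(1 for g, p in zip(gold, predicted) if g != p)
    return errors / len(gold)

# Made-up tags for a four-token sentence; one of the four is wrong.
gold = ["O", "B-X", "I-X", "O"]
predicted = ["O", "B-X", "O", "O"]
print(token_error_ratio(gold, predicted))  # 0.25
```

Scoring both tools' output on the same held-out tokens with the same function keeps the 0.5% and 40% figures directly comparable.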