RNNSharp supports several types of feature sets: template features, context template features, pretrained features, run time features and source sequence encoding features. These features are controlled by the configuration file; the following sections describe how each of them works.
## Template Features
Template features are generated from templates. Given a set of templates and a corpus, the features are generated automatically. In RNNSharp, template features are sparse features: if a feature exists for the current token, its value is 1 (or the feature frequency), otherwise it is 0. They are similar to CRFSharp features. In RNNSharp, TFeatureBin.exe is the console tool that generates this type of feature.
In the template file, each line describes one template, which consists of a prefix, an id and a rule-string. The prefix indicates the template type. So far, RNNSharp supports U-type features, so the prefix is always "U". The id is used to distinguish different templates, and the rule-string is the feature body.
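As an illustration, a template file fragment might look like the following. The rule-strings here are hypothetical, written in CRFSharp-style %x[row,column] syntax, where row is a token offset relative to the current token and column is a column index in the corpus:

```
U07:%x[0,0]
U08:%x[0,0]
U11:%x[-1,0]/%x[0,0]
U12:%x[-1,0]/%x[0,0]
```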
Although U07 and U08, and likewise U11 and U12, have identical rule-strings, we can still distinguish them by their id strings.
In the configuration file, the keyword TFEATURE_FILENAME specifies the file name of the template feature set in binary format.
## Context Template Features
Context template features are template features combined with context. For example, if the context setting is "-1,0,1", the feature combines the features of the current token with those of its previous and next tokens. For instance, if the sentence is "how are you", the generated feature set for the token "are" will be {Feature("how"), Feature("are"), Feature("you")}.
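To make the combination concrete, here is a minimal Python sketch of the idea (illustrative only; the function name and feature strings are invented, not RNNSharp's actual API):

```python
# Sketch of context template features: for each token, combine the
# features of the tokens inside the context window, tagged by offset.
def context_features(tokens, context=(-1, 0, 1)):
    all_feats = []
    for i in range(len(tokens)):
        combined = []
        for offset in context:
            j = i + offset
            if 0 <= j < len(tokens):  # skip positions outside the sentence
                combined.append(f"{offset}:Feature({tokens[j]})")
        all_feats.append(combined)
    return all_feats

# For "how are you", the middle token combines all three tokens' features.
print(context_features(["how", "are", "you"])[1])
```

Note that tokens near sentence boundaries simply get fewer combined features, since out-of-range positions are skipped.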
In the configuration file, the keyword TFEATURE_CONTEXT specifies the token context range for this feature.
## Pretrained Features
RNNSharp supports two types of pretrained features: embedding features and auto-encoder features. Both of them represent a given token as a fixed-length vector, and both are dense features in RNNSharp. Pretrained features are especially useful when only a small labeled corpus is available but there is plenty of unlabeled corpus.
Embedding features are trained from an unlabeled corpus by the Txt2Vec project, and RNNSharp uses them as static features for each given token. Auto-encoder features, in contrast, are trained by RNNSharp itself, and can then be used as dense features in other trainings. Note that the token granularity of the pretrained features should be consistent with the training corpus of the main training; otherwise, some tokens will not match any pretrained feature.
Like template features, embedding features also support context: the features of all tokens in the given context range are combined into a single embedding feature. Auto-encoder features do not support context yet. In the configuration file, WORDEMBEDDING_FILENAME specifies the encoded embedding data file generated by Txt2Vec, WORDEMBEDDING_CONTEXT specifies the token context range, and WORDEMBEDDING_COLUMN specifies the column index in the corpus that the feature is applied to.
## Run Time Features
Compared with the other features, which are generated offline, this feature is generated at run time. It uses the predicted results of previous tokens as a run time feature for the current token. This feature is only available for forward RNNs; bi-directional RNNs do not support it.
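The feedback loop can be sketched as follows (a toy illustration under assumed behavior, not RNNSharp code; the predict callback stands in for the real network):

```python
# Forward decoding with a run time feature: the tag predicted for the
# previous token is fed back as an input feature for the current token.
def decode(tokens, predict):
    tags = []
    for i, token in enumerate(tokens):
        prev_tag = tags[i - 1] if i > 0 else "<s>"  # run time feature
        tags.append(predict(token, prev_tag))
    return tags

# Toy "model": "the" is a determiner, and the word after one is a noun.
def toy_predict(token, prev_tag):
    if token == "the":
        return "DT"
    return "NN" if prev_tag == "DT" else "O"

print(decode(["the", "cat"], toy_predict))
```

This also suggests why a bi-directional network cannot use it: the backward pass would need predictions for tokens that have not been decoded yet.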
In the configuration file, the keyword RTFEATURE_CONTEXT specifies the context range of this feature.
## Source Sequence Encoding Feature
This feature is only for sequence-to-sequence tasks. In a sequence-to-sequence task, RNNSharp encodes the given source sequence into a fixed-length vector and then passes it as a dense feature to the decoder that generates the target sequence.
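As an idea sketch (assumptions: a toy mean-pooling encoder stands in for RNNSharp's real sequence encoder, and all names are invented):

```python
# Encode a source sequence of token vectors into one fixed-length
# vector; a decoder could then receive this vector as a dense feature
# at every target-side generation step.
def encode(source_vectors):
    dim = len(source_vectors[0])
    n = len(source_vectors)
    # toy encoder: element-wise mean over all source token vectors
    return [sum(v[d] for v in source_vectors) / n for d in range(dim)]

print(encode([[1.0, 2.0], [3.0, 4.0]]))
```

Whatever the source length, the encoder output has a fixed dimension, which is what lets it act as a dense feature for the decoder.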
## Configuration File
The configuration file describes the model structure and features. In the console tool, use -cfgfile as the parameter to specify this file. Here is an example for a sequence labeling task:
\#Working directory. It is the parent directory of the relative paths below.
CURRENT_DIRECTORY = .

\#Model type. Sequence labeling (SEQLABEL) and sequence-to-sequence (SEQ2SEQ) are supported.
MODEL_TYPE = SEQLABEL

\#Model direction. Forward and BiDirectional are supported
MODEL_DIRECTION = BiDirectional

\#Hidden layers settings. BPTT, LSTM and Dropout are supported. Here are examples of these layer types:
\#BPTT: 200:BPTT:5 -- Layer size is 200, BPTT value is 5
\#Dropout: 200:Dropout:0.5 -- Layer size is 200, dropout ratio is 0.5
\#If the model has more than one hidden layer, the settings of each layer are separated by a comma. For example:
\#"300:LSTM, 200:LSTM" means the model has two LSTM layers. The first layer size is 300, and the second layer size is 200
HIDDEN_LAYER = 200:LSTM

\#The file name of the template feature set
TFEATURE_FILENAME = tfeatures
\#Output layer settings. Softmax and NCESoftmax are supported. Here is an example of NCESoftmax:
\#"NCESoftmax:20" means the output layer is an NCESoftmax layer and its negative sample size is 20
OUTPUT_LAYER = Softmax
\#CRF layer settings
CRF_LAYER = True
\#The context range for the template feature set. In the example below, the context is the current token, the next token and the token after next
TFEATURE_CONTEXT = 0,1,2

\#The feature weight type. Binary and Freq are supported
TFEATURE_WEIGHT_TYPE = Binary

\#Pretrained features type: 'Embedding' and 'Autoencoder' are supported.
\#For 'Embedding', the pretrained model is trained by Txt2Vec, and it looks like a word embedding model.
\#For 'Autoencoder', the pretrained model is trained by RNNSharp itself. You need to train an auto-encoder model with RNNSharp first, and then use this pretrained model for your task.
PRETRAIN_TYPE = Embedding
\#The following settings are for a pretrained model of the 'Embedding' type.
\#The embedding model generated by Txt2Vec (https://github.com/zhongkaifu/Txt2Vec). If it is in raw text format, use WORDEMBEDDING_RAW_FILENAME instead of WORDEMBEDDING_FILENAME as the keyword
WORDEMBEDDING_FILENAME = word_vector.bin
\#The context range of the run time feature. In the example below, RNNSharp uses the output of the previous token as a run time feature for the current token
RTFEATURE_CONTEXT = -1
\#The following setting specifies the configuration file for the source sequence encoder, which is only used in sequence-to-sequence tasks (MODEL_TYPE = SEQ2SEQ).
\#In this example, since MODEL_TYPE is SEQLABEL, it is commented out.
-validfile <string>: validation corpus file used during training
-cfgfile <string>: configuration file
-tagfile <string>: output tag or vocabulary file
-alpha <float>: learning rate, default is 0.1
-maxiter <int>: maximum number of training iterations. 0 means no limit; default is 20
-savestep <int>: save a temporary model after every <int> sentences; default is 0
This command trains a bi-directional recurrent neural network with a CRF output layer. The network has two BPTT hidden layers and one softmax output layer. The first hidden layer size is 200 and the second is 100.
This command trains a forward sequence-to-sequence LSTM model whose output layer is a negative sampling softmax (NCESoftmax). The encoder is defined in the [AUTOENCODER_XXX] section of the features_seq2seq.txt file.
In this mode, the console tool predicts output tags for a given corpus: given a test corpus file, RNNSharp predicts output tags in a sequence labeling task, or generates a target sequence in a sequence-to-sequence task. The usage is as follows:
RNNSharpConsole.exe -mode test <parameters>
Parameters for predicting output tags for a given corpus:
-testfile <string>: test corpus file
-modelfile <string>: encoded model file
-tagfile <string>: supported output tagid-name list file
TFeatureBin.exe is used to generate the template feature set from given template and corpus files. For high-performance access and low memory cost, the indexed feature set is built as a double array trie-tree by AdvUtils. The tool supports three modes, as follows:
TFeatureBin.exe <parameters>
In this mode, the tool generates template features from the corpus and indexes them into a feature set file.