Commit 0d6259f (parent 3866054)

#1. Convert numeric type from double to float
#2. Refactor configuration file and command line parameters
#3. Use SIMD for backward pass in output layer

23 files changed: +1520 −1459 lines

README.md (+67 −58)
@@ -22,11 +22,11 @@ Here is the neural network for sequence-to-sequence task. "TokenN" are from sour
![](https://github.com/zhongkaifu/RNNSharp/blob/master/RNNSharpSeq2Seq.jpg)

## Supported Feature Types
-RNNSharp supports four types of feature set. They are template features, context template features, run time feature and word embedding features. These features are controlled by configuration file, the following paragraph will introduce how these feaures work.
+RNNSharp supports many different feature types, and the following paragraphs introduce how these features work.

## Template Features

-Template features are generated by templates. By given templates, according corpus, the features are generated automatically. The template feature is binary feature. If the feature exists in current token, its value will be 1, otherwise, the value will be 0. It's similar as CRFSharp features. In RNNSharp, TFeatureBin.exe is the console tool to generate this type of features.
+Template features are generated by templates. Given templates and a corpus, these features can be generated automatically. In RNNSharp, template features are sparse features: if a feature exists for the current token, its value will be 1 (or the feature frequency), otherwise it will be 0. This is similar to CRFSharp features. In RNNSharp, TFeatureBin.exe is the console tool to generate this type of feature.

In template file, each line describes one template, which consists of prefix, id and rule-string. The prefix indicates the template type. So far, RNNSharp supports the U-type feature, so the prefix is always "U". Id is used to distinguish different templates. And rule-string is the feature body.
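The Binary/frequency distinction for sparse template features can be sketched as follows. This is an illustrative sketch, not RNNSharp's actual code; `feature_values` and the `"Binary"`/`"Freq"` strings are hypothetical names chosen to mirror the TFEATURE_WEIGHT_TYPE setting described later.

```python
# Illustrative sketch of sparse template feature values (not RNNSharp's
# actual code). With Binary weighting a present feature gets value 1; with
# Freq weighting it gets its occurrence count.

from collections import Counter

def feature_values(features, weight_type="Binary"):
    counts = Counter(features)
    if weight_type == "Binary":
        return {f: 1 for f in counts}   # present -> 1, absent -> implicit 0
    return dict(counts)                 # "Freq": value is the frequency

feats = ["U01:how", "U02:are", "U01:how"]
binary = feature_values(feats, "Binary")
freq = feature_values(feats, "Freq")
```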
@@ -93,62 +93,88 @@ U15:Care/VBP

Although U07 and U08, U11 and U12’s rule-strings are the same, we can still distinguish them by id string.

-In feature configuration file, keyword TFEATURE_FILENAME is the file name of template feature set in binary format
-
## Context Template Features

Context template features are based on template features, combined with context. In this example, if the context setting is "-1,0,1", the feature will combine the features of the current token with those of its previous token and next token. For instance, if the sentence is "how are you", the generated feature set will be {Feature("how"), Feature("are"), Feature("you")}.
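The context combination above can be sketched as follows. This is a hypothetical illustration, not RNNSharp's actual code: `token_feature`, `context_features` and the one-hot vocabulary are invented helpers that only show how a "-1,0,1" window concatenates per-token features.

```python
# Illustrative sketch of context template features (not RNNSharp's actual
# code). Each token has a sparse feature vector; a context setting like
# [-1, 0, 1] concatenates the vectors of the previous, current and next tokens.

def token_feature(token, vocab):
    """One-hot sparse feature for a single token (hypothetical helper)."""
    vec = [0] * len(vocab)
    if token in vocab:
        vec[vocab[token]] = 1
    return vec

def context_features(tokens, index, context, vocab):
    """Concatenate token features over the given context offsets."""
    combined = []
    for offset in context:
        i = index + offset
        if 0 <= i < len(tokens):
            combined.extend(token_feature(tokens[i], vocab))
        else:
            combined.extend([0] * len(vocab))  # pad outside the sentence
    return combined

vocab = {"how": 0, "are": 1, "you": 2}
tokens = ["how", "are", "you"]
# Feature for "are" (index 1) with context -1,0,1 combines all three tokens.
feat = context_features(tokens, 1, [-1, 0, 1], vocab)
```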
-In feature configuration file, keyword TFEATURE_CONTEXT is used to specify the tokens' context range for the feature.
-
-## Word Embedding Features
+## Pretrained Features

-Word embedding features are used to describe the features of given token. It's very useful when we only have small labeled corpus, but have lots of unlabeled corpus. This feature is generated by Txt2Vec project. With lots of unlabeled corpus, Txt2Vec is able to generate vectors for each token. Note that, the token's granularity between word embedding feature and RNN training corpus should be consistent, otherwise, tokens in training corpus are not able to be matched with the feature. For more detailed information about how to generate word embedding features, please visit Txt2Vec homepage.
+RNNSharp supports two types of pretrained features: embedding features and auto-encoder features. Both represent a given token as a fixed-length vector, and both are dense features in RNNSharp.

-In RNNSharp, this feature also supports context feature. It will combine all features of given contexts into a single word embedding feature.
+Embedding features are trained from an unlabeled corpus by the Txt2Vec project, and RNNSharp uses them as static features for each given token. Auto-encoder features are trained by RNNSharp itself and can then be used as dense features for other trainings. Note that the token granularity of the pretrained features should be consistent with the training corpus of the main training, otherwise some tokens will not match any pretrained feature.

-In feature configuration, it has three keywords: WORDEMBEDDING_FILENAME is used to specify the encoded word embedding data file name generated by Txt2Vec. WORDEMBEDDING_CONTEXT is used to specify the token's context range. And WORDEMBEDDING_COLUMN is used to specify the column index applied the feature in corpus
+Like template features, embedding features also support context: all features in the given context window are combined into a single embedding feature. Auto-encoder features do not support context yet.

## Run Time Features

Compared with the other features, which are generated offline, this feature is generated at run time. It uses the results of previous tokens as a run time feature for the current token. This feature is only available for forward RNN; bi-directional RNN does not support it.

-In feature configuration, keyword RTFEATURE_CONTEXT is used to specify the context range of this feature.
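The feedback loop described above can be sketched as a forward decoding pass. This is an illustrative sketch, not RNNSharp's actual code; `predict`, its tag rules, and the `"<s>"` sentinel are hypothetical stand-ins that only show how the previous token's output becomes an input feature for the current token.

```python
# Illustrative sketch of a run time feature (not RNNSharp's actual code).
# In a forward pass, the label predicted for the previous token is fed back
# as an extra input feature when predicting the current token.

def predict(token, prev_label):
    """Hypothetical per-token classifier that consults the previous label."""
    if prev_label == "WRB":
        return "VBP"                       # a token after "how" is tagged verb
    return "WRB" if token == "how" else "PRP"

def decode_forward(tokens):
    labels = []
    prev_label = "<s>"                     # sentinel for the first token
    for token in tokens:
        # prev_label acts as the run time feature for the current token
        label = predict(token, prev_label)
        labels.append(label)
        prev_label = label
    return labels

labels = decode_forward(["how", "are", "you"])
```

Because each prediction depends on the previous one, the pass only works left to right, which is why a bi-directional model cannot use this feature.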
+## Source Sequence Encoding Feature
+
+This feature is only for the sequence-to-sequence task. In a sequence-to-sequence task, RNNSharp encodes the given source sequence into a fixed-length vector, and then passes it as a dense feature when generating the target sequence.
+
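The fixed-length contract can be sketched as follows. This is only a hypothetical illustration of the interface: RNNSharp's real encoder is a recurrent network, whereas `encode_source` here uses simple mean pooling as a stand-in to show that sequences of any length collapse to one same-sized dense vector.

```python
# Illustrative sketch of the source sequence encoding interface (not
# RNNSharp's actual code; the real encoder is an RNN, mean pooling is
# only a stand-in for the fixed-length contract).

def encode_source(vectors, size):
    """Collapse a variable-length sequence of token vectors into one
    fixed-length dense vector."""
    if not vectors:
        return [0.0] * size
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

# Two source sequences of different lengths map to same-sized vectors,
# so the decoder can consume them as ordinary dense features.
a = encode_source([[1.0, 2.0], [3.0, 4.0]], 2)
b = encode_source([[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]], 2)
```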
+## Configuration File
+
+The configuration file describes the model structure and features. In the console tool, use -cfgfile as the parameter to specify this file. Here is an example for a sequence labeling task:
+
+#Working directory. It is the parent directory of the relative paths below.
+CURRENT_DIRECTORY = .
+
+#Model type. Sequence labeling (SEQLABEL) and sequence-to-sequence (SEQ2SEQ) are supported.
+MODEL_TYPE = SEQLABEL
+
+#Model direction. Forward and BiDirectional are supported.
+MODEL_DIRECTION = BiDirectional

-## Feature Configuration File
+#Model file path
+MODEL_FILEPATH = Data\Models\ParseORG_CHS\model.bin

-The configuration file has settings for different feature types introduced in above. Here is an example.. In console tool, use -ftrfile as parameter to specify this file.
+#Hidden layers settings. BPTT, LSTM and Dropout are supported. Here are examples of these layer types:
+#BPTT: 200:BPTT:5 -- Layer size is 200, BPTT value is 5
+#Dropout: 200:Dropout:0.5 -- Layer size is 200, dropout ratio is 0.5
+#If the model has more than one hidden layer, the settings for each layer are separated by commas. For example:
+#"300:LSTM, 200:LSTM" means the model has two LSTM layers. The first layer size is 300, and the second layer size is 200.
+HIDDEN_LAYER = 200:LSTM
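The "size:type[:value]" layer syntax above can be parsed as sketched below. This is an illustrative parser, not RNNSharp's actual code; `parse_hidden_layers` and the dictionary shape are invented for the example.

```python
# Illustrative parser for the HIDDEN_LAYER setting (not RNNSharp's actual
# code). A value like "300:LSTM, 200:BPTT:5" lists one layer per
# comma-separated entry: size, layer type, and an optional type-specific
# value (e.g. BPTT steps or dropout ratio).

def parse_hidden_layers(value):
    layers = []
    for entry in value.split(","):
        parts = entry.strip().split(":")
        layer = {"size": int(parts[0]), "type": parts[1]}
        if len(parts) > 2:                 # optional third field
            layer["param"] = float(parts[2])
        layers.append(layer)
    return layers

layers = parse_hidden_layers("300:LSTM, 200:BPTT:5")
```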

-#The file name of template feature set
-TFEATURE_FILENAME:tfeatures
+#Output layer settings. Softmax and NCESoftmax are supported. Here is an example of NCESoftmax:
+#"NCESoftmax:20" means the output layer is an NCESoftmax layer and its negative sample size is 20
+OUTPUT_LAYER = Softmax

-#The context range of template feature set. In below example, the context is current token, next token and next after next token
-TFEATURE_CONTEXT: 0,1,2
+#CRF layer settings
+CRF_LAYER = True

-#Pretrain model type: Currently, it supports two types: 'Embedding' and 'Autoencoder'. The default type is 'Embedding'.
+#The file name for the template feature set
+TFEATURE_FILENAME = Data\Models\ParseORG_CHS\tfeatures
+#The context range for the template feature set. Below, the context is the current token, the next token and the token after that
+TFEATURE_CONTEXT = 0,1,2
+#The feature weight type. Binary and Freq are supported
+TFEATURE_WEIGHT_TYPE = Binary
+
+#Pretrained feature type: 'Embedding' and 'Autoencoder' are supported.
#For 'Embedding', the pretrained model is trained by Txt2Vec, which looks like a word embedding model.
#For 'Autoencoder', the pretrained model is trained by RNNSharp itself. You need to train an auto encoder-decoder model with RNNSharp first, and then use this pretrained model for your task.
-PRETRAIN_TYPE:AUTOENCODER
+PRETRAIN_TYPE = Embedding

#The following settings are for the pretrained model in 'Embedding' type.
-#The word embedding model generated by Txt2Vec. If embedding model is raw text format, we should use WORDEMBEDDING_RAW_FILENAME instead of WORDEMBEDDING_FILENAME as keyword
-WORDEMBEDDING_FILENAME:word_vector.bin
-
+#The embedding model generated by Txt2Vec (https://github.com/zhongkaifu/Txt2Vec). If it is in raw text format, use WORDEMBEDDING_RAW_FILENAME instead of WORDEMBEDDING_FILENAME as the keyword
+WORDEMBEDDING_FILENAME = Data\WordEmbedding\wordvec_chs.bin
#The context range of the word embedding. In the example below, the context is the current token, the previous token and the next token
+#If more than one token is combined, this feature can use a large amount of memory.
WORDEMBEDDING_CONTEXT: -1,0,1
+#The column index the word embedding feature is applied to
+WORDEMBEDDING_COLUMN = 0

-#The column index for word embedding feature
-WORDEMBEDDING_COLUMN: 0
-
-#The following settings are for pretrained model in 'Autoencoder' type.
-#The auto encoder model generated by RNNSharp itself.
-AUTOENCODER_MODEL: D:\RNNSharpDemoPackage\AutoEncoder\model.bin
-
+#The following setting is for the pretrained model in 'Autoencoder' type.
#The feature configuration file for the pretrained model.
-AUTOENCODER_FEATURECONFIG: D:\RNNSharpDemoPackage\features_autoencoder.txt
+AUTOENCODER_CONFIG: D:\RNNSharpDemoPackage\config_autoencoder.txt

-#The context range of run time feature. In below exampl, RNNSharp will use the output of previous token as run time feature for current token
-RTFEATURE_CONTEXT: -1
+#The following setting is the configuration file for the source sequence encoder, which is only used for the sequence-to-sequence task (MODEL_TYPE = SEQ2SEQ).
+#Since MODEL_TYPE is SEQLABEL in this example, it is commented out.
+#SEQ2SEQ_AUTOENCODER_CONFIG: D:\RNNSharpDemoPackage\config_seq2seq_autoencoder.txt
+
+#The context range of the run time feature. In the example below, RNNSharp will use the output of the previous token as a run time feature for the current token
+#Note that the bi-directional model does not support run time features, so it is commented out.
+#RTFEATURE_CONTEXT: -1

## Training file format

@@ -234,48 +260,31 @@ RNNSharpConsole.exe -mode train <parameters>
Parameters for training an RNN-based model:
-trainfile <string>: training corpus file
-validfile <string>: validation corpus for training
--modelfile <string>: encoded model file
--hiddenlayertype <string>: hidden layer type. BPTT and LSTM are supported, default is BPTT
--outputlayertype <string>: output layer type. Softmax and NCESoftmax are supported, default is Softmax
--ncesamplesize <int>: noise contrastive estimation (NCE) sample size, default is 15
--ftrfile <string>: feature configuration file
--tagfile <string>: supported output tagid-name list file
+-cfgfile <string>: configuration file
+-tagfile <string>: output tag or vocabulary file
-alpha <float>: learning rate, default is 0.1
--dropout <float>: hidden layer node drop out ratio, default is 0
--bptt <int>: the step for back-propagation through time, default is 4
--layersize <int>: the size of each hidden layer, default is 200 for a single layer. If you want to have more than one layer, each layer size is split by character ',' For example: "-layersize = 200,100" means the neural network has two hidden layers, the first hidden layer size is 200, and the second hidden layer size is 100
--crf <0/1>: training model by standard RNN(0) or RNN-CRF(1), default is 0
-maxiter <int>: maximum iterations for training. 0 means no limit, default is 20
-savestep <int>: save a temporary model after every <int> sentences, default is 0
--dir <int> : RNN directional: 0 - Forward RNN, 1 - Bi-directional RNN, default is 0
-vq <int>: model vector quantization, 0 is disabled, 1 is enabled, default is 0
--seq2seq <boolean> : Train a sequence-to-sequence model if it's true, otherwise, train a sequence labeling model. Default is false
-
-Example for sequence labeling task: RNNSharpConsole.exe -mode train -trainfile train.txt -validfile valid.txt -modelfile model.bin -ftrfile features.txt -tagfile tags.txt -hiddenlayertype BPTT -outputlayertype softmax -layersize 200,100 -alpha 0.1 -crf 1 -maxiter 20 -savestep 200K -dir 1 -vq 0 -grad 15.0
-
-This command trains a bi-directional recurrent neural network with CRF output. The network has two BPTT hidden layers and one softmax output layer. The first hidden layer size is 200 and the second hidden layer size is 100
-
-Example for sequence-to-sequence task: RNNSharpConsole.exe -mode train -trainfile train.txt -modelfile model.bin -ftrfile features_seq2seq.txt -tagfile tags.txt -hiddenlayertype lstm -outputlayertype ncesoftmax -ncesamplesize 20 -layersize 300 -alpha 0.1 -crf 0 -maxiter 0 -savestep 200K -dir 0 -dropout 0 -seq2seq true

-This command trains a forward-directional sequence-to-sequence LSTM model, and the output layer is negative sampling softmax. The encoder is defined in the [AUTOENCODER_XXX] section in the features_seq2seq.txt file.
+Example: RNNSharpConsole.exe -mode train -trainfile train.txt -validfile valid.txt -cfgfile config.txt -tagfile tags.txt -alpha 0.1 -maxiter 20 -savestep 200K -vq 0 -grad 15.0

### Decode Model

-In this mode, the console tool is used to predict output tags of given corpus. The usage as follows:
+In this mode, given a test corpus file, RNNSharp predicts output tags in the sequence labeling task, or generates a target sequence in the sequence-to-sequence task.

RNNSharpConsole.exe -mode test <parameters>
Parameters for predicting the iTagId tag from a given corpus:
--testfile <string>: training corpus file
--modelfile <string>: encoded model file
--tagfile <string>: supported output tagid-name list file
--ftrfile <string>: feature configuration file
+-testfile <string>: test corpus file
+-tagfile <string>: output tag or vocabulary file
+-cfgfile <string>: configuration file
-outfile <string>: result output file

-Example: RNNSharpConsole.exe -mode test -testfile test.txt -modelfile model.bin -tagfile tags.txt -ftrfile features.txt -outfile result.txt
+Example: RNNSharpConsole.exe -mode test -testfile test.txt -tagfile tags.txt -cfgfile config.txt -outfile result.txt

## TFeatureBin

-It's used to generate template feature set by given template and corpus files. For high performance accessing and save memory cost, the indexed feature set is built as double array in trie-tree by AdvUtils. The tool supports three modes as follows:
+It's used to generate the template feature set from given template and corpus files. For high-performance access and low memory cost, the indexed feature set is built as a float array in a trie-tree by AdvUtils. The tool supports three modes as follows:

TFeatureBin.exe <parameters>
The tool generates template features from a corpus and indexes them into a file