Tantivy's bread and butter is to address the problem of full-text search:
Given a large set of textual documents and a text query, return the K most relevant documents very efficiently. To execute these queries rapidly, tantivy needs to build an index beforehand. The relevance score implemented in tantivy is not configurable: tantivy uses the same score as the default similarity used in Lucene / Elasticsearch, called [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
But tantivy's scope does not stop there. Numerous features are required to power rich search applications. For instance, one may want to:

- compute the count of documents matching a query in the different sections of an e-commerce website,
- display an average price per square meter for a real estate search engine,
- take into account historical user data to rank documents in a specific way,
rapidly select all documents matching a given predicate (also known as a query) and
collect some information about them ([See collector](#collector-define-what-to-do-with-matched-documents)).
Roughly speaking, the design follows these guiding principles:

- Search should be O(1) in memory.
- Indexing should be O(1) in memory. (In practice, it is just sublinear.)
- Search should be as fast as possible.
This comes at the cost of the dynamicity of the index: while it is possible to add and delete documents from the corpus, tantivy is designed to handle these updates in large batches.
## [core/](src/core): Index, segments, searchers
Core contains all of the high-level code to make it possible to create an index, add documents, delete documents and commit.
This is both the most high-level part of tantivy, the least performance-sensitive one, and the seemingly most mundane code... and, paradoxically, the most complicated part.

### Index and Segments

A tantivy index is a collection of smaller independent immutable segments.
Each segment contains its own independent set of data structures.
A segment is identified by a segment id that is in fact a UUID.
The files of a segment have the format

```segment-id . ext```

The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
On commit, one segment per indexing thread is written to disk, and the `meta.json` …
For a better idea of how indexing works, you may read the [following blog post](https://fulmicoton.com/posts/behold-tantivy-part2/).
### Deletes
Deletes happen by deleting a "term". Tantivy does not offer any notion of primary id, so it is up to the user to use a field in their schema as if it was a primary id, and delete the associated term if they want to delete only one specific document.
On commit, tantivy will find all of the segments containing documents that match this term and remove those documents from the [alive bitset file](src/fastfield/alive_bitset.rs), which represents the bitset of the alive document ids.
Like all segment files, this file is immutable. Because it is possible to have more than one alive bitset file at a given instant, the alive bitset filename has the format ```segment_id . commit_opstamp . del```.
An opstamp is simply an incremental id that identifies any operation applied to the index. For instance, performing a commit or adding a document.
### DocId
Within a segment, all documents are identified by a DocId that ranges within `[0, max_doc)`.
The DocIds are simply allocated in the order documents are added to the index.
In separate threads, tantivy's index writer searches for opportunities to merge segments.
The point of segment merge is to:

- eventually get rid of tombstoned documents
- reduce the otherwise ever-growing number of segments.
Tantivy's document follows a very strict schema, decided before building any index.
The schema defines all of the fields that the index's [`Document`](src/schema/document.rs) objects may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...), as well as how they should be indexed / represented in tantivy.
Depending on the type of the field, you can decide to

- put it in the docstore
- store it as a fast field
- index it
As of today, tantivy's schema imposes a 1:1 relationship between a field that is …
This is not something tantivy supports, and it is up to the user to duplicate or concatenate fields before feeding them to tantivy.
## General information about these data structures
All data structures in tantivy have:

- a writer
- a serializer
- a reader
This conversion is done by the serializer.
Finally, the reader is in charge of offering an API to read this on-disk, read-only representation.
In tantivy, readers are designed to require very little anonymous memory. The data is read straight from an mmapped file, and loading an index is as fast as mmapping its files.
## [store/](src/store): Here is my DocId, Gimme my document
The docstore is a row-oriented storage that, for each document, stores a subset of the fields
that are marked as stored in the schema. The docstore is compressed using a general-purpose algorithm
Once the top 10 documents have been identified, we fetch them from the store, and …
**Not useful for**
Fetching a document from the store is typically a "slow" operation. It usually consists of

- searching a compact tree-like data structure to find the position of the right block,
- decompressing a small block,
- returning the document from this block.
It is NOT meant to be called for every document matching a query.
As a rule of thumb, if you hit the docstore more than 100 times per search query, you are probably misusing tantivy.
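The three lookup steps above can be sketched with an in-memory stand-in. All names here are hypothetical, and "decompression" is the identity; a real docstore compresses its blocks and reads them from disk:

```rust
// Hypothetical sketch of a block-based docstore lookup.
struct DocStore {
    // First DocId of each block, sorted ascending (block_first_doc[0] == 0).
    block_first_doc: Vec<u32>,
    // One "compressed" block per entry; here a block is just a Vec of docs.
    blocks: Vec<Vec<String>>,
}

impl DocStore {
    /// `doc` must be < max_doc.
    fn get(&self, doc: u32) -> &str {
        // 1) Search the tree-like index (here: a binary search) for the block.
        let block_idx = match self.block_first_doc.binary_search(&doc) {
            Ok(i) => i,
            Err(i) => i - 1,
        };
        // 2) "Decompress" the block (identity in this sketch).
        let block = &self.blocks[block_idx];
        // 3) Return the document from this block.
        &block[(doc - self.block_first_doc[block_idx]) as usize]
    }
}

fn main() {
    let store = DocStore {
        block_first_doc: vec![0, 2],
        blocks: vec![
            vec!["doc0".into(), "doc1".into()],
            vec!["doc2".into(), "doc3".into()],
        ],
    };
    assert_eq!(store.get(0), "doc0");
    assert_eq!(store.get(3), "doc3");
}
```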
## [fastfield/](src/fastfield): Here is my DocId, Gimme my value
Fast fields are stored in a column-oriented storage that allows for random access.
The only compression applied is bitpacking. The column comes with two pieces of metadata:
the minimum value in the column and the number of bits per doc.
Fetching a value for a `DocId` is then as simple as computing
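The computation itself is elided in this excerpt, but it can be sketched as follows. This is a hypothetical, self-contained stand-in: the real reader works on a bitpacked payload read straight from an mmapped file, and `num_bits` is assumed to be below 64 here:

```rust
// Hypothetical sketch of a bitpacked fast field column: every value is stored
// on `num_bits` bits, offset by the column's minimum value.
struct Column {
    min_value: u64,
    num_bits: u32,     // assumed < 64 in this sketch
    data: Vec<u64>,    // bitpacked payload
}

impl Column {
    fn get(&self, doc: u32) -> u64 {
        let bit_pos = doc as u64 * self.num_bits as u64;
        let word = (bit_pos / 64) as usize;
        let shift = (bit_pos % 64) as u32;
        let mask = (1u64 << self.num_bits) - 1;
        let mut bits = self.data[word] >> shift;
        if shift + self.num_bits > 64 {
            // The value straddles two words.
            bits |= self.data[word + 1] << (64 - shift);
        }
        self.min_value + (bits & mask)
    }
}

fn main() {
    // Deltas 1, 2, 15, 0 packed on 4 bits each, with min_value = 100.
    let col = Column { min_value: 100, num_bits: 4, data: vec![0b1111_0010_0001] };
    assert_eq!(col.get(0), 101);
    assert_eq!(col.get(2), 115);
    assert_eq!(col.get(3), 100);
}
```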
For advanced search engines, it is possible to store all of the features required …
Finally, facets are a specific kind of fast field, and the associated source code is in [`fastfield/facet_reader.rs`](src/fastfield/facet_reader.rs).
# The inverted search index
The inverted index is the core part of full-text search.
When presented with a new document containing the text field "Hello, happy tax payer!", tantivy breaks it into a list of so-called tokens. In addition to splitting the string into tokens, it may also apply different kinds of operations, like dropping the punctuation, converting characters to lowercase, applying stemming, etc. Tantivy makes it possible to configure the operations to be applied in the schema (tokenizer/ is the place where these operations are implemented).
The inverted index actually consists of two data structures chained together.
Where [TermInfo](src/postings/term_info.rs) is an object containing some metadata about a term.
## [termdict/](src/termdict): Here is a term, give me the [TermInfo](src/postings/term_info.rs)
Tantivy's term dictionary is mainly in charge of supplying the function
[Term](src/schema/term.rs) ⟶ [TermOrdinal](src/termdict/mod.rs) is addressed by a finite state transducer, implemented by the fst crate.
[TermOrdinal](src/termdict/mod.rs) ⟶ [TermInfo](src/postings/term_info.rs) is addressed by the term info store.
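This two-level lookup can be sketched with a sorted `Vec` standing in for the fst and a plain `Vec` standing in for the term info store. All names are hypothetical, including the fields of `TermInfo`:

```rust
// Hypothetical stand-in for the two-level term dictionary lookup:
// Term -> TermOrdinal (fst), then TermOrdinal -> TermInfo (term info store).
#[derive(Debug, Clone, PartialEq)]
struct TermInfo {
    doc_freq: u32,
    postings_offset: u64,
}

struct TermDict {
    sorted_terms: Vec<String>, // stand-in for the fst
    term_infos: Vec<TermInfo>, // stand-in for the term info store
}

impl TermDict {
    fn term_ord(&self, term: &str) -> Option<u64> {
        self.sorted_terms
            .binary_search_by(|t| t.as_str().cmp(term))
            .ok()
            .map(|i| i as u64)
    }

    fn term_info(&self, ord: u64) -> &TermInfo {
        &self.term_infos[ord as usize]
    }

    fn get(&self, term: &str) -> Option<&TermInfo> {
        self.term_ord(term).map(|ord| self.term_info(ord))
    }
}

fn main() {
    let dict = TermDict {
        sorted_terms: vec!["apple".into(), "banana".into()],
        term_infos: vec![
            TermInfo { doc_freq: 3, postings_offset: 0 },
            TermInfo { doc_freq: 1, postings_offset: 42 },
        ],
    };
    assert_eq!(dict.get("banana").unwrap().doc_freq, 1);
    assert!(dict.get("cherry").is_none());
}
```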
## [postings/](src/postings): Iterate over documents... very fast
A posting list makes it possible to store a sorted list of doc ids and for each doc store
a term frequency as well.
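A minimal in-memory sketch of such a posting list follows. Real posting lists are block-compressed on disk; `seek` (finding the first DocId at or above a target) is the kind of operation query intersection relies on. Names are hypothetical:

```rust
// Hypothetical in-memory posting list: sorted DocIds paired with term frequencies.
struct PostingList {
    docs: Vec<u32>, // sorted DocIds
    tfs: Vec<u32>,  // term frequency for each doc
}

impl PostingList {
    /// First (DocId, term frequency) with DocId >= target, if any.
    fn seek(&self, target: u32) -> Option<(u32, u32)> {
        let idx = self.docs.partition_point(|&d| d < target);
        if idx < self.docs.len() {
            Some((self.docs[idx], self.tfs[idx]))
        } else {
            None
        }
    }
}

fn main() {
    let postings = PostingList { docs: vec![2, 5, 9], tfs: vec![1, 3, 2] };
    assert_eq!(postings.seek(2), Some((2, 1)));
    assert_eq!(postings.seek(4), Some((5, 3)));
    assert_eq!(postings.seek(10), None);
}
```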
The [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) formula also requires knowing the number of tokens stored in a specific field for a given document. We store this information on one byte per document in the fieldnorm.
The fieldnorm is therefore compressed. Values up to 40 are encoded unchanged.
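A hypothetical encoding in this spirit is sketched below. Tantivy's actual table differs; this only illustrates a one-byte lossy encoding where values up to 40 round-trip exactly and the decoded value upper-bounds the true token count:

```rust
// Hypothetical one-byte lossy fieldnorm encoding: exact up to 40,
// geometrically growing buckets above that.
fn encode(num_tokens: u32) -> u8 {
    if num_tokens <= 40 {
        return num_tokens as u8;
    }
    let mut code: u8 = 40;
    let mut upper: u64 = 40;
    while code < u8::MAX {
        code += 1;
        upper += (upper / 4).max(1); // bucket widths grow ~25% per code
        if (num_tokens as u64) <= upper {
            return code;
        }
    }
    u8::MAX
}

/// Decodes a code back to the upper bound of its bucket.
fn decode(code: u8) -> u64 {
    if code <= 40 {
        return code as u64;
    }
    let mut c = 40u8;
    let mut upper = 40u64;
    while c < code {
        c += 1;
        upper += (upper / 4).max(1);
    }
    upper
}

fn main() {
    assert_eq!(encode(7), 7);            // values up to 40 are unchanged
    assert_eq!(encode(40), 40);
    assert!(encode(1000) > 40);          // larger values are bucketed
    assert!(decode(encode(1000)) >= 1000); // decode upper-bounds the input
}
```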
## [tokenizer/](src/tokenizer): How should we process text?
Text processing is key to a good search experience.
Text processing can be configured by selecting an off-the-shelf `Tokenizer` …
Tantivy comes with a few tokenizers, but external crates offer advanced tokenizers, such as [Lindera](https://crates.io/crates/lindera) for Japanese.
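The kind of pipeline described earlier (split into tokens, drop punctuation, lowercase) can be sketched as follows. This is a hypothetical tokenizer, with stemming omitted:

```rust
// Hypothetical tokenizer: split on non-alphanumeric characters
// (dropping punctuation in the process) and lowercase each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|tok| !tok.is_empty())
        .map(|tok| tok.to_lowercase())
        .collect()
}

fn main() {
    assert_eq!(
        tokenize("Hello, happy tax payer!"),
        vec!["hello", "happy", "tax", "payer"]
    );
}
```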
## [query/](src/query): Define and compose queries
The [Query](src/query/query.rs) trait defines what a query is.