Commit 19074e1

Merge pull request #1445 from kianmeng/fix-typos-and-markdowns
Fix typos and markdowns
2 parents 8e773ad + 014b1ad commit 19074e1

62 files changed: +201 -203 lines changed

ARCHITECTURE.md

+19-19
@@ -10,6 +10,7 @@ Tantivy's bread and butter is to address the problem of full-text search :
1010
Given a large set of textual documents and a text query, return the K most relevant documents as efficiently as possible. To execute these queries rapidly, tantivy needs to build an index beforehand. The relevance score implemented in tantivy is not configurable. Tantivy uses the same score as the default similarity used in Lucene / Elasticsearch, called [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
1111

1212
But tantivy's scope does not stop there. Numerous features are required to power rich-search applications. For instance, one may want to:
13+
1314
- compute the count of documents matching a query in the different sections of an e-commerce website,
1415
- display an average price per square meter for a real estate search engine,
1516
- take into account historical user data to rank documents in a specific way,
@@ -22,27 +23,28 @@ rapidly select all documents matching a given predicate (also known as a query)
2223
collect some information about them ([See collector](#collector-define-what-to-do-with-matched-documents)).
2324

2425
Roughly speaking, the design follows these guiding principles:
26+
2527
- Search should be O(1) in memory.
2628
- Indexing should be O(1) in memory. (In practice it is just sublinear)
2729
- Search should be as fast as possible
2830

2931
This comes at the cost of the dynamicity of the index: while it is possible to add and delete documents from our corpus, tantivy is designed to handle these updates in large batches.
3032

31-
## [core/](src/core): Index, segments, searchers.
33+
## [core/](src/core): Index, segments, searchers
3234

3335
Core contains all of the high-level code to make it possible to create an index, add documents, delete documents and commit.
3436

3537
This is at once the most high-level part of tantivy, the least performance-sensitive one, and seemingly the most mundane code... and, paradoxically, the most complicated part.
3638

37-
### Index and Segments...
39+
### Index and Segments
3840

39-
A tantivy index is a collection of smaller independent immutable segments.
41+
A tantivy index is a collection of smaller independent immutable segments.
4042
Each segment contains its own independent set of data structures.
4143

4244
A segment is identified by a segment id that is in fact a UUID.
4345
The file of a segment has the format
4446

45-
```segment-id . ext ```
47+
```segment-id . ext```
4648

4749
The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.
4850

@@ -52,17 +54,15 @@ On commit, one segment per indexing thread is written to disk, and the `meta.jso
5254

5355
For a better idea of how indexing works, you may read the [following blog post](https://fulmicoton.com/posts/behold-tantivy-part2/).
5456

55-
5657
### Deletes
5758

5859
Deletes happen by deleting a "term". Tantivy does not offer any notion of primary id, so it is up to the user to use a field in their schema as if it was a primary id, and delete the associated term if they want to delete only one specific document.
5960

6061
On commit, tantivy will find all of the segments with documents matching this existing term and remove them from the [alive bitset file](src/fastfield/alive_bitset.rs) that represents the bitset of the alive document ids.
61-
Like all segment files, this file is immutable. Because it is possible to have more than one alive bitset file at a given instant, the alive bitset filename has the format ``` segment_id . commit_opstamp . del```.
62+
Like all segment files, this file is immutable. Because it is possible to have more than one alive bitset file at a given instant, the alive bitset filename has the format ```segment_id . commit_opstamp . del```.
6263

6364
An opstamp is simply an incremental id that identifies any operation applied to the index. For instance, performing a commit or adding a document.
6465

65-
6666
### DocId
6767

6868
Within a segment, all documents are identified by a DocId that ranges within `[0, max_doc)`.
@@ -74,6 +74,7 @@ The DocIds are simply allocated in the order documents are added to the index.
7474

7575
In separate threads, tantivy's index writer searches for opportunities to merge segments.
7676
The point of segment merge is to:
77+
7778
- eventually get rid of tombstoned documents
7879
- reduce the otherwise ever-growing number of segments.
7980

@@ -104,6 +105,7 @@ Tantivy's document follows a very strict schema, decided before building any ind
104105
The schema defines all of the fields that the indexes [`Document`](src/schema/document.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.
105106

106107
Depending on the type of the field, you can decide to
108+
107109
- put it in the docstore
108110
- store it as a fast field
109111
- index it
@@ -117,9 +119,10 @@ As of today, tantivy's schema imposes a 1:1 relationship between a field that is
117119

118120
This is not something tantivy supports, and it is up to the user to duplicate field / concatenate fields before feeding them to tantivy.
119121

120-
## General information about these data structures.
122+
## General information about these data structures
121123

122124
All data structures in tantivy have:
125+
123126
- a writer
124127
- a serializer
125128
- a reader
@@ -132,7 +135,7 @@ This conversion is done by the serializer.
132135
Finally, the reader is in charge of offering an API to read this on-disk, read-only representation.
133136
In tantivy, readers are designed to require very little anonymous memory. The data is read straight from an mmapped file, and loading an index is as fast as mmapping its files.
134137

135-
## [store/](src/store): Here is my DocId, Gimme my document!
138+
## [store/](src/store): Here is my DocId, Gimme my document
136139

137140
The docstore is a row-oriented storage that, for each document, stores a subset of the fields
138141
that are marked as stored in the schema. The docstore is compressed using a general-purpose algorithm
@@ -146,6 +149,7 @@ Once the top 10 documents have been identified, we fetch them from the store, an
146149
**Not useful for**
147150

148151
Fetching a document from the store is typically a "slow" operation. It usually consists of
152+
149153
- searching into a compact tree-like data structure to find the position of the right block.
150154
- decompressing a small block
151155
- returning the document from this block.
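
The three steps above can be pictured with a minimal sketch. It is not tantivy's implementation: `Block`, `DocStore` and their fields are hypothetical names, a binary search over a sorted array stands in for the compact tree-like structure, and blocks are kept uncompressed so that the example has no dependencies.

```rust
/// One block of the store. In a real docstore the block would be compressed;
/// here it holds plain documents (an assumption made to keep the sketch small).
struct Block {
    first_doc_id: u32,
    docs: Vec<String>,
}

/// Hypothetical row-oriented store: blocks are sorted by `first_doc_id`.
struct DocStore {
    blocks: Vec<Block>,
}

impl DocStore {
    fn get(&self, doc_id: u32) -> Option<&str> {
        // 1. Find the last block whose first doc id is <= doc_id
        //    (binary search stands in for the tree-like lookup).
        let idx = self.blocks.partition_point(|b| b.first_doc_id <= doc_id);
        let block = self.blocks.get(idx.checked_sub(1)?)?;
        // 2. A real store would decompress the whole block here,
        //    which is why each fetch is comparatively expensive.
        // 3. Return the document at its offset within the block.
        block
            .docs
            .get((doc_id - block.first_doc_id) as usize)
            .map(String::as_str)
    }
}
```

Even in this simplified form, every call touches a whole block to return a single document, which is the reason the store is not meant to be hit for every matching document.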
@@ -154,16 +158,15 @@ It is NOT meant to be called for every document matching a query.
154158

155159
As a rule of thumb, if you hit the docstore more than 100 times per search query, you are probably misusing tantivy.
156160

157-
158-
## [fastfield/](src/fastfield): Here is my DocId, Gimme my value!
161+
## [fastfield/](src/fastfield): Here is my DocId, Gimme my value
159162

160163
Fast fields are stored in a column-oriented storage that allows for random access.
161164
The only compression applied is bitpacking. The column comes with two pieces of metadata.
162165
The minimum value in the column and the number of bits per doc.
163166

164167
Fetching a value for a `DocId` is then as simple as computing
165168

166-
```
169+
```rust
167170
min_value + fetch_bits(num_bits * doc_id..num_bits * (doc_id+1))
168171
```
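
To make the formula concrete, here is a naive sketch of a bitpacked column. `BitpackedColumn`, `from_values` and `get` are hypothetical names, and the bit-by-bit loops are chosen for readability rather than speed, so this should not be read as tantivy's actual fast field code.

```rust
/// A bitpacked column: every value is stored as `value - min_value`
/// on exactly `num_bits` bits, back to back.
struct BitpackedColumn {
    min_value: u64,
    num_bits: u32,
    data: Vec<u8>,
}

impl BitpackedColumn {
    /// Builds the column from one value per doc (doc_id = position in the slice).
    fn from_values(values: &[u64]) -> Self {
        let min_value = values.iter().copied().min().unwrap_or(0);
        let max_delta = values.iter().map(|v| v - min_value).max().unwrap_or(0);
        // Number of bits needed to represent the largest delta.
        let num_bits = 64 - max_delta.leading_zeros();
        let total_bits = num_bits as usize * values.len();
        let mut data = vec![0u8; (total_bits + 7) / 8];
        for (doc_id, value) in values.iter().enumerate() {
            let delta = value - min_value;
            for bit in 0..num_bits as usize {
                if (delta >> bit) & 1 == 1 {
                    let pos = doc_id * num_bits as usize + bit;
                    data[pos / 8] |= 1 << (pos % 8);
                }
            }
        }
        BitpackedColumn { min_value, num_bits, data }
    }

    /// min_value + fetch_bits(num_bits * doc_id..num_bits * (doc_id + 1))
    fn get(&self, doc_id: u32) -> u64 {
        let start = doc_id as usize * self.num_bits as usize;
        let mut delta = 0u64;
        for bit in 0..self.num_bits as usize {
            let pos = start + bit;
            if (self.data[pos / 8] >> (pos % 8)) & 1 == 1 {
                delta |= 1u64 << bit;
            }
        }
        self.min_value + delta
    }
}
```

Because every value occupies the same number of bits, the position of a doc's value is a simple multiplication, which is what makes random access cheap.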
169172

@@ -190,7 +193,7 @@ For advanced search engine, it is possible to store all of the features required
190193

191194
Finally facets are a specific kind of fast field, and the associated source code is in [`fastfield/facet_reader.rs`](src/fastfield/facet_reader.rs).
192195

193-
# The inverted search index.
196+
# The inverted search index
194197

195198
The inverted index is the core part of full-text search.
196199
When presented with a new document with the text field "Hello, happy tax payer!", tantivy breaks it into a list of so-called tokens. In addition to just splitting these strings into tokens, it might also perform other operations such as dropping the punctuation, converting characters to lowercase, applying stemming, etc. Tantivy makes it possible to configure the operations to be applied in the schema (tokenizer/ is the place where these operations are implemented).
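
As a rough illustration of such a pipeline, the sketch below splits on non-alphanumeric characters, drops the punctuation and lowercases each token. The `tokenize` function is a hypothetical helper written for this example; it does not use tantivy's tokenizer API and does no stemming.

```rust
/// Splits `text` on any non-alphanumeric character, drops empty pieces
/// (which removes punctuation and whitespace), and lowercases each token.
/// Hypothetical helper for illustration only.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}
```

With this sketch, `tokenize("Hello, happy tax payer!")` returns the tokens `hello`, `happy`, `tax` and `payer`.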
@@ -215,19 +218,18 @@ The inverted index actually consists of two data structures chained together.
215218

216219
Where [TermInfo](src/postings/term_info.rs) is an object containing some meta data about a term.
217220

218-
219-
## [termdict/](src/termdict): Here is a term, give me the [TermInfo](src/postings/term_info.rs)!
221+
## [termdict/](src/termdict): Here is a term, give me the [TermInfo](src/postings/term_info.rs)
220222

221223
Tantivy's term dictionary is mainly in charge of supplying the function
222224

223225
[Term](src/schema/term.rs) ⟶ [TermInfo](src/postings/term_info.rs)
224226

225227
It is itself broken into two parts.
228+
226229
- [Term](src/schema/term.rs) ⟶ [TermOrdinal](src/termdict/mod.rs) is addressed by a finite state transducer, implemented by the fst crate.
227230
- [TermOrdinal](src/termdict/mod.rs) ⟶ [TermInfo](src/postings/term_info.rs) is addressed by the term info store.
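
The two-step lookup can be sketched as follows. This is only an illustration under simplifying assumptions: a sorted `Vec<String>` with binary search stands in for the finite state transducer, a plain `Vec` stands in for the term info store, and the `TermInfo` fields shown here are hypothetical and reduced to a minimum.

```rust
/// Simplified, hypothetical stand-in for the metadata attached to a term.
#[derive(Debug, Clone, PartialEq)]
struct TermInfo {
    doc_freq: u32,
    postings_offset: u64,
}

struct TermDictionary {
    /// Stand-in for the FST: terms in sorted order, the position is the term ordinal.
    sorted_terms: Vec<String>,
    /// Stand-in for the term info store: indexed by term ordinal.
    term_infos: Vec<TermInfo>,
}

impl TermDictionary {
    /// Step 1: term to term ordinal.
    fn term_ord(&self, term: &str) -> Option<u64> {
        self.sorted_terms
            .binary_search_by(|probe| probe.as_str().cmp(term))
            .ok()
            .map(|ord| ord as u64)
    }

    /// Step 2: term ordinal to term info.
    fn term_info_from_ord(&self, ord: u64) -> Option<&TermInfo> {
        self.term_infos.get(ord as usize)
    }

    /// Both steps chained: term to term info.
    fn get(&self, term: &str) -> Option<&TermInfo> {
        self.term_info_from_ord(self.term_ord(term)?)
    }
}
```

Splitting the lookup this way keeps the first structure responsible only for mapping terms to dense integers, so the second one can be a simple array-like store.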
228231

229-
230-
## [postings/](src/postings): Iterate over documents... very fast!
232+
## [postings/](src/postings): Iterate over documents... very fast
231233

232234
A posting list makes it possible to store a sorted list of doc ids and for each doc store
233235
a term frequency as well.
@@ -257,7 +259,6 @@ we advance the position reader by the number of term frequencies of the current
257259
The [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) formula also requires knowing the number of tokens stored in a specific field for a given document. We store this information in one byte per document in the fieldnorm.
258260
The fieldnorm is therefore compressed. Values up to 40 are encoded unchanged.
259261

260-
261262
## [tokenizer/](src/tokenizer): How should we process text?
262263

263264
Text processing is key to a good search experience.
@@ -268,7 +269,6 @@ Text processing can be configured by selecting an off-the-shelf [`Tokenizer`](./
268269

269270
Tantivy comes with a few tokenizers, but external crates offer advanced tokenizers, such as [Lindera](https://crates.io/crates/lindera) for Japanese.
270271

271-
272272
## [query/](src/query): Define and compose queries
273273

274274
The [Query](src/query/query.rs) trait defines what a query is.
