Collection of Node.js tools to compare documents using embeddings. Documents are stored in a library, which currently supports GitHub issues only.
The input data is a collection of documents in the following format:
```js
[
  {
    number: 4711,
    title: 'Title of the issue',
    labels: ['label1', 'label2'],
    body: 'Issue description'
  }, // …
]
```
It can be retrieved from the GitHub list repository issues API.
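Mapping a raw API response into this shape could look like the sketch below. `toDocument` is a hypothetical helper, and the labels handling assumes the label objects returned by the REST API:

```js
// Sketch: map a raw issue from GET /repos/{owner}/{repo}/issues
// into the document format above. The field names (number, title,
// labels, body) follow the GitHub REST API.
function toDocument (issue) {
  return {
    number: issue.number,
    title: issue.title,
    // The API may return label objects or plain strings; keep just the names.
    labels: issue.labels.map(l => typeof l === 'string' ? l : l.name),
    body: issue.body ?? ''
  }
}

// Example usage with the built-in fetch (Node.js >= 18), OWNER/REPO
// being placeholders:
// const res = await fetch('https://api.github.com/repos/OWNER/REPO/issues')
// const docs = (await res.json()).map(toDocument)
```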
Each document is transformed into a vector by a pipeline with the following steps, each feeding into the next:
- Each label is converted into a simple string unless it is already a string.
- Comments are retrieved from the list issue comments API and converted into a string.
- The issue is transformed into a text string. (This currently includes the title, body, and comments.)
- Code delimiters are removed.
- Stacks are reduced by omitting irrelevant frames. Leading and trailing whitespace is removed in the same step. (This step is somewhat opinionated.)
- Local paths are stripped of common prefixes. (This step is somewhat opinionated.)
- The string is converted to lowercase.
- The string is transformed into tokens (tokenized).
- Stopwords are removed.
- The tokens are transformed into n-grams.
- The TF-IDF is calculated for each n-gram. All TF-IDF values are normalized to the range [0, 1].
- For each document, the n-grams are filtered by a TF-IDF threshold (0.1 by default).
- Each remaining n-gram is transformed into an embedding vector using the OpenAI embeddings API.
- The embeddings are transformed into a single vector for the document by calculating their mean value.
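The text-preparation stages (lowercasing, tokenization, stopword removal, n-grams) can be sketched as follows. The stopword list and the tokenizer regex are simplified placeholders, not the library's actual implementation:

```js
// Simplified stand-ins for the text-preparation stages described above.
const STOPWORDS = new Set(['a', 'the', 'is', 'to', 'it']) // placeholder list

const lowercase = s => s.toLowerCase()
const tokenize = s => s.match(/[a-z0-9]+/g) ?? []
const removeStopwords = tokens => tokens.filter(t => !STOPWORDS.has(t))
// Sliding window of n consecutive tokens, joined into one string each.
const ngrams = (tokens, n = 2) =>
  tokens.slice(0, tokens.length - n + 1).map((_, i) => tokens.slice(i, i + n).join(' '))

const prepared = ngrams(removeStopwords(tokenize(lowercase('The database connection times out'))))
// → [ 'database connection', 'connection times', 'times out' ]
```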
The similarity between two documents is computed as the cosine similarity between their embeddings. Since OpenAI embeddings are normalized, cosine similarity is identical to the dot (scalar) product.
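This equivalence is easy to verify: for unit-length vectors the denominator of the cosine formula is 1, so only the dot product remains. A minimal sketch:

```js
// Cosine similarity reduces to a plain dot product when both
// vectors have unit length, as OpenAI embeddings (nominally) do.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0)
const norm = a => Math.sqrt(dot(a, a))
const cosine = (a, b) => dot(a, b) / (norm(a) * norm(b))

// Two unit vectors: cosine similarity and dot product agree
// (up to floating-point noise).
const u = [0.6, 0.8]
const v = [0.8, 0.6]
console.log(dot(u, v), cosine(u, v))
```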
```js
const corpus = [
  {
    number: 124,
    title: 'Cannot render page',
    labels: ['bug'],
    body: 'I get an error message: "Known problem with renderer: 124 (failed to reconcile)". It may be related to the reconcile process.'
  },
  {
    number: 125,
    title: 'Database timeout after jiffy',
    labels: ['bug', 'database'],
    body: 'The database connection times out within a jiffy when trying to query large datasets. It seems related to the connection pool limit.'
  },
  {
    number: 126,
    title: 'Cannot connect to database',
    labels: ['bug', 'database'],
    body: 'The database connection fails with an error message: "Connection refused". It may be related to the connection string.'
  }
]
```
```js
const Library = require('./src/docs/library')

// Run inside an async context — CJS has no top-level await.
const library = new Library()
await library.init(corpus)

const newDoc = {
  number: 127,
  title: 'Failure to query in a jiffy',
  labels: ['database', 'bug'],
  body: 'When I send a large query to the database, it times out before a jiffy has passed. Is there a connection pool limit?'
}
await library.addDoc(newDoc)
console.dir(library.getMostSimilarDocs(127))
```
There are two places for config settings. Adjust them as needed before running the code:
- `data/.private/config.json`. This file is checked in as a template and holds private settings.
  ⚠️ Do not check in your actual settings (e.g. in a fork)! Run `git update-index --assume-unchanged data/.private/config.json` to prevent accidental check-ins.
- `data/config.js`. This file is checked in and holds public settings.
The following measures are taken to potentially improve precision. The numbers are order-of-magnitude estimates from a quick exemplary run.

| Measure | Relative Difference in Cosine Similarity |
|---|---|
| Use Decimal.js for calculations | 1E-15 |
| Re-normalize embeddings | 1E-8 |
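The re-normalization step can be sketched as below: dividing each embedding by its actual Euclidean norm corrects the small floating-point drift away from unit length. This is a minimal illustration, not the repo's implementation:

```js
// Re-normalize an embedding to unit length. API-returned vectors are
// nominally normalized, but floating-point drift can leave the norm
// slightly off 1; dividing by the actual norm corrects this.
function renormalize (embedding) {
  const norm = Math.sqrt(embedding.reduce((sum, x) => sum + x * x, 0))
  return embedding.map(x => x / norm)
}
```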
- Rate limiting is not handled. (BLI)
- Settings are currently local to this repo.
- GitHub org is currently required.
- OpenAI API version is currently fixed.
- Settings are missing to control:
- the parameters of the pipeline functions (BLI)
- Cost calculation is missing. (BLI)
- Implemented in CJS.