Collection of Node.js tools to compare documents using embeddings. Documents are stored in a library, which currently supports GitHub issues only.
The input data is a collection of documents in the following format:
```js
[
  {
    number: 4711,
    title: 'Title of the issue',
    labels: ['label1', 'label2'],
    body: 'Issue description'
  }, // …
]
```
It can be retrieved from the GitHub list repository issues API.
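Mapping a raw API response into this shape could look like the sketch below. `toDocument` is a hypothetical helper, and the labels handling assumes the label objects returned by the REST API:

```js
// Sketch: map a raw issue from GET /repos/{owner}/{repo}/issues
// into the document format above. The field names (number, title,
// labels, body) follow the GitHub REST API.
function toDocument (issue) {
  return {
    number: issue.number,
    title: issue.title,
    // The API may return label objects or plain strings; keep just the names.
    labels: issue.labels.map(l => typeof l === 'string' ? l : l.name),
    body: issue.body ?? ''
  }
}

// Example usage with the built-in fetch (Node.js >= 18), OWNER/REPO
// being placeholders:
// const res = await fetch('https://api.github.com/repos/OWNER/REPO/issues')
// const docs = (await res.json()).map(toDocument)
```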
Each document is transformed into a vector by a pipeline with the following steps, each feeding into the next:
- Each label is converted into a simple string unless it is already a string.
- Comments are retrieved from the list issue comments API and converted into a string.
- The issue is transformed into a text string. (This currently includes the title, body, and comments.)
- Code delimiters are removed.
- Stacks are reduced by omitting irrelevant frames. Leading and trailing whitespace is removed in the same step. (This step is somewhat opinionated.)
- Local paths are stripped of common prefixes. (This step is somewhat opinionated.)
- The string is converted to lowercase.
- The string is transformed into tokens (tokenized).
- Stopwords are removed.
- The tokens are transformed into n-grams.
- The TF-IDF is calculated for each n-gram. All TF-IDF values are normalized to the range [0, 1].
- For each document, the n-grams are filtered by a TF-IDF threshold (0.1 by default).
- Each remaining n-gram is transformed into an embedding vector using the OpenAI embeddings API.
- The embeddings are transformed into a single vector for the document by calculating their mean value.
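The text-preparation stages (lowercasing, tokenization, stopword removal, n-grams) can be sketched as follows. The stopword list and the tokenizer regex are simplified placeholders, not the library's actual implementation:

```js
// Simplified stand-ins for the text-preparation stages described above.
const STOPWORDS = new Set(['a', 'the', 'is', 'to', 'it']) // placeholder list

const lowercase = s => s.toLowerCase()
const tokenize = s => s.match(/[a-z0-9]+/g) ?? []
const removeStopwords = tokens => tokens.filter(t => !STOPWORDS.has(t))
// Sliding window of n consecutive tokens, joined into one string each.
const ngrams = (tokens, n = 2) =>
  tokens.slice(0, tokens.length - n + 1).map((_, i) => tokens.slice(i, i + n).join(' '))

const prepared = ngrams(removeStopwords(tokenize(lowercase('The database connection times out'))))
// → [ 'database connection', 'connection times', 'times out' ]
```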
The similarity between two documents is computed as the cosine similarity between their embeddings. Since OpenAI embeddings are normalized, cosine similarity is identical to the dot (scalar) product.
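This equivalence is easy to verify: for unit-length vectors the denominator of the cosine formula is 1, so only the dot product remains. A minimal sketch:

```js
// Cosine similarity reduces to a plain dot product when both
// vectors have unit length, as OpenAI embeddings (nominally) do.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0)
const norm = a => Math.sqrt(dot(a, a))
const cosine = (a, b) => dot(a, b) / (norm(a) * norm(b))

// Two unit vectors: cosine similarity and dot product agree
// (up to floating-point noise).
const u = [0.6, 0.8]
const v = [0.8, 0.6]
console.log(dot(u, v), cosine(u, v))
```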
```js
const corpus = [
  {
    number: 124,
    title: 'Cannot render page',
    labels: ['bug'],
    body: 'I get an error message: "Known problem with renderer: 124 (failed to reconcile)". It may be related to the reconcile process.'
  },
  {
    number: 125,
    title: 'Database timeout after jiffy',
    labels: ['bug', 'database'],
    body: 'The database connection times out within a jiffy when trying to query large datasets. It seems related to the connection pool limit.'
  },
  {
    number: 126,
    title: 'Cannot connect to database',
    labels: ['bug', 'database'],
    body: 'The database connection fails with an error message: "Connection refused". It may be related to the connection string.'
  }
]
```
```js
const Library = require('./src/docs/library')

// Run inside an async context — CJS has no top-level await.
const library = new Library()
await library.init(corpus)

const newDoc = {
  number: 127,
  title: 'Failure to query in a jiffy',
  labels: ['database', 'bug'],
  body: 'When I send a large query to the database, it times out before a jiffy has passed. Is there a connection pool limit?'
}
await library.addDoc(newDoc)
console.dir(library.getMostSimilarDocs(127))
```
There are two places for config settings. Adjust them as needed before running the code:
- `data/.private/config.json`. This file is checked in as a template and holds private settings.
  ⚠️ Do not check in your actual settings (e.g. in a fork)! Run `git update-index --assume-unchanged data/.private/config.json` to prevent accidental check-ins.
- `data/config.js`. This file is checked in and holds public settings.
The following measures are taken to potentially improve precision. The numbers are order-of-magnitude estimates from a quick exemplary run.

| Measure | Relative Difference in Cosine Similarity |
|---|---|
| Use Decimal.js for calculations | 1E-15 |
| Re-normalize embeddings | 1E-8 |
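The re-normalization step can be sketched as below: dividing each embedding by its actual Euclidean norm corrects the small floating-point drift away from unit length. This is a minimal illustration, not the repo's implementation:

```js
// Re-normalize an embedding to unit length. API-returned vectors are
// nominally normalized, but floating-point drift can leave the norm
// slightly off 1; dividing by the actual norm corrects this.
function renormalize (embedding) {
  const norm = Math.sqrt(embedding.reduce((sum, x) => sum + x * x, 0))
  return embedding.map(x => x / norm)
}
```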
- Rate limiting is not handled. (BLI)
- Settings are currently local to this repo.
- GitHub org is currently required.
- OpenAI API version is currently fixed.
- Settings are missing to control:
- the parameters of the pipeline functions (BLI)
- Cost calculation is missing. (BLI)
- Implemented in CJS.