Embeddings

By default, ReadNext uses Hugging Face models, downloaded locally, to generate the embeddings. Optionally, it can use external embedding services; at the moment, only the Cohere embedding service is integrated.

Imports

Download Embedding Model

To be able to use a local embedding model, the first step is to download it from Hugging Face using their Transformers library and save it locally on the file system.


download_embedding_model

 download_embedding_model (model_path:str, model_name:str)

Download a Hugging Face model and tokenizer to the specified directory
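A minimal sketch of what this function's body might look like, assuming the standard Transformers `AutoModel`/`AutoTokenizer` APIs (the actual implementation may use different model classes):

```python
from transformers import AutoModel, AutoTokenizer

def download_embedding_model(model_path: str, model_name: str):
    """Download a Hugging Face model and tokenizer to the specified directory."""
    # Fetch the model and tokenizer from the Hugging Face Hub,
    # then persist both to the local file system.
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)
```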

Tests

import os
from shutil import rmtree

download_embedding_model('test-download/', 'prajjwal1/bert-tiny')

assert os.path.exists('test-download/config.json')
assert os.path.exists('test-download/pytorch_model.bin')
assert os.path.exists('test-download/special_tokens_map.json')
assert os.path.exists('test-download/tokenizer_config.json')
assert os.path.exists('test-download/vocab.txt')

# tear down
rmtree('test-download/')

Load Embedding Model

Once the models are available locally, the next step is to load them into memory so they can be used to create the embeddings for the PDF files. Because load_embedding_model can be called numerous times, we memoize the result to speed up the process. There is no need for an LRU cache here since only a single item will ever be cached, so let’s keep the code simple.


load_embedding_model

 load_embedding_model (model_path:str)

Load a Hugging Face model and tokenizer from the specified directory
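A sketch of the single-slot memoization described above, assuming the standard Transformers `AutoModel`/`AutoTokenizer` loading APIs (the cache variable name is an assumption, shown only to illustrate the pattern):

```python
_model_cache = None  # single cached (model_path, (model, tokenizer)) pair

def load_embedding_model(model_path: str):
    """Load a Hugging Face model and tokenizer, memoizing the last result."""
    global _model_cache
    # Only reload when a different path is requested; otherwise return
    # the single cached pair. This replaces a full LRU cache.
    if _model_cache is None or _model_cache[0] != model_path:
        from transformers import AutoModel, AutoTokenizer
        _model_cache = (model_path,
                        (AutoModel.from_pretrained(model_path),
                         AutoTokenizer.from_pretrained(model_path)))
    return _model_cache[1]
```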

Tests

from shutil import rmtree
download_embedding_model('test-download/', 'prajjwal1/bert-tiny')

model, tokenizer = load_embedding_model('test-download/')

assert model is not None
assert tokenizer is not None

# tear down
rmtree('test-download/')

Embed (Local Model)


embed_text

 embed_text (text:str, model, tokenizer)

Embed a text using a Hugging Face model and tokenizer
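A sketch of how such an embedding step might look, with mean pooling over the last hidden state (an assumption — the actual pooling strategy may differ, e.g. using the CLS token):

```python
import torch

def embed_text(text: str, model, tokenizer):
    """Embed a text using a Hugging Face model and tokenizer."""
    # Tokenize, run a forward pass without gradients, then mean-pool
    # the token embeddings into a single vector of shape [1, hidden_size].
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)
```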

Tests

from shutil import rmtree
download_embedding_model('test-download/', 'prajjwal1/bert-tiny')

model, tokenizer = load_embedding_model('test-download/')

tensor = embed_text('Hello world!', model, tokenizer)

assert len(tensor.tolist()[0]) == 128

# tear down
rmtree('test-download/')

Get Embedding System

We need to be able to easily identify the embedding system currently configured by the user. This is a utility function to simplify the comprehension of the code elsewhere in the codebase.


embedding_system

 embedding_system ()

Return a unique identifier for the embedding system currently in use
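One plausible convention for such a function — the `EMBEDDING_SYSTEM` environment variable used here is an assumption for illustration, not necessarily ReadNext’s actual configuration key:

```python
import os

def embedding_system() -> str:
    """Return a unique identifier for the embedding system currently in use."""
    # Hypothetical: read the configured system from an environment
    # variable, defaulting to the local Hugging Face model.
    return os.environ.get('EMBEDDING_SYSTEM', 'local').lower()
```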

Get Embeddings (From any supported system)


get_embeddings

 get_embeddings (text:str)

Get embeddings for a text using any supported embedding system.

PDF to Text

The PdfReader class is used to extract the text from the PDF files.


pdf_to_text

 pdf_to_text (file_path:str)

Read a PDF file and output it as a text string.

Tests

assert pdf_to_text("../tests/assets/test.pdf") == "this is a test"
assert pdf_to_text("../tests/assets/test.pdf") != "this is a test foo"

Get PDF files from a folder


get_pdfs_from_folder

 get_pdfs_from_folder (folder_path:str)

Given a folder path, return all the PDF files existing in that folder.
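A sketch of such a helper using only the standard library (sorting is an assumption, added for deterministic output):

```python
import os

def get_pdfs_from_folder(folder_path: str) -> list:
    """Given a folder path, return all the PDF files existing in that folder."""
    # List the directory and keep only files ending in .pdf,
    # sorted for deterministic results.
    return sorted(f for f in os.listdir(folder_path) if f.endswith('.pdf'))
```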

Tests

assert get_pdfs_from_folder("../tests/assets/") == ['test.pdf']
assert get_pdfs_from_folder("../tests/assets/") != ['test.pdf', 'foo.pdf']

Get Chroma Collection Name

It is important that the number of dimensions of the embeddings in a Chroma collection matches the number of dimensions used when the collection is queried. For example, depending on what they want to use, users may at one time use the local embedding model and at another time use the Cohere embedding service; in each case, the number of dimensions of the embeddings will differ. To avoid this problem, we use the name of the collection to determine the number of dimensions of the embeddings. This way, the number of dimensions stays consistent for a given collection, no matter which embedding model is used.

def get_chroma_collection_name(name: str) -> str:
    """Get the name of the ChromaDB collection to use."""
    
    # Fall back to the provided name when no override is configured.
    return os.environ.get('CHROMA_COLLECTION_NAME', name)
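The idea of keeping dimensions consistent per collection can also be illustrated by deriving a name from the configured system. Both this helper and the `EMBEDDING_SYSTEM` variable are hypothetical, shown only to make the reasoning concrete:

```python
import os

def collection_name_for(base: str) -> str:
    # Hypothetical: suffix the base name with the embedding system's
    # identifier, so that e.g. 'all_local' and 'all_cohere' never mix
    # embeddings of different dimensionality.
    system = os.environ.get('EMBEDDING_SYSTEM', 'local').lower()
    return f"{base}_{system}"
```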

Embed all papers of an arXiv category

The embedding database management system ReadNext uses is Chroma.

The embedding DBMS is organized as follows:

  • Each category (sub or top category) becomes a collection of embeddings
  • We have one global collection, named all, that contains the embeddings of every known category

When a new arXiv category is being processed, the embeddings of all the papers it contains are added both to the collection for that category and to the global collection.

For the category collections, we have to prefix each category with _arxiv to work around Chroma’s restriction that a collection name must be at least three characters long.
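The prefixing could look like the following hypothetical helper (the exact join format is an assumption; the docs only name the prefix):

```python
def category_collection_name(category: str) -> str:
    # Prefixing guarantees the collection name meets Chroma's
    # three-character minimum even for short categories like 'cs'
    # (assumed join format, for illustration only).
    return f"arxiv_{category}"
```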


embed_category_papers

 embed_category_papers (category:str)

Given an arXiv category, create the embeddings for each PDF paper available locally. Embeddings are currently created using Cohere’s embedding service. Returns True if successful, False otherwise.