By default, ReadNext generates embeddings with Hugging Face models that it downloads and runs locally. Optionally, it can use an external embedding service; at the moment, the only one integrated is the Cohere embedding model.


Download Embedding Model

To use a local embedding model, the first step is to download it from Hugging Face using the Transformers library and save it to the local file system.


 download_embedding_model (model_path:str, model_name:str)

Download a Hugging Face model and tokenizer to the specified directory


import os
from shutil import rmtree

download_embedding_model('test-download/', 'prajjwal1/bert-tiny')

# the model and tokenizer files should now exist locally
assert os.path.exists('test-download/config.json')
assert os.path.exists('test-download/pytorch_model.bin')
assert os.path.exists('test-download/special_tokens_map.json')
assert os.path.exists('test-download/tokenizer_config.json')
assert os.path.exists('test-download/vocab.txt')

# tear down
rmtree('test-download/')
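
As a rough illustration of the download step described above, saving a model with the Transformers library could look like the following. This is a minimal sketch, not ReadNext's actual implementation: the function name download_model_sketch is hypothetical, and it assumes that the transformers package (with a backend such as PyTorch) is installed and that the Hugging Face Hub is reachable.

```python
def download_model_sketch(model_path: str, model_name: str) -> None:
    """Download a model and its tokenizer, then save both under model_path."""
    # Imported lazily so that merely defining the function does not
    # require transformers to be installed.
    from transformers import AutoModel, AutoTokenizer

    # Fetch the weights and tokenizer files from the Hugging Face Hub.
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Write the config, weights and tokenizer files to the local directory.
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)
```

Depending on the Transformers version, the weights may be written as model.safetensors rather than pytorch_model.bin, so the exact set of files on disk can vary.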

Load Embedding Model

Once the models are available locally, the next step is to load them into memory so they can be used to create the embeddings for the PDF files. Because load_embedding_model can be called many times, its result is memoized to speed up the process. An LRU cache is unnecessary here, since only a single item ever needs to be cached, which keeps the code simple.


 load_embedding_model (model_path:str)

Load a Hugging Face model and tokenizer from the specified directory


from shutil import rmtree

download_embedding_model('test-download/', 'prajjwal1/bert-tiny')

model, tokenizer = load_embedding_model('test-download/')

assert model is not None
assert tokenizer is not None

# tear down
rmtree('test-download/')

Embed (Local Model)


 embed_text (text:str, model, tokenizer)

Embed a text using a Hugging Face model and tokenizer


from shutil import rmtree

download_embedding_model('test-download/', 'prajjwal1/bert-tiny')

model, tokenizer = load_embedding_model('test-download/')

tensor = embed_text('Hello world!', model, tokenizer)

# bert-tiny has a hidden size of 128, so the embedding has 128 dimensions
assert len(tensor.tolist()[0]) == 128

# tear down
rmtree('test-download/')

Get Embedding System

We need to be able to easily identify the embedding system currently configured by the user. This is a utility function that makes the code elsewhere in the codebase easier to follow.


 embedding_system ()

Return a unique identifier for the embedding system currently in use

Get Embeddings (From any supported system)


 get_embeddings (text:str)

Get embeddings for a text using any supported embedding system.
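
One way to picture a function that works with any supported embedding system is a small dispatch table keyed by the system identifier. The sketch below is only an illustration under stated assumptions: the identifiers "local" and "cohere", the explicit system parameter, and both stand-in embedders are hypothetical, and no real model or service is called.

```python
from typing import Callable, Dict, List


def embed_locally(text: str) -> List[float]:
    """Stand-in for a local Hugging Face model (hypothetical)."""
    return [float(len(text)), 0.0]


def embed_remotely(text: str) -> List[float]:
    """Stand-in for an external embedding service (hypothetical)."""
    return [float(len(text)), 1.0, 2.0]


# Maps an embedding-system identifier to the function that implements it.
EMBEDDERS: Dict[str, Callable[[str], List[float]]] = {
    "local": embed_locally,
    "cohere": embed_remotely,
}


def get_embeddings_sketch(text: str, system: str) -> List[float]:
    """Route the text to whichever embedding backend is selected."""
    try:
        embedder = EMBEDDERS[system]
    except KeyError:
        raise ValueError(f"Unsupported embedding system: {system}") from None
    return embedder(text)


local_vector = get_embeddings_sketch("Hello world!", "local")
remote_vector = get_embeddings_sketch("Hello world!", "cohere")
```

The two stand-ins deliberately return vectors of different lengths, mirroring the fact that different embedding systems generally produce embeddings with different numbers of dimensions.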

PDF to Text

The PdfReader class is used to extract the text from the PDF files.


 pdf_to_text (file_path:str)

Read a PDF file and output it as a text string.


assert pdf_to_text("../tests/assets/test.pdf") == "this is a test"
assert pdf_to_text("../tests/assets/test.pdf") != "this is a test foo"

Get PDF files from a folder


 get_pdfs_from_folder (folder_path:str)

Given a folder path, return all the PDF files existing in that folder.


assert get_pdfs_from_folder("../tests/assets/") == ['test.pdf']
assert get_pdfs_from_folder("../tests/assets/") != ['test.pdf', 'foo.pdf']
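
For illustration, listing the PDF files of a folder can be done with the standard library alone. This is a sketch, not the library's implementation: list_pdfs_sketch is a hypothetical name, and the case-insensitive extension match and the sorted output are assumptions. It returns bare file names, in line with the assertion above.

```python
import os
import tempfile
from typing import List


def list_pdfs_sketch(folder_path: str) -> List[str]:
    """Return the sorted file names in folder_path that end with .pdf."""
    return sorted(
        name
        for name in os.listdir(folder_path)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(folder_path, name))
    )


# Try it on a throwaway directory containing one PDF and one other file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("test.pdf", "notes.txt"):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("placeholder")
    found = list_pdfs_sketch(folder)
```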

Get Chroma Collection Name

The number of dimensions of an embedding must be the same when a Chroma collection is populated and when it is queried. Depending on their needs, a user may use the local embedding model at one time and the Cohere embedding service at another, and the two produce embeddings with different numbers of dimensions. To avoid this problem, the name of the collection is used to determine the number of dimensions of the embedding. This way, a given collection always holds embeddings of a single dimensionality, no matter which embedding model is currently configured.

def get_chroma_collection_name(name: str) -> str:
    """Get the name of the ChromaDB collection to use."""
    return os.environ.get('CHROMA_COLLECTION_NAME')
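
To illustrate the idea in the paragraph above, one possible way to keep embeddings of different dimensionality apart is to make the embedding system part of the collection name. This sketch is an assumption and differs from the function shown above: EMBEDDING_SYSTEM is a hypothetical environment variable and collection_name_sketch is a hypothetical helper.

```python
import os
import re


def collection_name_sketch(name: str) -> str:
    """Build a collection name that is specific to the embedding system."""
    # Hypothetical setting; falls back to "local" when it is not defined.
    system = os.environ.get("EMBEDDING_SYSTEM", "local")
    # Replace characters such as "/" so that model names stay name-safe.
    safe_system = re.sub(r"[^A-Za-z0-9_-]", "-", system)
    return f"{name}_{safe_system}"


os.environ["EMBEDDING_SYSTEM"] = "cohere"
cohere_name = collection_name_sketch("all")

os.environ["EMBEDDING_SYSTEM"] = "BAAI/bge-base-en"
local_name = collection_name_sketch("all")
```

With a scheme like this, switching the embedding system leads to a different collection, so vectors of different sizes never end up side by side.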

Embed all papers of an arXiv category

The embedding database management system ReadNext uses is Chroma.

The embedding DBMS is organized as follows:

  • Each category (sub or top category) becomes a collection of embeddings
  • One global collection named all contains the embeddings of every known category

When a new arXiv category is processed, the embeddings of all the papers it contains are added both to the collection for that category and to the global collection.

For the category collection, we have to prefix each category with _arxiv to work around a Chroma restriction: it does not accept a collection name with fewer than three characters.
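
The organization described above, where each embedding lands both in its category collection and in the global one, can be sketched with plain dictionaries. This is an in-memory stand-in rather than Chroma itself: add_paper_embedding, the paper identifiers and the vectors are all hypothetical, and the _arxiv naming detail is left out to keep the example short.

```python
from collections import defaultdict
from typing import Dict, List

# In-memory stand-in for the embedding database:
# collection name -> {paper id -> embedding vector}.
collections: Dict[str, Dict[str, List[float]]] = defaultdict(dict)

GLOBAL_COLLECTION = "all"


def add_paper_embedding(category: str, paper_id: str, vector: List[float]) -> None:
    """Store one embedding in its category collection and in the global one."""
    collections[category][paper_id] = vector
    collections[GLOBAL_COLLECTION][paper_id] = vector


add_paper_embedding("cs.AI", "paper-1", [0.1, 0.2])
add_paper_embedding("cs.CL", "paper-2", [0.3, 0.4])
```

After these two calls, each category collection holds one paper while the global collection holds both, which is what allows queries to run either within a category or across all of them.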


 embed_category_papers (category:str)

Given an arXiv category, create the embeddings for each PDF paper that exists locally. The embeddings are currently created with Cohere’s embedding service. Returns True if successful, False otherwise.