Embeddings
Imports
from shutil import rmtree
Download Embedding Model
To be able to use local embedding models, the first step is to download them from Hugging Face using their Transformers library and save them locally on the file system.
download_embedding_model
download_embedding_model (model_path:str, model_name:str)
Download a Hugging Face model and tokenizer to the specified directory
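A minimal sketch of what this function can look like, assuming the standard Transformers API (AutoModel, AutoTokenizer, and their save_pretrained methods); the actual implementation may differ:

```python
import os

def download_embedding_model(model_path: str, model_name: str) -> None:
    """Download a Hugging Face model and tokenizer into model_path."""
    from transformers import AutoModel, AutoTokenizer  # lazy import
    os.makedirs(model_path, exist_ok=True)
    AutoModel.from_pretrained(model_name).save_pretrained(model_path)
    AutoTokenizer.from_pretrained(model_name).save_pretrained(model_path)
```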
Tests
download_embedding_model('test-download/', 'prajjwal1/bert-tiny')
assert os.path.exists('test-download/config.json')
assert os.path.exists('test-download/pytorch_model.bin')
assert os.path.exists('test-download/special_tokens_map.json')
assert os.path.exists('test-download/tokenizer_config.json')
assert os.path.exists('test-download/vocab.txt')
# tear down
rmtree('test-download/')
Load Embedding Model
Once the models are available locally, the next step is to load them in memory so they can be used to create the embeddings for the PDF files. Because load_embedding_model can be called numerous times, we memoize the result to speed up the process. There is no need for an LRU cache here since only a single item should be cached anyway, so let's simplify the code.
load_embedding_model
load_embedding_model (model_path:str)
Load a Hugging Face model and tokenizer from the specified directory
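The single-slot memoization described above can be sketched like this; a stub loader stands in for the transformers calls (AutoModel.from_pretrained and AutoTokenizer.from_pretrained) so the caching behavior is visible on its own:

```python
_cache: dict = {}

def _load_from_disk(model_path: str):
    # stand-in for AutoModel.from_pretrained / AutoTokenizer.from_pretrained
    return object(), object()

def load_embedding_model(model_path: str):
    """Single-slot memoized loader: repeated calls with the same path
    return the same (model, tokenizer) pair without reloading."""
    if _cache.get('path') != model_path:
        model, tokenizer = _load_from_disk(model_path)
        _cache.update(path=model_path, pair=(model, tokenizer))
    return _cache['pair']

a = load_embedding_model('test-download/')
b = load_embedding_model('test-download/')
print(a is b)  # → True
```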
Tests
from shutil import rmtree
download_embedding_model('test-download/', 'prajjwal1/bert-tiny')
model, tokenizer = load_embedding_model('test-download/')
assert model is not None
assert tokenizer is not None
# tear down
rmtree('test-download/')
Embed (Local Model)
embed_text
embed_text (text:str, model, tokenizer)
Embed a text using a Hugging Face model and tokenizer
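One way this function can work with a Hugging Face model is mean pooling over the last hidden state; the pooling strategy here is one common choice, not necessarily the one ReadNext uses:

```python
def embed_text(text: str, model, tokenizer):
    """Embed a text: tokenize, run a forward pass, then mean-pool the
    last hidden state into a single (1, hidden_size) tensor."""
    import torch  # lazy import
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)
```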
Tests
from shutil import rmtree
download_embedding_model('test-download/', 'BAAI/bge-base-en')
model, tokenizer = load_embedding_model('test-download/')
tensor = embed_text('Hello world!', model, tokenizer)
assert len(tensor.tolist()[0]) == 768  # bge-base-en hidden size
# tear down
rmtree('test-download/')
Get Embedding System
We need to be able to easily identify the embedding system currently configured by the user. This utility function simplifies the comprehension of the code elsewhere in the codebase.
embedding_system
embedding_system ()
Return a unique identifier for the embedding system currently in use
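A sketch of this utility, assuming the embedding system is selected through an environment variable; both the variable name EMBEDDING_SYSTEM and the default identifier are assumptions, not taken from the text:

```python
import os

def embedding_system() -> str:
    """Return a unique identifier for the embedding system in use."""
    # EMBEDDING_SYSTEM is a hypothetical variable name; the default
    # local-model identifier is also an assumption.
    return os.environ.get('EMBEDDING_SYSTEM', 'baai-bge-base-en')

os.environ['EMBEDDING_SYSTEM'] = 'cohere'
print(embedding_system())  # → cohere
```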
Get Embeddings (from any supported system)
get_embeddings
get_embeddings (text:str)
Get embeddings for a text using any supported embedding system.
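The dispatch can be sketched as below. The backend functions here are stand-in stubs returning placeholder vectors; in ReadNext they would call the local Hugging Face model or the Cohere API respectively:

```python
import os

def embedding_system() -> str:
    # hypothetical selector mirroring the utility above
    return os.environ.get('EMBEDDING_SYSTEM', 'local')

def _embed_local(text: str) -> list:
    return [0.0] * 128    # placeholder vector, bert-tiny-sized

def _embed_cohere(text: str) -> list:
    return [0.0] * 4096   # placeholder vector

def get_embeddings(text: str) -> list:
    """Dispatch to whichever embedding backend is configured."""
    system = embedding_system()
    if system == 'local':
        return _embed_local(text)
    if system == 'cohere':
        return _embed_cohere(text)
    raise ValueError(f"Unsupported embedding system: {system}")

os.environ['EMBEDDING_SYSTEM'] = 'local'
print(len(get_embeddings('Hello world!')))  # → 128
```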
PDF to Text
The PdfReader class is used to extract the text from the PDF files.
pdf_to_text
pdf_to_text (file_path:str)
Read a PDF file and output it as a text string.
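A sketch assuming PdfReader comes from the pypdf package (the text above only names PdfReader, so the package is an assumption; PyPDF2 exports the same class):

```python
def pdf_to_text(file_path: str) -> str:
    """Read a PDF file and return its text content as a single string."""
    from pypdf import PdfReader  # lazy import; pypdf is assumed
    reader = PdfReader(file_path)
    return "".join(page.extract_text() or "" for page in reader.pages)
```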
Tests
assert pdf_to_text("../tests/assets/test.pdf") == "this is a test"
assert pdf_to_text("../tests/assets/test.pdf") != "this is a test foo"
Get PDF files from a folder
get_pdfs_from_folder
get_pdfs_from_folder (folder_path:str)
Given a folder path, return all the PDF files existing in that folder.
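A sketch of this helper over os.listdir; the sorting is added here for determinism and is an assumption about the original:

```python
import os
import tempfile

def get_pdfs_from_folder(folder_path: str) -> list:
    """Return the names of the PDF files directly inside folder_path."""
    return sorted(f for f in os.listdir(folder_path)
                  if f.lower().endswith('.pdf'))

# demo on a throwaway folder
demo = tempfile.mkdtemp()
for name in ('test.pdf', 'notes.txt'):
    open(os.path.join(demo, name), 'w').close()
print(get_pdfs_from_folder(demo))  # → ['test.pdf']
```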
Tests
assert get_pdfs_from_folder("../tests/assets/") == ['test.pdf']
assert get_pdfs_from_folder("../tests/assets/") != ['test.pdf', 'foo.pdf']
Get Chroma Collection Name
It is important that the number of dimensions of the embeddings stored in a Chroma collection matches the number of dimensions used when the collection is queried. For example, depending on what they want to use, a user may at one time use the local embedding model and at another time the Cohere embedding service; the two produce embeddings with different numbers of dimensions. To avoid this problem, we encode the embedding system in the name of the collection. This way, the number of dimensions stays the same for a given collection, no matter which embedding model is currently configured.
def get_chroma_collection_name(name: str) -> str:
    """Get the name of the ChromaDB collection to use."""
    # Tie the collection name to the embedding system in use so that a
    # collection always holds embeddings with a single number of
    # dimensions (the exact composition scheme is an assumption here).
    return f"{name}_{embedding_system()}"
Embed all papers of an arXiv category
The embedding database management system ReadNext uses is Chroma.
The embedding DBMS is organized as follows:
- Each category (sub or top category) becomes a collection of embeddings
- We have one global collection named all that contains the embeddings of every known category
When a new arXiv category is processed, the embeddings of all the papers it contains will be added to the collection related to its category, and to the global collection.
For the category collections, we have to prefix each category with _arxiv to work around Chroma's restriction that a collection name must be at least three characters long.
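The prefixing rule can be sketched as follows; the exact composition (separator and order of prefix and category) is not shown in the text, so the arxiv_<category> form used here is an assumption:

```python
def category_collection_name(category: str) -> str:
    # keeps every name at least three characters long, as Chroma requires
    return f"arxiv_{category}"

print(category_collection_name('AI'))  # → arxiv_AI
```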
embed_category_papers
embed_category_papers (category:str)
Given an arXiv category, create the embeddings for each PDF paper available locally. Embedding currently uses Cohere's embedding service. Returns True if successful, False otherwise.
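The two-collection flow described above can be sketched with a plain dict of lists standing in for Chroma; the extra parameters and the placeholder embedding are assumptions made so the sketch is self-contained (the real function only takes the category and talks to Chroma and the embedding backend):

```python
import os
import tempfile

def embed_category_papers(category: str, papers_dir: str, store: dict) -> bool:
    """Write each paper's embedding both to the category collection
    and to the global 'all' collection."""
    collection = f"arxiv_{category}"      # assumed naming scheme
    try:
        for name in os.listdir(papers_dir):
            if not name.lower().endswith('.pdf'):
                continue
            embedding = [0.0] * 128       # placeholder for get_embeddings(...)
            store.setdefault(collection, []).append((name, embedding))
            store.setdefault('all', []).append((name, embedding))
        return True
    except OSError:
        return False

# demo
papers = tempfile.mkdtemp()
open(os.path.join(papers, 'paper1.pdf'), 'w').close()
store = {}
print(embed_category_papers('cs.AI', papers, store))  # → True
print(sorted(store))  # → ['all', 'arxiv_cs.AI']
```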