arXiv papers are personalized with a vector search that compares the latest papers in an arXiv category against the papers the user is currently focusing on in their research. The papers the user is focusing on are stored in a Zotero collection.


import arxiv
import chromadb
import cohere
import os
from nameparser import HumanName
from pyzotero import zotero
from readnext.arxiv_categories import exists
from readnext.embedding import pdf_to_text, get_embeddings, embedding_system
from rich import print
from rich.progress import Progress

Get a Zotero collection ID from its name

The Zotero API always expects a collection ID, but that ID is hard to find in the Zotero user interface. This utility function looks up the ID of a collection from its name.


 get_collection_id_from_name (collection_name:str)

Return the ID of a collection from its name. Return an empty string if no collection with that name exists. The comparison is case insensitive.
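The lookup described above can be sketched as follows. This is a minimal illustration, not the ReadNext implementation: the helper name `find_collection_id` is hypothetical, and the input is assumed to have the shape pyzotero returns from `Zotero.collections()` (a list of dicts with a top-level `key` and a nested `data.name`).

```python
def find_collection_id(collections: list[dict], collection_name: str) -> str:
    """Return the key of the first collection whose name matches, else ''.

    Hypothetical helper: `collections` is assumed to look like the list
    returned by pyzotero's `Zotero.collections()`.
    """
    wanted = collection_name.casefold()  # case-insensitive comparison
    for collection in collections:
        name = collection.get("data", {}).get("name", "")
        if name.casefold() == wanted:
            return collection.get("key", "")
    return ""  # no collection with that name


# Sample data mimicking the assumed response shape (keys are made up).
sample_collections = [
    {"key": "ABCD1234", "data": {"name": "Readnext"}},
    {"key": "EFGH5678", "data": {"name": "Proposals"}},
]
found_id = find_collection_id(sample_collections, "readnext")
missing_id = find_collection_id(sample_collections, "does not exist")
```

With a live client, the same helper could be fed `zot.collections()`, where `zot` is a `pyzotero.zotero.Zotero` instance.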

Get all the items of a Zotero collection name

Gets all the items of a Zotero collection from its name. It reuses the function get_collection_id_from_name to get the collection ID from its name. Items are not limited to PDF papers: they can also be links to web pages, full-text notes, etc.


 get_target_collection_items (collection_name:str)

Given the name of a Zotero collection, return all the items from that collection.
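Because a collection mixes several kinds of items, code that consumes the result usually needs to tell them apart. The sketch below groups items by type. It assumes each item has the shape pyzotero returns (a dict with a nested `data.itemType`); the helper `group_items_by_type` and the sample keys are hypothetical.

```python
from collections import defaultdict


def group_items_by_type(items: list[dict]) -> dict[str, list[dict]]:
    """Group Zotero items by their `data.itemType` field.

    Hypothetical helper; items without a type are filed under 'unknown'.
    """
    grouped = defaultdict(list)
    for item in items:
        item_type = item.get("data", {}).get("itemType", "unknown")
        grouped[item_type].append(item)
    return dict(grouped)


# Sample items mimicking the assumed response shape (keys are made up).
sample_items = [
    {"key": "A1", "data": {"itemType": "journalArticle", "title": "Paper one"}},
    {"key": "A2", "data": {"itemType": "note", "note": "Reading notes"}},
    {"key": "A3", "data": {"itemType": "attachment", "contentType": "application/pdf"}},
    {"key": "A4", "data": {"itemType": "webpage", "title": "Blog post"}},
]
grouped = group_items_by_type(sample_items)
```

Here `grouped["attachment"]` would hold the PDF attachments, while notes and web pages stay available under their own keys.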

Create corpus of interests from Zotero collection

What we call a “corpus of interest” is a Zotero collection that contains all the papers the user is currently focusing on in their research. This function creates a corpus of interest from a Zotero collection name.

This corpus of interest is used to create an “embedding of interest” that will be used to select the most relevant papers that are published every day.


 create_interests_corpus (collection_name:str)

Create a corpus of interests from all the documents in a Zotero collection. This corpus will be used to match related papers published daily on arXiv.
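This page does not state how the corpus is turned into a single “embedding of interest”. One common approach, shown here purely as an assumption, is to embed each document and average the vectors. The helper `mean_embedding` is hypothetical and the numbers are toy values.

```python
def mean_embedding(vectors: list[list[float]]) -> list[float]:
    """Average equally sized vectors component by component.

    Assumption for illustration only: the embedding of interest is the
    mean of the per-document embeddings.
    """
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(vector) != dim for vector in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


# Two toy document embeddings of dimension 3.
document_embeddings = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
interest_embedding = mean_embedding(document_embeddings)  # [2.0, 2.0, 1.0]
```

Averaging keeps the query vector in the same space as the paper embeddings, which is what a vector search needs. Other strategies, such as embedding the concatenated text, would work as well.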

Get personalized papers

Query the embedding space of the input category using the embedding of the corpus of interests. Returns the nb_proposals most relevant papers.


 get_personalized_papers (category:str, zotero_collection:str,

Given an arXiv category and a Zotero personalization collection, return a dictionary where the keys are the personalized arXiv IDs and the values are the distances to the personalization embedding.
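The shape of the result (paper ID mapped to distance, limited to `nb_proposals` entries) can be illustrated with a small in-memory search. This is a sketch under stated assumptions, not how ReadNext queries its vector store: cosine distance is chosen only for illustration, and `cosine_distance`, `closest_papers`, and the placeholder paper IDs are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity (0.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def closest_papers(
    interest: list[float],
    paper_embeddings: dict[str, list[float]],
    nb_proposals: int,
) -> dict[str, float]:
    """Return the nb_proposals closest papers as {paper_id: distance}."""
    distances = {
        paper_id: cosine_distance(interest, vector)
        for paper_id, vector in paper_embeddings.items()
    }
    ranked = sorted(distances.items(), key=lambda pair: pair[1])  # nearest first
    return dict(ranked[:nb_proposals])


# Toy 2-D embeddings with placeholder IDs (not real arXiv IDs).
papers = {
    "paper-a": [1.0, 0.0],
    "paper-b": [0.0, 1.0],
    "paper-c": [1.0, 1.0],
}
proposals = closest_papers([1.0, 0.0], papers, nb_proposals=2)
```

In this toy example the two closest papers are `paper-a` and `paper-c`, in that order; a smaller distance means a paper is closer to the user's interests.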

Get the summary of a PDF file

In addition, the user may want a summary of the paper other than the abstract written by the authors. In that case, the paper’s text is summarized by an external summarization service (currently Cohere), and the summary is returned. That summary is then added as an attachment to the paper’s item in Zotero.


 get_pdf_summary (pdf)

Check if a given paper is already in the collection of proposed papers

This is used to avoid duplicate papers in the Zotero collection. Without this check, every run of ReadNext would add papers again even if they had already been proposed in the past.


 check_already_in_zotero_proposals (title:str, proposals_collection:str)

Check if a paper is already in the proposals collection.
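A duplicate check of this kind can be sketched as a title comparison against the items already in the proposals collection. The matching rule used here (case-insensitive, with whitespace collapsed) is an assumption, as is the item shape (`data.title`, as returned by pyzotero). The helper `title_already_proposed` is hypothetical.

```python
def normalize_title(title: str) -> str:
    """Collapse whitespace and ignore case so near-identical titles match."""
    return " ".join(title.split()).casefold()


def title_already_proposed(title: str, proposal_items: list[dict]) -> bool:
    """Return True if an item with the same normalized title exists.

    Hypothetical helper; `proposal_items` is assumed to be the list of
    items of the proposals collection.
    """
    wanted = normalize_title(title)
    return any(
        normalize_title(item.get("data", {}).get("title", "")) == wanted
        for item in proposal_items
    )


# Sample items mimicking the assumed response shape (titles are made up).
existing_items = [
    {"key": "P1", "data": {"title": "A Study of  Vector Search"}},
    {"key": "P2", "data": {"itemType": "note"}},  # items without a title are skipped
]
is_duplicate = title_already_proposed("a study of vector search", existing_items)
is_new = title_already_proposed("Another Paper", existing_items)
```

Matching on titles is simple but approximate. Comparing a stable identifier such as the paper URL would be stricter, at the cost of requiring that field on every item.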

Save all personalized papers in Zotero

Save all the personalized papers in Zotero. By default, no artifacts are saved in Zotero. The reason is that users get 200 MB of free storage with their account, and that space fills up quickly if artifacts are saved day in, day out. However, a user who pays for more space will most likely want the artifacts saved in Zotero.


 save_personalized_papers_in_zotero (ids:dict, proposals_collection,

Get all personalized paper proposals and upload them to the proposals_collection Zotero collection.

If with_artifacts=True, all document artifacts (namely PDFs and summary documents) are uploaded to Zotero as well. This takes more space in the Zotero account and is slower to process.