Personalization
Imports
import arxiv
import chromadb
import cohere
import os
from nameparser import HumanName
from pyzotero import zotero
from readnext.arxiv_categories import exists
from readnext.embedding import pdf_to_text, get_embeddings, embedding_system
from rich import print
from rich.progress import Progress
Get a Zotero collection ID from its name
The Zotero API always expects a collection ID, but that ID is hard to find in the Zotero user interface. This utility function looks up the ID of a collection from its name.
get_collection_id_from_name
get_collection_id_from_name (collection_name:str)
Return the ID of a collection from its name. Return an empty string if no collection with that name exists. The comparison is case-insensitive.
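The lookup can be sketched as follows. This is a minimal illustration, not the package's actual implementation. It assumes the collections have already been fetched (for example with pyzotero's collections() method) and that each entry stores its name under data.name and its ID under key, as Zotero API responses do. The helper name find_collection_id and the sample data are hypothetical.

```python
def find_collection_id(collections, collection_name):
    """Return the key of the first collection whose name matches, else ''."""
    wanted = collection_name.casefold()
    for collection in collections:
        name = collection.get("data", {}).get("name", "")
        # casefold() gives a case-insensitive comparison.
        if name.casefold() == wanted:
            return collection.get("key", "")
    return ""


# Hypothetical sample shaped like a Zotero API response.
sample = [
    {"key": "ABCD1234", "data": {"name": "Readnext-Focus"}},
    {"key": "EFGH5678", "data": {"name": "Archive"}},
]
print(find_collection_id(sample, "readnext-focus"))  # ABCD1234
```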
Get all the items of a Zotero collection name
Gets all the items of a Zotero collection from its name. It reuses the function get_collection_id_from_name to resolve the collection ID. Items are broad: they are not just PDF papers, but can also be links to web pages, full-text notes, etc.
get_target_collection_items
get_target_collection_items (collection_name:str)
Given the name of a Zotero collection, return all the items from that collection.
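Because a collection mixes several item types, a caller usually needs to filter what comes back. The sketch below shows one way to do that under stated assumptions: FakeZotero is a hypothetical offline stand-in that mimics the collections() and collection_items() methods of a pyzotero client, and the helper names collection_items_by_name and pdf_attachments are illustrative, not part of ReadNext.

```python
class FakeZotero:
    """Hypothetical stand-in for a pyzotero client so the sketch runs offline."""

    def collections(self):
        return [{"key": "ABCD1234", "data": {"name": "Focus"}}]

    def collection_items(self, collection_id):
        return [
            {"key": "I1", "data": {"itemType": "preprint", "title": "A paper"}},
            {"key": "I2", "data": {"itemType": "note", "note": "Some notes"}},
            {"key": "I3", "data": {"itemType": "attachment",
                                   "contentType": "application/pdf"}},
        ]


def collection_items_by_name(client, collection_name):
    """Resolve the collection name to its key, then fetch its items."""
    wanted = collection_name.casefold()
    for collection in client.collections():
        if collection["data"]["name"].casefold() == wanted:
            return client.collection_items(collection["key"])
    return []  # assumption: an unknown collection yields no items


def pdf_attachments(items):
    """Keep only the items that are PDF attachments."""
    return [
        item for item in items
        if item["data"].get("itemType") == "attachment"
        and item["data"].get("contentType") == "application/pdf"
    ]


items = collection_items_by_name(FakeZotero(), "focus")
print(len(items), len(pdf_attachments(items)))
```

With a real client, the same two functions would receive a pyzotero Zotero instance instead of FakeZotero.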
Create corpus of interests from Zotero collection
What we call a “corpus of interest” is a Zotero collection that contains all the papers the user is currently focusing on in their research. This function creates a corpus of interest from a Zotero collection name.
This corpus of interest is used to create an “embedding of interest” that will be used to select the most relevant papers that are published every day.
create_interests_corpus
create_interests_corpus (collection_name:str)
Create a corpus of interests from all the documents existing in a Zotero collection. This corpus will be used to match related daily papers published on ArXiv.
Get personalized papers
Query the embedding space of the input category using the embedding of the corpus of interests. Returns the nb_proposals most relevant papers.
get_personalized_papers
get_personalized_papers (category:str, zotero_collection:str, nb_proposals=10)
Given an ArXiv category and a Zotero personalization collection, return a dictionary where the keys are the personalized ArXiv IDs and the values are the distances to the personalization embedding.
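The returned dictionary can be illustrated with a small sketch. It assumes the nearest-neighbour search is done with a chromadb collection, whose query(query_embeddings=[...], n_results=nb_proposals) call returns parallel lists of IDs and distances, one inner list per query embedding. The helper name proposals_from_query and the sample IDs and distances are placeholders, not real results.

```python
def proposals_from_query(results):
    """Map each returned ID to its distance for a single-query result."""
    # One query embedding was sent, so only the first inner list is used.
    ids = results["ids"][0]
    distances = results["distances"][0]
    return dict(zip(ids, distances))


# Placeholder data shaped like a chromadb query result for one embedding.
sample_results = {
    "ids": [["0000.00001", "0000.00002"]],
    "distances": [[0.12, 0.34]],
}
print(proposals_from_query(sample_results))
```

A smaller distance means the paper sits closer to the embedding of interest, so the first entries are the strongest proposals.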
Get the summary of a PDF file
In addition, the user may want a summary of the paper other than the abstract written by the authors. In that case, the paper’s text is sent to an external summarization service (currently Cohere), which returns the summary. That summary is then added as an attachment to the paper’s item in Zotero.
get_pdf_summary
get_pdf_summary (pdf)
Check if a given paper is already in the collection of proposed papers
This is used to avoid duplicated papers in the Zotero collection. Otherwise, every time someone runs ReadNext, papers that were already proposed in the past would be added again.
check_already_in_zotero_proposals
check_already_in_zotero_proposals (title:str, proposals_collection:str)
Check if a paper is already in the proposals collection.
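One way to implement such a check is to compare the candidate title with the titles of the items already in the proposals collection. The sketch below is an assumption about how that could work, not the package's code: the source does not say how titles are compared, so the case and whitespace normalization here is an illustrative choice, and the helper name title_already_proposed is hypothetical.

```python
def normalize_title(title):
    """Collapse whitespace and ignore case (an illustrative choice)."""
    return " ".join(title.split()).casefold()


def title_already_proposed(title, items):
    """Return True if any item in the collection carries the same title."""
    wanted = normalize_title(title)
    return any(
        normalize_title(item.get("data", {}).get("title", "")) == wanted
        for item in items
    )


# Hypothetical items shaped like Zotero API responses.
existing = [
    {"key": "I1", "data": {"title": "A Study of  Embeddings"}},
    {"key": "I2", "data": {"itemType": "note"}},  # notes have no title
]
print(title_already_proposed("a study of embeddings", existing))  # True
print(title_already_proposed("Another paper", existing))          # False
```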
Save all personalized papers in Zotero
Save all the personalized papers in Zotero. By default, no artifacts are saved in Zotero. The reason is that users get 200 MB of free storage with their account, and that space fills up quickly if artifacts are saved day in, day out. However, a user who pays for more space will most likely want the artifacts saved in Zotero.
save_personalized_papers_in_zotero
save_personalized_papers_in_zotero (ids:dict, proposals_collection, with_artifacts:bool)
Get all personalized paper proposals and upload them to the proposals_collection Zotero collection.
If with_artifacts=True, all document artifacts are uploaded to Zotero as well (namely PDFs and summary documents), but this takes more space in the Zotero account and is slower to process.
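The overall flow can be sketched as a loop that skips known titles, creates an item for each new proposal, and attaches files only when requested. This is a simplified illustration under stated assumptions: fetch_metadata, create_item, and attach_files are hypothetical callables that a real implementation might back with the arxiv package and with pyzotero's create_items and attachment_simple methods, and all sample data is made up.

```python
def save_proposals(ids, existing_titles, fetch_metadata, create_item,
                   attach_files, with_artifacts=False):
    """Create one item per new proposal and return the IDs that were saved."""
    seen = {title.casefold() for title in existing_titles}
    saved = []
    for arxiv_id in ids:
        metadata = fetch_metadata(arxiv_id)
        title_key = metadata["title"].casefold()
        if title_key in seen:
            continue  # already proposed in a previous run
        item_key = create_item(metadata)
        if with_artifacts:
            # Artifacts (PDF, summary) use storage, so they are opt-in.
            attach_files(item_key, metadata.get("files", []))
        seen.add(title_key)
        saved.append(arxiv_id)
    return saved


# Offline stand-ins so the sketch runs without Zotero or arXiv access.
catalog = {
    "id-1": {"title": "Paper One", "files": ["one.pdf"]},
    "id-2": {"title": "Paper Two", "files": ["two.pdf"]},
}
created, attached = [], []


def fake_create(metadata):
    created.append(metadata["title"])
    return f"KEY{len(created)}"


def fake_attach(item_key, files):
    attached.append((item_key, files))


saved = save_proposals(
    ids={"id-1": 0.12, "id-2": 0.34},
    existing_titles=["paper one"],
    fetch_metadata=catalog.__getitem__,
    create_item=fake_create,
    attach_files=fake_attach,
    with_artifacts=True,
)
print(saved, attached)
```

Passing the side-effecting steps in as callables keeps the duplicate check and the with_artifacts switch easy to test without network access.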