Personalization

arXiv papers are personalized with a vector search that compares the latest papers appearing in an arXiv category against the papers the user is currently focusing on in their research. Those papers of interest are stored in a Zotero collection.

Imports

import arxiv
import chromadb
import cohere
import os
from nameparser import HumanName
from pyzotero import zotero
from readnext.arxiv_categories import exists
from readnext.embedding import pdf_to_text, get_embeddings, embedding_system
from rich import print
from rich.progress import Progress

Get a Zotero collection ID from its name

The Zotero API always expects a collection ID, but that ID is very hard to find in the Zotero user interface. This utility function looks up the ID of a collection from its name.

get_collection_id_from_name

 get_collection_id_from_name (collection_name:str)

Return the ID of a collection from its name. Return an empty string if no collection with that name exists. The comparison is case insensitive.
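The lookup described above can be sketched without a network call. This is a minimal illustration, not ReadNext’s actual implementation: `find_collection_id` and the sample data are hypothetical, and the code assumes each collection is a dictionary with a top-level `key` and a `data.name` field, which is the shape the Zotero API uses for collection objects.

```python
def find_collection_id(collections: list[dict], collection_name: str) -> str:
    """Return the key of the first collection whose name matches, else ''.

    `collections` is assumed to be a list of Zotero collection objects,
    each with a top-level "key" and a "data" dict holding the "name".
    """
    wanted = collection_name.lower()
    for collection in collections:
        # Case-insensitive comparison, as the docstring above specifies.
        if collection["data"]["name"].lower() == wanted:
            return collection["key"]
    # No collection with that name: signal it with an empty string.
    return ""


# Hand-written sample data (hypothetical keys and names).
sample_collections = [
    {"key": "AAAA1111", "data": {"name": "Readnext-Focus"}},
    {"key": "BBBB2222", "data": {"name": "Proposals"}},
]

print(find_collection_id(sample_collections, "readnext-focus"))
print(repr(find_collection_id(sample_collections, "missing")))
```

Returning an empty string rather than raising keeps the caller’s check simple (`if not collection_id: ...`), at the cost of making a typo in the collection name easy to overlook.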

Get all the items of a Zotero collection name

Gets all the items of a Zotero collection from its name. It reuses get_collection_id_from_name to resolve the collection ID. The notion of an item is broad: items are not only PDF papers, they can also be links to web pages, full-text notes, etc.

get_target_collection_items

 get_target_collection_items (collection_name:str)

Given the name of a Zotero collection, return all the items from that collection.
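Because a collection mixes several item types, a caller that only wants the PDF papers has to filter the result. The sketch below shows one way to do that. It is an assumption, not a description of what ReadNext does: `keep_pdf_attachments` is a hypothetical helper, and the sample items only mimic the `data.itemType` and `data.contentType` fields of Zotero item objects.

```python
def keep_pdf_attachments(items: list[dict]) -> list[dict]:
    """Keep only the items that are attachments with a PDF content type."""
    return [
        item
        for item in items
        if item["data"].get("itemType") == "attachment"
        and item["data"].get("contentType") == "application/pdf"
    ]


# Hand-written sample data (hypothetical keys and titles).
sample_items = [
    {"key": "ITEM0001", "data": {"itemType": "journalArticle", "title": "A paper"}},
    {"key": "ITEM0002", "data": {"itemType": "attachment",
                                 "contentType": "application/pdf",
                                 "title": "paper.pdf"}},
    {"key": "ITEM0003", "data": {"itemType": "note", "note": "Some thoughts"}},
    {"key": "ITEM0004", "data": {"itemType": "attachment",
                                 "contentType": "text/html",
                                 "title": "Snapshot"}},
]

pdf_items = keep_pdf_attachments(sample_items)
print([item["key"] for item in pdf_items])
```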

Create corpus of interests from Zotero collection

What we call a “corpus of interest” is a Zotero collection that contains all the papers the user is currently focusing on in their research. This function creates a corpus of interest from a Zotero collection name.

This corpus of interest is used to create an “embedding of interest” that will be used to select the most relevant papers that are published every day.

create_interests_corpus

 create_interests_corpus (collection_name:str)

Create a corpus of interests from all the documents in a Zotero collection. This corpus is used to match related daily papers published on arXiv.
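This page does not say how the corpus is reduced to a single “embedding of interest”. One common approach, shown here purely as an assumption, is to embed each document separately and average the vectors component by component. `mean_embedding` is a hypothetical helper and the vectors are toy values, not real embeddings.

```python
def mean_embedding(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors component by component."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dimension = len(vectors[0])
    if any(len(vector) != dimension for vector in vectors):
        raise ValueError("all vectors must have the same dimension")
    # zip(*vectors) walks the vectors column by column.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Two toy 2-dimensional "document embeddings".
interest = mean_embedding([[1.0, 0.0], [0.0, 1.0]])
print(interest)
```

An alternative with the same goal is to concatenate the text of every document and embed the result once. Averaging keeps each document’s contribution equal regardless of its length.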

Get personalized papers

Query the embedding space of the input category using the embedding of the corpus of interests. Returns the nb_proposals most relevant papers.

get_personalized_papers

 get_personalized_papers (category:str, zotero_collection:str,
                          nb_proposals=10)

Given an arXiv category and a Zotero personalization collection, return a dictionary where the keys are the arXiv IDs of the personalized papers and the values are their distances to the personalization embedding.
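To make the shape of the return value concrete, here is a small in-memory sketch of a nearest-neighbour query. It stands in for the vector store and is not ReadNext’s code: `cosine_distance` and `nearest_papers` are hypothetical helpers, cosine distance is an assumed metric (the real store may be configured differently), and the paper IDs are placeholders.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest_papers(interest: list[float],
                   papers: dict[str, list[float]],
                   nb_proposals: int = 10) -> dict[str, float]:
    """Map the nb_proposals closest paper IDs to their distance."""
    distances = {
        paper_id: cosine_distance(interest, vector)
        for paper_id, vector in papers.items()
    }
    # Smallest distance first; dicts preserve insertion order.
    ranked = sorted(distances.items(), key=lambda pair: pair[1])
    return dict(ranked[:nb_proposals])


# Toy embeddings keyed by placeholder IDs.
category_papers = {
    "paper-a": [1.0, 0.0],
    "paper-b": [0.0, 1.0],
    "paper-c": [1.0, 1.0],
}
top = nearest_papers([1.0, 0.0], category_papers, nb_proposals=2)
print(list(top))
```

A smaller distance means a closer match, so the first key in the returned dictionary is the strongest proposal.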

Get the summary of a PDF file

In addition, the user may want a summary of the paper other than the abstract written by the authors. In that case, the paper’s text is summarized by an external summarization service (currently Cohere) and the summary is returned. That summary is then added as an attachment to the paper’s item in Zotero.

get_pdf_summary

 get_pdf_summary (pdf)

Check if a given paper is already in the collection of proposed papers

This is used to avoid duplicate papers in the Zotero collection. Otherwise, every time someone runs ReadNext, papers that were already proposed in the past would be added again.

check_already_in_zotero_proposals

 check_already_in_zotero_proposals (title:str, proposals_collection:str)

Check if a paper is already in the proposals collection.
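A title-based duplicate check can be sketched as follows. This is an illustration under stated assumptions rather than the library’s implementation: `title_already_proposed` is a hypothetical helper, the items mimic the `data.title` field of Zotero item objects, and ignoring case and surrounding whitespace is a choice made here, not something this page specifies.

```python
def title_already_proposed(title: str, proposal_items: list[dict]) -> bool:
    """Return True if an item in the proposals has the same title."""
    wanted = title.strip().casefold()
    return any(
        # Items without a title (for example notes) never match.
        item["data"].get("title", "").strip().casefold() == wanted
        for item in proposal_items
    )


# Hand-written sample data (hypothetical titles).
existing_proposals = [
    {"data": {"itemType": "preprint", "title": "Attention Is All You Need"}},
    {"data": {"itemType": "note", "note": "no title on this one"}},
]

print(title_already_proposed("attention is all you need", existing_proposals))
print(title_already_proposed("Some Other Paper", existing_proposals))
```

Matching on the title alone can miss a revised version of a paper whose title changed, so a stricter check could compare arXiv IDs instead when they are available.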

Save all personalized papers in Zotero

Save all the personalized papers in Zotero. By default, no artifacts are saved in Zotero. The reason is that users get 200 MB of free storage with their account, and that space fills up quickly if artifacts are saved day in, day out. However, a user who pays for more storage most likely wants the artifacts saved in Zotero.

save_personalized_papers_in_zotero

 save_personalized_papers_in_zotero (ids:dict, proposals_collection,
                                     with_artifacts:bool)

Get all the personalized paper proposals and upload them to the proposals_collection Zotero collection.

If with_artifacts=True, all document artifacts (namely PDFs and summary documents) are uploaded to Zotero as well, but this takes more space in the Zotero account and is slower to process.