arXiv Synchronization

Synchronize the latest papers from arXiv with the PDF files already on the local file system.

Load the default .dotenv file so that the tests can run when this notebook is executed. You can remove the '../.dotenv' argument if you have already configured a .env file locally.

from dotenv import load_dotenv
import os

load_dotenv('../.dotenv')

Imports

import concurrent.futures
import feedparser
import os
import re
import urllib.request
from pypdf import PdfReader
from readnext.arxiv_categories import exists
from rich import print
from rich.progress import Progress

Get daily papers from arXiv

The first step is to get all the new papers from arXiv. This is done with arXiv's daily RSS feed, which is available for any top-level category or sub-category. We parse the RSS feed to extract every new paper listed for that category.

get_arxiv_pdfs_url

 get_arxiv_pdfs_url (category:str)

Get all the papers referenced in the daily arXiv RSS feed for the input ‘category’.
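The body of get_arxiv_pdfs_url is not shown here. As a rough illustration of the feed-parsing step, the sketch below assumes that each feed entry links to an abstract page of the form https://arxiv.org/abs/<id> and that the feed is served at https://rss.arxiv.org/rss/<category>. Both are assumptions, and the names abs_to_pdf_url and fetch_pdf_urls are hypothetical.

```python
def abs_to_pdf_url(link: str) -> str:
    """Turn an arXiv abstract link into the matching PDF link (hypothetical helper)."""
    # Assumption: entry links look like https://arxiv.org/abs/<paper-id>.
    return link.replace("/abs/", "/pdf/", 1)


def fetch_pdf_urls(category: str,
                   feed_url: str = "https://rss.arxiv.org/rss/{category}") -> list[str]:
    """Parse the daily feed for `category` and return one PDF URL per entry."""
    # Third-party package, imported lazily so abs_to_pdf_url works without it.
    import feedparser

    # Assumption: the default feed_url template matches the feed the notebook reads.
    feed = feedparser.parse(feed_url.format(category=category))
    return [abs_to_pdf_url(entry.link) for entry in feed.entries]
```

Keeping the URL rewrite in its own function makes it testable without any network access; only fetch_pdf_urls touches the feed.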

Get Docs Path

The synchronization process involves downloading all the new PDF files for a category to the local file system. The DOCS_PATH environment variable specifies where the documents will be saved.

The new PDF files will be saved in the DOCS_PATH/[category]/ folder. The get_docs_path function returns the path string to the folder for a given category.

get_docs_path

 get_docs_path (category:str)

Generate the proper docs path from a category ID
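The path rule described above (DOCS_PATH, then the category, then a trailing slash) can be sketched as follows. This is a minimal illustration rather than the package's implementation: the name docs_path_for is hypothetical, and raising KeyError when DOCS_PATH is unset is a choice made only for this sketch.

```python
import os


def docs_path_for(category: str) -> str:
    """Build DOCS_PATH/<category>/ with exactly one slash between the parts."""
    # Assumption: DOCS_PATH must be set; os.environ[...] raises KeyError otherwise.
    base = os.environ["DOCS_PATH"].rstrip("/")
    return f"{base}/{category}/"
```

Stripping the trailing slash first means 'docs' and 'docs/' produce the same result.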

Tests

assert get_docs_path("cs") == os.environ.get('DOCS_PATH').rstrip('/') + '/' + "cs" + '/'
assert get_docs_path("cs.AI") == os.environ.get('DOCS_PATH').rstrip('/') + '/' + "cs.AI" + '/'
assert get_docs_path("cs.FOO") != os.environ.get('DOCS_PATH').rstrip('/') + '/' + "cs.AI" + '/'

Delete broken PDF files

On rare occasions, a downloaded PDF file may be broken. The current way to detect and fix this is to try to open every downloaded PDF with PdfReader. If an exception is thrown, we simply delete the file and move on.

In the future, this should be replaced with a more robust failover mechanism.

The side effect of running delete_broken_pdf is that it may delete broken PDF files from the file system for a category.

delete_broken_pdf

 delete_broken_pdf (category:str)

Detect and delete broken PDF files. TODO: the next iteration needs a better failover, with a retry when a downloaded PDF file is broken.
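The detect-and-delete loop can be sketched as below. To keep the example free of third-party dependencies, the default validity check only looks for the %PDF- header bytes; this is a deliberate stand-in for the PdfReader check described above, not an equivalent of it. The names has_pdf_header and delete_invalid_pdfs are hypothetical.

```python
import os
from typing import Callable


def has_pdf_header(path: str) -> bool:
    """Cheap stand-in validity check: a PDF file starts with the bytes %PDF-."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def delete_invalid_pdfs(folder: str,
                        is_valid: Callable[[str], bool] = has_pdf_header) -> list[str]:
    """Delete every *.pdf in `folder` that fails `is_valid`; return the deleted names."""
    deleted = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue  # leave non-PDF files alone
        path = os.path.join(folder, name)
        try:
            ok = is_valid(path)
        except Exception:
            # A validator that raises on a file treats that file as broken.
            ok = False
        if not ok:
            os.remove(path)
            deleted.append(name)
    return deleted
```

Passing the validator as a parameter means a stricter check, such as one that opens the file with PdfReader and returns True when no exception is raised, could be swapped in without changing the loop.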

Tests

from shutil import rmtree
from os.path import split

from unittest.mock import patch

with patch.dict('os.environ', {'DOCS_PATH': 'docs/'}):
    # DOCS_PATH is patched to 'docs/' for the duration of this block

    # count the current number of PDF files in docs_path
    docs_path = get_docs_path("cs")
    os.makedirs(docs_path, exist_ok=True)
    pdf_files = os.listdir(docs_path)
    pdf_files_count_before = len(pdf_files)

    # create an empty file at docs_path to produce an invalid PDF file
    docs_path = get_docs_path("cs")
    open(docs_path + "foo.pdf", 'a').close()

    # run delete_broken_pdf
    delete_broken_pdf("cs")

    # count the number of PDF files in docs_path
    pdf_files = os.listdir(docs_path)
    pdf_files_count_after = len(pdf_files)

    assert pdf_files_count_after == pdf_files_count_before

    # cleanup
    rmtree(split(split(docs_path)[0])[0])

Synchronize with arXiv

The sync_arxiv function is the main entry point: it synchronizes the local file system with arXiv for a given category. It downloads all the new PDF files from arXiv, three at a time, and deletes any broken PDF files.

sync_arxiv

 sync_arxiv (category:str)

Synchronize all the latest arXiv papers for a category. PDF files are downloaded from arXiv three at a time and saved in the DOCS_PATH folder under the category’s sub-folder.
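The three-at-a-time download can be sketched with a thread pool, as below. This is an illustration under stated assumptions, not the body of sync_arxiv: the name download_all is hypothetical, the file name is assumed to be the last URL segment plus ".pdf", and skipping files that already exist on disk is an assumption about what "synchronize" means here.

```python
import concurrent.futures
import os
import urllib.request
from typing import Callable, Iterable


def download_all(urls: Iterable[str], dest_dir: str,
                 fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
                 max_workers: int = 3) -> list[str]:
    """Download each URL into dest_dir, running at most `max_workers` at a time."""
    os.makedirs(dest_dir, exist_ok=True)

    def download_one(url: str) -> str:
        # Assumption: the last path segment is the paper ID, used as the file name.
        name = url.rstrip("/").rsplit("/", 1)[-1] + ".pdf"
        target = os.path.join(dest_dir, name)
        if not os.path.exists(target):  # skip papers that are already on disk
            fetch(url, target)
        return target

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and re-raises a download error on iteration.
        return list(pool.map(download_one, urls))
```

Threads suit this workload because each download spends most of its time waiting on the network. The fetch parameter exists so the function can be exercised with a fake downloader in tests.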