arXiv Synchronization
Load the default .dotenv file so the tests run when this notebook is executed. You can remove '../.dotenv' if you have already configured your .env file locally.
from dotenv import load_dotenv
import os

load_dotenv('../.dotenv')
Imports
import concurrent.futures
import feedparser
import os
import re
import urllib.request
from pypdf import PdfReader
from readnext.arxiv_categories import exists
from rich import print
from rich.progress import Progress
Get daily papers from arXiv
The first step is to get all the new papers from arXiv. This is done by using their daily RSS feed for any given top-level category or sub-category. We parse the RSS feed to extract all the new papers from the archive.
get_arxiv_pdfs_url
get_arxiv_pdfs_url (category:str)
Get all the papers referenced in arXiv's daily RSS feed for the input ‘category’.
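For reference, here is a minimal sketch of how such a function can be built with feedparser. The feed URL pattern and the abs-to-pdf link rewrite are illustrative assumptions, not the actual implementation, and the helper name is hypothetical.

import feedparser

def get_arxiv_pdfs_url_sketch(category: str) -> list[str]:
    """Return the PDF URLs of the papers listed in the daily RSS feed for `category`."""
    # Assumed feed URL pattern for a given category, e.g. 'cs' or 'cs.AI'.
    feed = feedparser.parse(f"https://rss.arxiv.org/rss/{category}")

    urls = []
    for entry in feed.entries:
        # Each entry links to the abstract page; swapping 'abs' for 'pdf'
        # points to the PDF rendering of the paper (assumed rewrite).
        urls.append(entry.link.replace("/abs/", "/pdf/"))

    return urls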
Get Docs Path
The synchronization process involves downloading all the new PDF files for a category to the local file system. The DOCS_PATH
environment variable specifies where the documents will be saved.
The new PDF files are saved in the DOCS_PATH/[category]/
folder. The get_docs_path
function returns the path string to the folder for a given category.
get_docs_path
get_docs_path (category:str)
Generate the proper docs path from a category ID
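As a reference point, a minimal sketch that satisfies the tests below could look like the following; the real implementation may differ, but it has to return DOCS_PATH (with any trailing slash stripped) followed by '/<category>/'. The helper name is hypothetical.

import os

def get_docs_path_sketch(category: str) -> str:
    """Build the folder path where the PDFs for `category` are stored."""
    # DOCS_PATH is normalized so the result always has exactly one
    # slash between the base path and the category sub-folder.
    return os.environ.get('DOCS_PATH').rstrip('/') + '/' + category + '/'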
Tests
assert get_docs_path("cs") == os.environ.get('DOCS_PATH').rstrip('/') + '/' + "cs" + '/'
assert get_docs_path("cs.AI") == os.environ.get('DOCS_PATH').rstrip('/') + '/' + "cs.AI" + '/'
assert get_docs_path("cs.FOO") != os.environ.get('DOCS_PATH').rstrip('/') + '/' + "cs.AI" + '/'
Delete broken PDF files
On rare occasions, a downloaded PDF file may be broken. The current process to detect and fix this issue is to try to open every downloaded PDF with PdfReader
. If an exception is thrown, we simply delete the file and move on.
In the future, this mechanism will have to be replaced with a better failover mechanism.
The side effect of running delete_broken_pdf
is that it may delete broken PDF files from the file system for a category.
delete_broken_pdf
delete_broken_pdf (category:str)
Detect and delete broken PDF files. TODO: the next iteration needs a better failover with retry when downloaded PDF files are broken.
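A minimal sketch of that detect-and-delete strategy is shown below; the helper name and the broad exception handling are illustrative assumptions.

import os
from pypdf import PdfReader

def delete_broken_pdf_sketch(category: str) -> None:
    """Delete any PDF in the category folder that PdfReader cannot open."""
    docs_path = get_docs_path(category)

    for file_name in os.listdir(docs_path):
        if not file_name.endswith(".pdf"):
            continue
        file_path = docs_path + file_name
        try:
            PdfReader(file_path)  # raises if the file is truncated or corrupt
        except Exception:
            os.remove(file_path)  # broken download: delete and move on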
Tests
from shutil import rmtree
from os.path import split
from unittest.mock import patch
with patch.dict('os.environ', {'DOCS_PATH': 'docs/'}):
    # the code in this block uses the patched os.environ.get('DOCS_PATH')

    # count the current number of PDF files in docs_path
    docs_path = get_docs_path("cs")
    os.makedirs(docs_path, exist_ok=True)
    pdf_files = os.listdir(docs_path)
    pdf_files_count_before = len(pdf_files)

    # create an empty PDF file at docs_path to produce an invalid PDF file
    docs_path = get_docs_path("cs")
    open(docs_path + "foo.pdf", 'a').close()

    # run delete_broken_pdf
    delete_broken_pdf("cs")

    # count the number of PDF files in docs_path
    pdf_files = os.listdir(docs_path)
    pdf_files_count_after = len(pdf_files)

    assert pdf_files_count_after == pdf_files_count_before

    # cleanup
    rmtree(split(split(docs_path)[0])[0])
Synchronize with arXiv
The sync_arxiv
function is the main function that synchronizes the local file system with arXiv. It downloads all the new PDF files from arXiv, three at a time, and deletes any broken PDF files.
sync_arxiv
sync_arxiv (category:str)
Synchronize all the latest arXiv papers for category
. Concurrently downloads three PDF files at a time from arXiv. The PDF files are saved in the DOCS_PATH
folder, under the category's sub-folder.
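A minimal sketch of this flow is shown below, assuming the PDF URLs come from get_arxiv_pdfs_url and broken files are pruned with delete_broken_pdf. The file-naming scheme is an assumption, the helper name is hypothetical, and the rich progress bar used elsewhere in the module is omitted for brevity.

import concurrent.futures
import os
import urllib.request

def sync_arxiv_sketch(category: str) -> None:
    """Download the latest papers for `category`, three at a time, then prune broken files."""
    docs_path = get_docs_path(category)
    os.makedirs(docs_path, exist_ok=True)

    def download(url: str) -> None:
        # Assumed naming scheme: last URL segment plus a .pdf extension.
        target = docs_path + url.rsplit('/', 1)[-1] + ".pdf"
        if not os.path.exists(target):  # skip papers already on disk
            urllib.request.urlretrieve(url, target)

    # Three concurrent downloads, as described above; list() forces the
    # iterator so any download error surfaces here.
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        list(executor.map(download, get_arxiv_pdfs_url(category)))

    # Remove any files that did not download cleanly.
    delete_broken_pdf(category)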