Feeds

Series of utility tools to parse and manage Small Web feeds (RSS and Atom feeds)

Imports

Feeds DB

We will want to save different kinds of information related to the feeds we process. We will save that information locally in a lightweight SQLite database. Here are the kinds of things we will want to save:

  • feed’s ID (primary key)
  • its language
  • number of entries
  • last time we downloaded it
  • type of feed
  • feed’s URL
  • feed’s title
  • feed’s description
  • feed’s author

Connect to the Database


connect_feeds_db

 connect_feeds_db ()

Connect to the feeds database
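
To give an idea of its shape, here is a minimal sketch of such a connection function; the database file name feeds.db is an assumption:

import sqlite3

def connect_feeds_db() -> sqlite3.Connection:
    # 'feeds.db' is an assumed local path for the SQLite database file
    return sqlite3.connect('feeds.db')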

Create DB


create_articles_db

 create_articles_db (conn:sqlite3.Connection)

Create the articles database


create_feeds_db

 create_feeds_db (conn:sqlite3.Connection)

Create the feeds database
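
As a sketch of what these creation steps could look like, assuming the feeds table follows the fields listed above and an articles table for the parsed entries (all column names are illustrative):

import sqlite3

def create_feeds_db(conn: sqlite3.Connection):
    # Columns mirror the fields listed above; names are illustrative
    conn.execute("""CREATE TABLE IF NOT EXISTS feeds (
        id TEXT PRIMARY KEY,     -- feed ID built from its URL
        url TEXT,                -- feed URL
        title TEXT,              -- feed title
        description TEXT,        -- feed description
        author TEXT,             -- feed author
        lang TEXT,               -- primary language of the feed
        feed_type TEXT,          -- type of feed (RSS or Atom)
        entries_count INTEGER,   -- number of entries
        last_downloaded TEXT     -- last time we downloaded it
    )""")
    conn.commit()

def create_articles_db(conn: sqlite3.Connection):
    # Hypothetical articles table referencing the parent feed
    conn.execute("""CREATE TABLE IF NOT EXISTS articles (
        url TEXT PRIMARY KEY,
        feed_id TEXT REFERENCES feeds(id),
        title TEXT,
        content TEXT,
        lang TEXT
    )""")
    conn.commit()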

Sync Feeds

The local feeds need to be synchronized with the Small Web index. Most of them will be new, but it is possible that some of the previous feeds get removed from the feeds index. In that case, we have to remove the feed from the local system and the SQL database. The process is as follows:

  1. check if some of the feeds got removed from the index
    1. if so, remove the feed from the local system
    2. remove the feed from the SQL database
  2. if the feed is not already on the file system, create a unique folder name for each of the new feeds
  3. create a DDMMYYYY folder under the ID of the feed in the FEEDS_PATH folder
  4. download the feed’s file in that folder (see the path sketch after the folder layout below)

The local folder and file structure should be:

  • FEEDS_PATH
    • feed_unique_folder
      • DDMMYYYY
        • feed.xml
      • DDMMYYYY
        • feed.xml
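
For reference, a minimal sketch of how that path could be built for a feed on a given day; the FEEDS_PATH value and the helper name feed_file_path are illustrative, and get_feed_id_from_url is defined in the Feed ID section below:

import os
from datetime import datetime

FEEDS_PATH = 'feeds'  # assumed value; the module defines its own

def feed_file_path(url: str) -> str:
    # FEEDS_PATH/<feed_unique_folder>/<DDMMYYYY>/feed.xml
    folder = os.path.join(FEEDS_PATH, get_feed_id_from_url(url),
                          datetime.now().strftime('%d%m%Y'))
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, 'feed.xml')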

Get Feeds

The first step is to get the list of feeds for the Small Web. That list is available from the Kagi Small Web index. Then, for each feed in the list, we will download and save it locally in the FEEDS_PATH folder.


get_small_web_feeds

 get_small_web_feeds ()

Get smallweb feeds from KagiSearch’s github repository

Tests

assert len(get_small_web_feeds()) > 0
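
A plausible sketch of this function, assuming the index is the raw smallweb.txt file on the main branch of the kagisearch/smallweb repository (the exact URL is an assumption):

import urllib.request

def get_small_web_feeds() -> list:
    # Assumed location of the index file; one feed URL per line
    url = 'https://raw.githubusercontent.com/kagisearch/smallweb/main/smallweb.txt'
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode('utf-8').splitlines()
    return [line.strip() for line in lines if line.strip()]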

Feed ID

We build the unique ID of a feed from its URL. We use the following steps:

  1. For every character, if it is not an alphanumeric character, we replace it with a -

This method is used to make sure we can use the ID to create files and directories on the local file system, and as a primary key in the DB, while keeping the ID readable. It could produce duplicate IDs if a non-alphanumeric character is the only differentiator between two URLs, in which case both characters will be replaced by a - and the IDs will clash. But this is unlikely in the short term and is good enough for now.
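
The transformation itself fits in one line; here is a sketch consistent with the test below:

def get_feed_id_from_url(url: str) -> str:
    # Replace every non-alphanumeric character with a '-'
    return ''.join(c if c.isalnum() else '-' for c in url)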


get_feed_id_from_url

 get_feed_id_from_url (url:str)

Get the feed id from a feed url

Tests

assert get_feed_id_from_url('https://example.com/feed.xml') == 'https---example-com-feed-xml'

Process Removed Feeds From Index

It is possible that previously downloaded feeds get removed from the Small Web index. In this case, we get the latest version of the Small Web index, detect which ones were removed, and remove them from the file system and the SQL database.


gen_ids_index

 gen_ids_index (index:list)

Return a list of IDs of the feeds in the index

Tests

index = ['https://example.com/feed.xml']
index2 = gen_ids_index(index)
assert index2 == ['https---example-com-feed-xml']

process_removed_feed_from_index

 process_removed_feed_from_index (index:list)

Process all the feeds that got removed from the SmallWeb index
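
A sketch of how that removal step could look, assuming per-feed folders under FEEDS_PATH and a feeds table keyed by the feed ID (as in the schema sketch above):

import os, shutil

def process_removed_feed_from_index(index: list):
    index_ids = set(gen_ids_index(index))
    local_ids = set(os.listdir(FEEDS_PATH))  # FEEDS_PATH assumed, as above
    conn = connect_feeds_db()
    for feed_id in local_ids - index_ids:
        shutil.rmtree(os.path.join(FEEDS_PATH, feed_id))            # local file system
        conn.execute("DELETE FROM feeds WHERE id = ?", (feed_id,))  # SQL database
    conn.commit()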

Download a Feed


download_feed

 download_feed (url:str)

Download a feed from a given url
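
A minimal sketch, reusing the feed_file_path helper sketched earlier; the User-Agent header and timeout are assumptions:

import urllib.request

def download_feed(url: str):
    # Fetch the feed and save it under FEEDS_PATH/<id>/<DDMMYYYY>/feed.xml
    req = urllib.request.Request(url, headers={'User-Agent': 'feeds-sync'})
    with urllib.request.urlopen(req, timeout=30) as resp:
        content = resp.read()
    with open(feed_file_path(url), 'wb') as f:
        f.write(content)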

Sync all the feeds from the index


sync_feeds

 sync_feeds ()

Sync all feeds from smallweb

Language Detection

We use the library langdetect to detect the language of a feed. We use the detect method of the library. We tried other avenues, like Hugging Face models, but neither the language-detection results nor the processing performance justified the additional complexity for now (results were worse and much slower). You can check the file 01_language_detection.ipynb for more details.
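
Based on the tests below, a minimal sketch of such a detector; the HTML stripping and the exact minimum-length threshold are assumptions:

import re
from langdetect import detect, LangDetectException

MIN_TEXT_LENGTH = 128  # assumed threshold; shorter texts are too ambiguous

def detect_language(text: str) -> str:
    clean = re.sub(r'<[^>]+>', ' ', text).strip()  # drop HTML tags
    if len(clean) < MIN_TEXT_LENGTH:
        return ''
    try:
        return detect(clean)
    except LangDetectException:
        return ''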


detect_language

 detect_language (text:str)

Detect the language of a given text

Tests

assert detect_language('This is a test') == ''
assert detect_language('This is a test' * 128) == 'en'

assert detect_language('Ceci est un test') == ''
assert detect_language('Ceci est un test' * 128) == 'fr'

assert detect_language('これはテストです') == ''
assert detect_language('これはテストです' * 128) == 'ja'

assert detect_language('이것은 테스트입니다') == ''
assert detect_language('이것은 테스트입니다' * 128) == 'ko'

assert detect_language('<br /><br /><br /><br /><br /><br /><br /><br /><br />This is a test') == ''
assert detect_language('<br /><br /><br /><br /><br /><br /><br /><br /><br />This is a test' * 128) == 'en'

Parse a Local Feed

For any given feed URL, let’s parse the local feed we downloaded for it and return an internal dictionary that represents it, whether it is an RSS or an Atom feed. The internal representation of a Small Web article is a namedtuple.


parse_feed

 parse_feed (url:str, feed_path:str=None)

Parse a feed from a given path and url
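
feedparser handles both RSS and Atom feeds, so a sketch could look like the following; the Article namedtuple fields are illustrative, not the module’s actual ones:

from collections import namedtuple
import feedparser

Article = namedtuple('Article', ['url', 'title', 'content'])  # illustrative fields

def parse_feed(url: str, feed_path: str = None):
    # Parse the locally cached file when given, otherwise fetch from the URL
    parsed = feedparser.parse(feed_path or url)
    articles = [Article(e.get('link', ''), e.get('title', ''), e.get('summary', ''))
                for e in parsed.entries]
    return {'url': url, 'title': parsed.feed.get('title', ''), 'articles': articles}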

Sync Feeds DB from Local Cache

We download each and every feed locally and save them in a folder stamped with the day they were downloaded. We proceed that way so that we don’t have to redownload all the feeds every time we change an internal process that requires us to parse the feeds again. We can just parse the local cache of the feeds we downloaded.

The synchronization occurs by simply creating one transaction per feed using INSERT OR IGNORE, which appears to be the fastest way to only add the new feeds and ignore the ones that are already in the DB. This is also by far the simplest logic to implement and to reason about.

If the database is empty, then it will be fully populated with the cache of the DDMMYYYY provided as input.
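
A sketch of that synchronization loop; only the ID column is shown being populated, while the real function would fill the other columns from the parsed feed:

import os

def sync_feeds_db_from_cache(ddmmyyyy: str):
    conn = connect_feeds_db()
    for feed_id in os.listdir(FEEDS_PATH):  # FEEDS_PATH assumed, as above
        feed_file = os.path.join(FEEDS_PATH, feed_id, ddmmyyyy, 'feed.xml')
        if not os.path.exists(feed_file):
            continue
        # One transaction per feed; INSERT OR IGNORE skips feeds already in the DB
        conn.execute("INSERT OR IGNORE INTO feeds (id) VALUES (?)", (feed_id,))
        conn.commit()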


sync_feeds_db_from_cache

 sync_feeds_db_from_cache (ddmmyyyy:str='20092023')

Sync the feeds database from the cache. By default, the cache to use is the one from today. It is possible to use a different cache by passing a different date in the format DDMMYYYY



Update the language of the feeds

The next step is to update the primary language of a feed. This is done by checking which language has the highest number of articles.

What the following SQLite query does is group by language and count the number of articles for each language. Then we order by the count in descending order and limit the result to 1. This way, we get the language with the highest number of articles.

We have to take that result and update the feeds table with the new language.

Note: it doesn’t seem possible to do that in SQLite directly; if I am missing some feature of the query language, please propose a better solution and submit a PR.
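
A sketch of that query, assuming the articles table carries feed_id and lang columns as in the earlier schema sketch:

def get_articles_lang_per_feeds():
    conn = connect_feeds_db()
    # Count articles per (feed, language), highest counts first, so the
    # first row seen for a feed is its dominant language
    return conn.execute("""
        SELECT feed_id, lang, COUNT(*) AS cnt
        FROM articles
        GROUP BY feed_id, lang
        ORDER BY feed_id, cnt DESC
    """).fetchall()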


get_articles_lang_per_feeds

 get_articles_lang_per_feeds ()

Get the count of articles per language per feed

Update the feeds table with the new languages

The next step is to take those results and update the feeds table with the new language.
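
A sketch of that update step, consuming the rows returned above; since they are ordered by descending count, the first language seen per feed wins:

def update_feeds_with_languages(rows):
    conn = connect_feeds_db()
    seen = set()
    for feed_id, lang, _count in rows:
        if feed_id in seen:
            continue  # a higher-count language was already applied to this feed
        seen.add(feed_id)
        conn.execute("UPDATE feeds SET lang = ? WHERE id = ?", (lang, feed_id))
    conn.commit()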


update_feeds_with_languages

 update_feeds_with_languages (rows)

Update the feeds database with the language of the feed

Clean Small Web Index

This utility function is used to remove all the feeds that have been tagged as non-English. For the moment, only the ones that have been tagged with a non-English language will be removed; the ones for which the current heuristic couldn’t determine the core language will be left in the index. Further work will be required for them.


get_non_english_feeds

 get_non_english_feeds ()

Return the list of non-English feed URLs

The next step is to remove those feed URLs from the Small Web index.


get_cleaned_small_web_index

 get_cleaned_small_web_index ()

Return the cleaned small web index
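
A sketch of both steps, assuming the lang column of the feeds table as sketched earlier (an empty lang means the language could not be determined):

def get_non_english_feeds() -> list:
    conn = connect_feeds_db()
    # Feeds tagged with a language other than English; undetermined ('') feeds stay
    rows = conn.execute("SELECT url FROM feeds WHERE lang NOT IN ('', 'en')").fetchall()
    return [r[0] for r in rows]

def get_cleaned_small_web_index() -> list:
    non_english = set(get_non_english_feeds())
    return [url for url in get_small_web_feeds() if url not in non_english]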

Validate Small Web Index File

One thing that needs to be done is to check every incoming PR of the smallweb repository to see whether the new feeds proposed by contributors are valid or not. To enable this in a PR check, we will add a few functions here to validate a new proposed index file against the one on the main branch.


diff_index_file

 diff_index_file (new_index_file:str)

Diff an input index file with the one currently on the main branch of the SmallWeb repository
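
A sketch of the diff, assuming the proposed index file has one feed URL per line, like the main branch index:

def diff_index_file(new_index_file: str) -> list:
    # Feeds present in the proposed file but absent from the main branch index
    with open(new_index_file) as f:
        proposed = {line.strip() for line in f if line.strip()}
    return sorted(proposed - set(get_small_web_feeds()))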

Now that we have the list of new feeds relative to what is currently in the index, the next step is to make sure that those feeds are valid according to the Kagi Small Web index guidelines. The first thing we validate is that the feed is an English feed. Other validation checks could be added in the future.


is_feed_english

 is_feed_english (url:str)

Validate a feed from a given url is an English feed
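
A sketch of that check, reusing detect_language from above; pulling the text from entry titles and summaries is an assumption:

import feedparser

def is_feed_english(url: str) -> bool:
    parsed = feedparser.parse(url)
    # Concatenate entry titles and summaries, then run language detection
    text = ' '.join(e.get('title', '') + ' ' + e.get('summary', '')
                    for e in parsed.entries)
    return detect_language(text) == 'en'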

Given a new index file, the validate function will check which feeds are new, and will fetch and parse each of them to determine their validity. An empty list will be returned if all the feeds are valid; otherwise a list of the invalid feeds will be returned.


validate_new_index_file

 validate_new_index_file (new_index_file:str)

Validate a new index file by checking that all the feeds are in English. Returns an empty list if the new feeds are all valid. Returns a list of URLs, one for each feed that is not valid.
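
Tying the previous pieces together, a sketch of the validation could be as simple as:

def validate_new_index_file(new_index_file: str) -> list:
    # Invalid feeds are the new ones that fail the English check
    return [url for url in diff_index_file(new_index_file)
            if not is_feed_english(url)]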