Feeds
Imports
Feeds DB
We will want to save different kinds of information related to the feeds we process. We will save that information locally in a lightweight SQLite database. Here are the kinds of things we will want to save (a schema sketch follows the list):
- feed’s ID (primary key)
- its language
- number of entries
- last time we downloaded it
- type of feed
- feed’s URL
- feed’s title
- feed’s description
- feed’s author
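As a sketch only, assuming column names that mirror the list above (the project’s actual schema may differ), the feeds table could look like this:

```python
import sqlite3

def create_feeds_db_sketch(conn: sqlite3.Connection):
    # Sketch only: column names are assumptions that mirror the list above.
    conn.execute("""CREATE TABLE IF NOT EXISTS feeds (
        id              TEXT PRIMARY KEY,  -- feed's ID, built from its URL
        lang            TEXT,              -- its language
        entries_count   INTEGER,           -- number of entries
        last_downloaded TEXT,              -- last time we downloaded it
        feed_type       TEXT,              -- type of feed (e.g. 'rss' or 'atom')
        url             TEXT,              -- feed's URL
        title           TEXT,              -- feed's title
        description     TEXT,              -- feed's description
        author          TEXT               -- feed's author
    )""")
    conn.commit()
```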
Connect to the Database
connect_feeds_db
connect_feeds_db ()
Connect to the feeds database
Create DB
create_articles_db
create_articles_db (conn:sqlite3.Connection)
Create the articles database
create_feeds_db
create_feeds_db (conn:sqlite3.Connection)
Create the feeds database
Sync Feeds
The local feeds need to be synchronized with the Small Web index. Most of them will be new, but it is possible that some of the previous feeds get removed from the feeds index. In that case, we have to remove the feed from the local system and the SQL database. The process is as follows:
- check if some of the feeds got removed from the index
- if so, remove the feed from the local system
- remove the feed from the SQL database
- if the feed is not already on the file system, create a unique folder name for each new feed
- create a DDMMYYYY folder under the ID of the feed in the FEEDS_PATH folder
- download the feed’s file in that folder (see the sketch after the folder layout below)
The local folder and file structure should be:
FEEDS_PATH
└── feed_unique_folder
    ├── DDMMYYYY
    │   └── feed.xml
    └── DDMMYYYY
        └── feed.xml
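A minimal sketch of the folder-creation and download steps, assuming a FEEDS_PATH constant and the get_feed_id_from_url and download_feed helpers documented below (the real sync_feeds also diffs the index and updates the DB):

```python
import os
from datetime import datetime

def save_feed_locally(url: str):
    # Hypothetical helper: writes FEEDS_PATH/<feed_id>/<DDMMYYYY>/feed.xml,
    # assuming download_feed returns the feed document as text.
    feed_id = get_feed_id_from_url(url)
    day_folder = os.path.join(FEEDS_PATH, feed_id, datetime.now().strftime('%d%m%Y'))
    os.makedirs(day_folder, exist_ok=True)
    with open(os.path.join(day_folder, 'feed.xml'), 'w', encoding='utf-8') as f:
        f.write(download_feed(url))
```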
Get Feeds
The first step is to get the list of feeds for the Small Web. That list is available from the Kagi Small Web index. Then, for each feed in the list, we will download and save it locally in the FEEDS_PATH folder.
get_small_web_feeds
get_small_web_feeds ()
Get smallweb feeds from KagiSearch’s github repository
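A sketch of how this could be implemented, assuming the index is the plain-text smallweb.txt file of the kagisearch/smallweb GitHub repository, with one feed URL per line (the file name and format are assumptions):

```python
import urllib.request

def get_small_web_feeds() -> list:
    # Assumed location of the raw index file on GitHub.
    index_url = 'https://raw.githubusercontent.com/kagisearch/smallweb/main/smallweb.txt'
    with urllib.request.urlopen(index_url) as resp:
        text = resp.read().decode('utf-8')
    return [line.strip() for line in text.splitlines() if line.strip()]
```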
Tests
assert len(get_small_web_feeds()) > 0
Feed ID
We build the unique ID of a feed from its URL. We use the following steps:
- For every character, if it is not an alphanumeric character, we replace it with a -
This method is used to make sure we can use the ID to create files and directories on the local file system, and as a primary key in the DB, while keeping the ID readable. It could produce duplicate IDs if a non-alphanumeric character is the only differentiator between two URLs, in which case both characters will be replaced by a - and the IDs will clash. But this is unlikely in the short term and is good enough for now.
get_feed_id_from_url
get_feed_id_from_url (url:str)
Get the feed id from a feed url
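A minimal sketch of that substitution using the standard library:

```python
import re

def get_feed_id_from_url(url: str) -> str:
    # Replace every non-alphanumeric character with a '-'.
    return re.sub(r'[^a-zA-Z0-9]', '-', url)
```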
Tests
assert get_feed_id_from_url('https://example.com/feed.xml') == 'https---example-com-feed-xml'
Process Removed Feeds From Index
It is possible that previously downloaded feeds get removed from the Small Web index. In this case, we get the latest version of the Small Web index, detect which feeds were removed, and remove them from the file system and the SQL database.
gen_ids_index
gen_ids_index (index:list)
Return a list of IDs of the feeds in the index
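This maps directly onto get_feed_id_from_url; a one-line sketch:

```python
def gen_ids_index(index: list) -> list:
    # One ID per feed URL in the index.
    return [get_feed_id_from_url(url) for url in index]
```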
Tests
index = ['https://example.com/feed.xml']
index2 = gen_ids_index(index)
assert index2 == ['https---example-com-feed-xml']
process_removed_feed_from_index
process_removed_feed_from_index (index:list)
Process all the feeds that got removed from the SmallWeb index
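One plausible implementation, sketched under the assumption that feed IDs double as folder names under FEEDS_PATH and as primary keys in the feeds table:

```python
import os
import shutil

def process_removed_feed_from_index(index: list):
    # IDs still present in the latest index.
    current_ids = set(gen_ids_index(index))
    conn = connect_feeds_db()
    for feed_id in os.listdir(FEEDS_PATH):
        if feed_id not in current_ids:
            # Drop the local files and the DB row for the removed feed.
            shutil.rmtree(os.path.join(FEEDS_PATH, feed_id))
            conn.execute('DELETE FROM feeds WHERE id = ?', (feed_id,))
    conn.commit()
```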
Download a Feed
download_feed
download_feed (url:str)
Download a feed from a given url
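A standard-library sketch (the actual implementation may use a different HTTP client and richer error handling):

```python
import urllib.request

def download_feed(url: str) -> str:
    # Return the raw feed document as text; network errors are left to the caller.
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode('utf-8', errors='replace')
```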
Sync all the feeds from the index
sync_feeds
sync_feeds ()
Sync all feeds from smallweb
Language Detection
We use the langdetect library to detect the language of a feed, via its detect method. We tried other avenues, like Hugging Face models, but the language detection accuracy and the processing performance did not justify the additional complexity for now (results were worse and much slower). You can check the file 01_language_detection.ipynb for more details.
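Given the tests below, a plausible sketch strips markup first and returns an empty string when the remaining text is too short to classify reliably; the tag-stripping regex and the 256-character threshold are assumptions:

```python
import re
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def detect_language(text: str) -> str:
    # Strip HTML tags so markup doesn't skew detection.
    stripped = re.sub(r'<[^>]+>', '', text)
    # Too-short inputs are unreliable; return '' rather than guessing.
    if len(stripped) < 256:
        return ''
    try:
        return detect(stripped)
    except LangDetectException:
        return ''
```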
detect_language
detect_language (text:str)
Detect the language of a given text
Tests
assert detect_language('This is a test') == ''
assert detect_language('This is a test' * 128) == 'en'
assert detect_language('Ceci est un test') == ''
assert detect_language('Ceci est un test' * 128) == 'fr'
assert detect_language('これはテストです') == ''
assert detect_language('これはテストです' * 128) == 'ja'
assert detect_language('이것은 테스트입니다') == ''
assert detect_language('이것은 테스트입니다' * 128) == 'ko'
assert detect_language('<br /><br /><br /><br /><br /><br /><br /><br /><br />This is a test') == ''
assert detect_language('<br /><br /><br /><br /><br /><br /><br /><br /><br />This is a test' * 128) == 'en'
Parse a Local Feed
For any given feed URL, let’s parse the local feed we downloaded for it and return an internal data structure that represents it, whether it is an RSS or an Atom feed. The internal representation of a small web article is a namedtuple.
parse_feed
parse_feed (url:str, feed_path:str=None)
Parse a feed from a given path and url
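A sketch of what this might look like with the feedparser library, which parses both RSS and Atom; the Article fields are assumptions based on the information stored in the DB:

```python
from collections import namedtuple
import feedparser

# Hypothetical internal representation; field names are assumptions.
Article = namedtuple('Article', ['feed_id', 'url', 'title', 'content', 'lang'])

def parse_feed(url: str, feed_path: str = None):
    # feedparser accepts a local file path as well as a URL.
    parsed = feedparser.parse(feed_path if feed_path else url)
    feed_id = get_feed_id_from_url(url)
    return [Article(feed_id=feed_id,
                    url=entry.get('link', ''),
                    title=entry.get('title', ''),
                    content=entry.get('summary', ''),
                    lang=detect_language(entry.get('summary', '')))
            for entry in parsed.entries]
```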
Sync Feeds DB from Local Cache
We download each and every feed locally and save it in a folder stamped with the day it was downloaded. We proceed that way so that we don’t have to re-download all the feeds every time we change an internal process that requires us to parse the feeds again; we can just parse the local cache of the feeds we downloaded.
The synchronization occurs by simply creating one transaction per feed using INSERT OR IGNORE, which appears to be the fastest way to only add the new feeds and ignore the ones that are already in the DB. This is also by far the simplest logic to implement and to reason about.
If the database is empty, then it will be fully populated with the cache of the provided DDMMYYYY as input.
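The transaction-per-feed idea in a sketch, using a hypothetical helper with column names matching the schema sketch above (assumptions):

```python
def insert_feed_if_new(conn, feed_id: str, url: str, title: str):
    # INSERT OR IGNORE leaves existing rows untouched, so re-running the sync
    # against the same cache is a cheap no-op for feeds already in the DB.
    conn.execute('INSERT OR IGNORE INTO feeds (id, url, title) VALUES (?, ?, ?)',
                 (feed_id, url, title))
    conn.commit()
```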
sync_feeds_db_from_cache
sync_feeds_db_from_cache (ddmmyyyy:str='20092023')
Sync the feeds database from the cache. By default, the cache used is the one from today. It is possible to use a different cache by passing a different date in the format DDMMYYYY
Update the language of the feeds
The next step is to update the primary language of a feed. This is done by checking which language has the highest number of articles for that feed.
What the following SQLite query does is group the articles by language and count the number of articles for each language. Then we order by the count in descending order and limit the result to 1. This way, we get the language with the highest number of articles.
We then have to take that result and update the feeds table with the new language.
Note: it doesn’t seem possible to do that in SQLite directly; if I am missing some feature of the query language, please propose a better solution and submit a PR.
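A sketch of that query, with table and column names matching the earlier sketches (assumptions); the per-feed selection of the top language then happens in Python:

```python
def get_articles_lang_per_feeds():
    # Sketch only: rows come back ordered by count within each feed, so the
    # first row seen for a feed carries its most frequent language.
    conn = connect_feeds_db()
    return conn.execute("""
        SELECT feed_id, lang, COUNT(*) AS cnt
          FROM articles
         GROUP BY feed_id, lang
         ORDER BY feed_id, cnt DESC
    """).fetchall()
```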
get_articles_lang_per_feeds
get_articles_lang_per_feeds ()
Get the count of articles per language per feed
Update the feeds table with the new languages
The next step is to take those results and update the feeds table with the new language.
update_feeds_with_languages
update_feeds_with_languages (rows)
Update the feeds database with the language of the feed
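A sketch that pairs with the query above: since rows are ordered by count within each feed, the first row seen per feed is taken as its top language (an assumption carried over from the previous sketch):

```python
def update_feeds_with_languages(rows):
    conn = connect_feeds_db()
    seen = set()
    for feed_id, lang, _count in rows:
        if feed_id not in seen:
            # First row per feed is its most frequent language.
            conn.execute('UPDATE feeds SET lang = ? WHERE id = ?', (lang, feed_id))
            seen.add(feed_id)
    conn.commit()
```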
Clean Small Web Index
This utility function is used to remove all the feeds that have been tagged as non-English. For the moment, only the ones that have been tagged with a non-English language will be included in that list; the ones for which the current heuristic couldn’t determine the core language will be left in the index. Further work will be required for them.
get_non_english_feeds
get_non_english_feeds ()
Return the list of non-English feed URLs
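A sketch, following the note above that feeds with an undetermined language are kept:

```python
def get_non_english_feeds() -> list:
    # Only explicitly non-English feeds are returned; '' (undetermined) is kept.
    conn = connect_feeds_db()
    rows = conn.execute("SELECT url FROM feeds WHERE lang NOT IN ('en', '')").fetchall()
    return [r[0] for r in rows]
```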
Next step is to remove the feeds URLs from the Small Web index.
get_cleaned_small_web_index
get_cleaned_small_web_index ()
Return the cleaned small web index
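A sketch of the filtering step:

```python
def get_cleaned_small_web_index() -> list:
    # Drop the non-English URLs from the current index.
    non_english = set(get_non_english_feeds())
    return [url for url in get_small_web_feeds() if url not in non_english]
```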
Validate Small Web Index File
One thing that needs to be done is to check every incoming PR of the smallweb repository to see whether the new feeds proposed by contributors are valid or not. To enable this as a PR check, we will add a few functions here to validate a newly proposed index file against the one on the main branch.
diff_index_file
diff_index_file (new_index_file:str)
Diff an input index file with the one currently on the main branch of the SmallWeb repository
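A set-difference sketch, assuming both files are plain-text lists of feed URLs and that get_small_web_feeds() returns the version from the main branch:

```python
def diff_index_file(new_index_file: str) -> list:
    # URLs present in the proposed file but not in the current index.
    with open(new_index_file, encoding='utf-8') as f:
        new_urls = {line.strip() for line in f if line.strip()}
    return sorted(new_urls - set(get_small_web_feeds()))
```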
Now that we have the list of new feeds from what is currently in the index, the next step is to make sure that those feeds are valid according to the Kagi Small Web index guidelines. The first thing we validate is to make sure the feed is an English feed. Other validation checks could be added in the future.
is_feed_english
is_feed_english (url:str)
Validate a feed from a given url is an English feed
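A sketch built on the helpers above; treating only an explicit 'en' majority among the articles as English is an assumption:

```python
def is_feed_english(url: str) -> bool:
    # Parse straight from the URL and check the dominant article language.
    articles = parse_feed(url)
    langs = [a.lang for a in articles if a.lang]
    return bool(langs) and max(set(langs), key=langs.count) == 'en'
```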
Given a new index file, the validate function will check which feeds are new, then get and parse each of them to determine their validity. An empty list will be returned if all the feeds are valid; otherwise, a list of the invalid feeds will be returned.
validate_new_index_file
validate_new_index_file (new_index_file:str)
Validate a new index file by checking that all the feeds are in English. Returns an empty list if the new feeds are all valid. Returns a list of URLs for each feed that is not valid.
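Putting the pieces together, a sketch of the validation flow:

```python
def validate_new_index_file(new_index_file: str) -> list:
    # A feed is invalid (for now) only if it is detected as non-English.
    return [url for url in diff_index_file(new_index_file)
            if not is_feed_english(url)]
```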