# small-web-dataset
The Small Web Dataset is a command line tool used to generate a dataset by aggregating all the data from the Kagi Small Web index.
What is the Small Web? The Small Web is the web of independent websites that are not part of the big tech platforms. Here are some more references about the concept [1][2][3][4][5].
There are different purposes for this tool and the dataset it creates:

- help analyze the Kagi Small Web index, to detect and eventually remove the sites that don't comply with the policy of the index
- create a dataset of all the sites that compose the index. This dataset is a very specialized subset of websites that are created and maintained by independent people, mostly old school bloggers. This dataset can be used for different specialized ML training tasks, for example to train a classifier that distinguishes Small Web sites from Big Web sites (a sketch follows this list), etc.
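As an illustration of that last use case, here is a minimal sketch of how such a classifier could be bootstrapped from the dataset. The `articles(content)` table, the database file path, and the hardcoded Big Web texts are all assumptions made for the example; the actual schema produced by the tool may differ, and real negative examples would have to be collected separately:

```python
# Minimal sketch: bootstrap a Small Web vs. Big Web text classifier.
# The `articles(content)` table is a hypothetical schema; the actual
# tables produced by small-web-dataset may differ.
import sqlite3

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positive (Small Web) examples: article texts from the SQLite dataset.
with sqlite3.connect("path/to/dataset.db") as conn:  # file created by the tool
    small_web = [row[0] for row in conn.execute("SELECT content FROM articles")]

# Negative (Big Web) examples are not part of the dataset and must be
# collected elsewhere; hardcoded here only to keep the sketch runnable.
big_web = [
    "Subscribe now to unlock exclusive premium content on our platform.",
    "Trending today: ten celebrity moments you have to see.",
]

# TF-IDF features + logistic regression: a simple, strong text baseline.
model = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    LogisticRegression(max_iter=1000),
)
model.fit(small_web + big_web, [1] * len(small_web) + [0] * len(big_web))

print(model.predict(["A hand-written post from a personal blog..."]))
```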
## Install
To install the command line tool, you simply have to:

```sh
git clone https://github.com/fgiasson/small-web-dataset.git
cd small-web-dataset
make build
make install-local-build
```

This will clone the repository, build the command line tool, and install it in your local Python environment.
## Configure
You have to make the following environment variables available in your environment:

| Variable | Description |
|---|---|
| `FEEDS_PATH` | The path where you want to save all the feeds on your local file system |
| `DB_PATH` | The path where you want to save the SQLite dataset on your local file system |
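For example, in a POSIX shell you could export them like this (the paths below are placeholders, use whatever locations suit your setup):

```sh
export FEEDS_PATH=~/small-web-dataset/feeds
export DB_PATH=~/small-web-dataset/db
```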
## How to use
You can make sure that the command line tool is installed, and that the latest version is available, by running:

```sh
small-web-dataset version
```

You can get the help documentation by running:

```sh
small-web-dataset --help
```

You can check the current configuration options for the tool in the current environment by running:

```sh
small-web-dataset config
```

To create the dataset, you simply have to run the following command:

```sh
small-web-dataset sync-feeds
```

This command will do three things:
- it will download all the RSS and Atom feeds from the Kagi Small Web index in the `FEEDS_PATH` folder
- it will read all the local feeds files and import them in a local SQLite database in the `DB_PATH` folder
- it will infer the core language of a feed from the language used to write the articles in the feed, and it will add this information to the database (a quick way to inspect the result is sketched after this list)
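Once the sync completes, one quick way to sanity-check the result is to query the SQLite database directly. This is a minimal sketch that assumes a hypothetical `feeds` table with a `lang` column holding the inferred language, and a hypothetical database file name under `DB_PATH`; the actual schema may differ:

```python
# Minimal sketch: count feeds per inferred language in the dataset.
# Assumes a hypothetical `feeds(lang)` table; the schema actually
# produced by `small-web-dataset sync-feeds` may differ.
import os
import sqlite3

# DB_PATH points at the folder holding the SQLite dataset; the file
# name used here is an assumption.
db_file = os.path.join(os.environ["DB_PATH"], "feeds.db")

with sqlite3.connect(db_file) as conn:
    rows = conn.execute(
        "SELECT lang, COUNT(*) FROM feeds GROUP BY lang ORDER BY COUNT(*) DESC"
    )
    for lang, count in rows:
        print(f"{lang}: {count} feeds")
```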
Optionally, if you already have a local cache of the feeds and you only want to update/recreate the database, you simply have to specify the `DDMMYYYY` folder of the feeds you want to process:

```sh
small-web-dataset sync-feeds 18092023
```