Garble

Garble web time machine

Garble is an internet wayback machine ready for your local setup! It is written in Haskell and backed by a Postgres database.

Dependencies: persistent, conduit, yesod, warp, http-conduit, tagstream-conduit et al.

Get started

Database setup

Install and start the PostgreSQL server. Create a user “garble” and a database “garble” (the user “garble” should be its owner; it also needs login capability). The schema will be automatically created on first start.
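
The database setup above can be sketched with PostgreSQL's standard command-line tools. This is a minimal sketch, not part of Garble: the helper name garble_db_setup is hypothetical, and the password prompt is an assumption, since the text does not say how Garble authenticates against the database.

```shell
# Hypothetical helper, not shipped with Garble.
# Run it as a PostgreSQL superuser (often the "postgres" system user).
garble_db_setup() {
    # Create the role "garble" with login capability.
    # --pwprompt asks for a password (assumption: password authentication is used).
    createuser --login --pwprompt garble || return 1
    # Create the database "garble" and make the role "garble" its owner.
    createdb --owner=garble garble
}
```

Call garble_db_setup once before the first start; the schema itself is still created by Garble.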

Garble

Download and compile Garble using cabal. Example:

cabal sandbox init
cabal install --dependencies-only
cabal build

Use the admin tool to set up the database schema and your preferences. Example:

cabal run admin -- shell
[... migrations ...]
0/0> set directory "/var/garble"
Okay, set.
0/0> set admin "me@myself.com"
Okay, set.
0/0> set recent for 96 hours
Okay, set.

If you like, you can already add a download job:

0/0> enqueue "https://example.com/"
New: "https://example.com/"
Job id: 1

In the default configuration, Garble will recurse three levels on the same host, and one level into outgoing links. TODO: document how to change this.

The daemons

To actually execute the queued download jobs you need to start the garbled daemon:

cabal run garbled

Garbled will now download the queued URIs and store the documents on disk. HTML documents are searched for hyperlinks and included resources (such as style sheets, images, and scripts), which are then added to the download queue.

For the web interface we need yet another daemon:

cabal run delivery

The delivery daemon will listen on localhost:3020 and accept the following routes:

/c/${CID}                 -- get the document content with content id ${CID}
/d/${DID}                 -- get the content of the document with document id ${DID}
/h/${HASH}                -- get the document content with store hash ${HASH}
/t/${DATETIME}?uri=${URI} -- get the document content for URI ${URI} closest to ${DATETIME}
/l?uri=${URI}             -- get the last known document content for URI ${URI}

Human users will most commonly use one of the latter two routes. The content id and document id routes are mainly useful for debugging. The hash route is used for included style sheets and images.
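
As a small illustration of the two routes meant for human users, the following sketch builds the request URLs. The function names garble_at and garble_latest and the GARBLE_BASE variable are hypothetical helpers, not part of Garble. The URI is passed through unescaped, as in the examples in this document; whether the delivery daemon expects percent-encoding for URIs containing characters such as & or # is not documented here.

```shell
# Base address of the delivery daemon as stated above; override if yours differs.
GARBLE_BASE="${GARBLE_BASE:-http://localhost:3020}"

# Print the URL for the content of URI $2 closest to date-time $1 (the /t/ route).
garble_at() {
    printf '%s/t/%s?uri=%s\n' "$GARBLE_BASE" "$1" "$2"
}

# Print the URL for the last known content of URI $1 (the /l route).
garble_latest() {
    printf '%s/l?uri=%s\n' "$GARBLE_BASE" "$1"
}

# Example requests (these need a running delivery daemon):
#   curl "$(garble_at 2018-03-01T20:00:00 http://example.com/)"
#   curl "$(garble_latest http://example.com/)"
```

If percent-encoding turns out to be necessary, curl can do it itself, for example with curl -G --data-urlencode "uri=..." followed by the route URL.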

If the delivered content is an HTML document, all contained hyperlinks and resource references are adapted to point to the closest matching known content. If the target is not yet known to Garble, an absolute URI to the original location is inserted.

Example: You request “/t/2018-03-01T20:00:00?uri=http://example.com/”. The page originally contains a hyperlink to http://example.net/, which is known to Garble. Hence the link is replaced by “/t/2018-03-01T20:00:00?uri=http://example.net/”.

It might also contain a hyperlink to the relative path “/some/strange/things”, which is not tracked by Garble. Hence the link is replaced by “http://example.com/some/strange/things”.

Admin stuff

The ‘admin’ tool is specifically designed to provide an easy-to-use interface for common administration tasks.

Add a new download job

The ‘enqueue’ command adds a new job to the queue:

42/255> enqueue "http://example.com/"
New: "http://example.com/"
Job id: 256

Change the store location

The store location may be changed using the ‘set directory’ command:

42/256> set directory "/the/new/location"

However, this only affects new files. To move already downloaded files to the new location, use the ‘move’ command in the admin tool:

42/256> move

Both actions may be combined:

42/256> move "/the/new/location"

Remove duplicate content

The downloader automatically tries to avoid downloading duplicate content by observing the Last-Modified header. However, some sites don’t provide the Last-Modified header, and the same content is sometimes found at different URIs. Thus, some duplicate content will pile up over time. As a countermeasure, Garble provides a deduplication command:

42/256> dedup
Removed duplicate /z/garble//2018-03-25/2018032504df9b0f9c578733239a891bbfbd98518cda16a1670b655921ed5f3928d10ef88f633e0b90e61d939f5da4232d6b1803cbae62d59a51c0a4297ff34b8bd4d760alldagif.gz
Removed duplicate /z/garble//2018-03-25/20180325a993084c7423d75dbf6648f4e5a375acf2830fa56d5cd1903043f6b4d03d0b57e7a4df600ae631b4804ebb138d8728109aceb7a009b9eb3228550b9f592d6c9653322gif.gz
[...]

List the current queue

Table entries have the following order: job id, date and time of job creation, date and time of the job being queued for immediate execution, permitted recursion levels (same host/outgoing), URI.

42/256> list queue
43  2018-03-24T15:38:00  2018-03-25T03:05:19  2/1  https://example.com/home
44  2018-03-24T15:38:00  2018-03-25T03:05:20  1/1  https://example.net/static/tree.png
[...]

List recently finished jobs

Table entries have the following order: job id, date and time of job creation, date and time of the job being finished, measured file size, transmitted MIME type, URI.

42/256> list recent
42  2018-03-24T15:38:34  2018-03-25T03:09:02  137 KiB  text/html; charset=utf-8  https://example.com/articles/2/
41  2018-03-24T15:38:34  2018-03-25T03:09:01  146 KiB  image/png  https://example.net/static/car.png

Permanent progress update

Use “follow” as a command-line argument to get a progress update every 30 seconds:

cabal run admin -- follow
Finished 117270 jobs out of 151218, that's 77 %
Finished 117338 jobs out of 151218, that's 77 %
Finished 117402 jobs out of 151218, that's 77 %
Finished 117461 jobs out of 151218, that's 77 %
[...]
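
The percentages in the sample output are consistent with truncating integer division: 117270 of 151218 is about 77.55 %, which is shown as 77 rather than 78. This is an inference from the output above, not something confirmed from Garble's source. A minimal sketch of that arithmetic, with the hypothetical helper name progress_percent:

```shell
# Hypothetical helper: percentage of finished jobs, truncated toward zero.
# $1 = finished jobs, $2 = total jobs
progress_percent() {
    if [ "$2" -eq 0 ]; then
        # Empty queue: avoid division by zero and report 0.
        echo 0
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

progress_percent 117270 151218   # prints 77
```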
Older versions Editor Timestamp
Garble m@doomanddarkness.eu 2018-04-20 22:07:46 UTC