A new Ubuntu wiki, Part 4: Archiving

We have made a public, read-only archive of the two Ubuntu wikis, in preparation for their upcoming deprecation.

:books: Posts in the Ubuntu wiki project series

  1. Announcing the project to make a new wiki
  2. Overview of features in the new wiki
  3. How the wiki relates to other content platforms
  4. Making public archives of the old Ubuntu wikis :backhand_index_pointing_left:

Finding the archives

You can now access, read, and clone the wiki archives on GitHub:

https://github.com/ubuntu/wiki-archives

The archive includes pages from wiki.ubuntu.com and help.ubuntu.com/community/CommunityHelpWiki, with wiki pages organized into alphanumeric folders.

Searching the archives

You can find pages quickly using GitHub’s built-in search:

Searching through the archive on GitHub.

:information_source: Note
We have made every effort to retain all non-empty pages.
If any legitimate page is missing, it is not intentional, and we will try to restore it.

Reading the archives

The files have been converted to MediaWiki syntax, which renders nicely on GitHub.

This generally results in nice, well-formatted pages:

Downloading the archives

We reduced the size of the archive from about 60GB to 0.34GB.

Cloning the wiki-archives repo to my local machine takes just over 10 seconds:

time git clone git@github.com:ubuntu/wiki-archives.git
...
...
Receiving objects: 100% (124534/124534), 150.47 MiB | 24.88 MiB/s, done.
...
Updating files: 100% (50072/50072), done.
________________________________________________________
Executed in   12.96 secs    fish           external

Worried that content was lost because of the reduction in size?

Read on.

We have provided tarballs that contain images, attachments, and multiple page versions. Removing these was part of how we achieved the size reduction, which was necessary for making a plaintext archive that was easy to search and download.

Searching locally

Fuzzy-searching through the cloned wikis in an editor is fast:

Fuzzy searching "shuttle" in Neovim.
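If you would rather script a search than use an editor, something like the following could work. It is a minimal sketch, not the tooling shown above: it does a plain case-insensitive substring match rather than fuzzy matching, and the function name `search_archive` is hypothetical.

```python
from pathlib import Path


def search_archive(root, term):
    """Return sorted paths (relative to root) of files whose text contains term.

    Matching is a case-insensitive substring test, not a fuzzy match.
    """
    root = Path(root)
    needle = term.lower()
    hits = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip directories and anything inside Git's own .git directory.
        if not path.is_file() or ".git" in relative.parts:
            continue
        # Ignore undecodable bytes so one odd file does not stop the search.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            hits.append(relative.as_posix())
    return sorted(hits)
```

Assuming the repository was cloned into `wiki-archives`, calling `search_archive("wiki-archives", "shuttle")` would return the matching file paths relative to the clone.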

Getting tarballs

Tarballs are included in the Releases page of the GitHub repo:

https://github.com/ubuntu/wiki-archives/releases

These might be useful if you:

  • Want wiki pages in the original MoinMoin syntax
  • Need images and attachments from the original wiki
  • Require individual page versions

Although the tarballs contain images and attachments, they are still significantly smaller than the original backups of the wikis, as files including user information, caches, and other data have been removed.
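Because the tarballs are still sizeable, it can help to look inside one before extracting it. The sketch below lists image files in a downloaded tarball using Python's standard `tarfile` module. The function name `list_members` and the default list of image suffixes are assumptions for illustration; the post does not describe the internal layout of the tarballs.

```python
import tarfile


def list_members(tarball_path, suffixes=(".png", ".jpg", ".jpeg", ".gif")):
    """Return sorted names of regular files in the tarball that end with one of suffixes."""
    # The default read mode detects gzip, bzip2, or xz compression automatically.
    with tarfile.open(tarball_path) as tar:
        return sorted(
            member.name
            for member in tar
            if member.isfile() and member.name.lower().endswith(suffixes)
        )
```

Pass the path of whichever tarball you downloaded from the Releases page, and adjust `suffixes` if you are looking for other attachment types.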

Purpose of the Ubuntu wiki archives

Preserving Ubuntu history

The Ubuntu wiki has existed almost as long as Ubuntu itself. For over 20 years, members of the community and Ubuntu developers have contributed to it.

The archives will serve to preserve this important part of Ubuntu’s history and the contributions made by people through the years.

Supporting future migrations

Many of the wiki pages that received the highest traffic have already been migrated to new homes, such as the Ubuntu Project documentation.

Still, there are likely pages that are viewed infrequently but that could still be important when an individual or team has a particular issue.

If you are worried that a page has been lost, you can find it in the wiki archive.

If you are worried that you missed an opportunity to migrate your team’s content, you can find it in the wiki archive.

How the wikis were archived

Our goal was to make the archive available, searchable, and cloneable as a GitHub repository.

Challenges with the wiki sources

The combined size of both wikis was enormous.

We could have simply made tarballs of the raw backups available. From my own experience, however, downloading, extracting and even just deleting the full backups was demanding:

Deleting one of the wikis from an HDD took hours.

The backups were also difficult to navigate, given the complexity of the file system, the existence of multiple page versions, and the URL-encoding of page names.

For anyone wishing to find a page, the process would be slow and difficult.

Reducing the size

To reduce the size, every image, attachment, and cache was removed.

In addition, any non-latest version of a page was deleted.

Lastly, the number of spam pages, some of which existed for years, was reduced.
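The "keep only the latest version" step can be sketched roughly as follows. This is an illustration rather than the script that was actually used: it assumes a MoinMoin-style layout where each page directory has a `revisions` subdirectory containing numbered files such as `00000001`, and the name `latest_revision` is hypothetical.

```python
from pathlib import Path


def latest_revision(page_dir):
    """Return the highest-numbered file in page_dir/revisions, or None if there is none."""
    rev_dir = Path(page_dir) / "revisions"  # assumed directory name
    if not rev_dir.is_dir():
        return None
    # Keep only files whose names are purely numeric, e.g. "00000012".
    revisions = [p for p in rev_dir.iterdir() if p.is_file() and p.name.isdigit()]
    # Compare numerically so the result does not depend on zero-padding.
    return max(revisions, key=lambda p: int(p.name), default=None)
```

Everything in `revisions` other than the returned file would then be a candidate for removal from the plaintext archive.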

Cleaning up the files

Each wiki had a pages directory, containing subdirectories named after wiki pages, themselves consisting of a subdirectory with different versions of the pages.

The directories named after the pages were almost unreadable due to some type of URL encoding.

We removed the encoding and simplified the directory tree: each wiki now consists of an alphanumeric list of folders, each containing individual files named after the wiki page itself.
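As a rough illustration of removing the encoding, the sketch below decodes directory names under one specific assumption: that the names use the MoinMoin on-disk convention, where special characters are stored as hex byte groups in parentheses, such as `(2f)` for `/`. The post does not name the exact scheme, so treat this as a guess at the format; the function name `decode_page_name` is hypothetical.

```python
import re

# One or more hex byte pairs wrapped in parentheses, e.g. "(2f)" or "(c3a9)".
_HEX_GROUP = re.compile(r"\(((?:[0-9a-fA-F]{2})+)\)")


def decode_page_name(fs_name):
    """Decode parenthesized hex byte groups in a directory name as UTF-8 text."""

    def replace(match):
        raw = bytes.fromhex(match.group(1))
        # Substitute a replacement character instead of failing on invalid bytes.
        return raw.decode("utf-8", errors="replace")

    return _HEX_GROUP.sub(replace, fs_name)
```

For example, `decode_page_name("Foo(2f)Bar")` gives `Foo/Bar`. A decoded name can contain characters such as `/` that are not valid in a file name, so a real cleanup would also need a rule for mapping those to safe names.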

Converting the syntax

We decided to convert the original MoinMoin syntax to MediaWiki syntax, for the following reasons:

  • It will make it easier to migrate content to the new MediaWiki-based Ubuntu wiki, as people won’t need to do a full syntax conversion each time
  • It will make it easier to read the wiki pages on GitHub, which supports MediaWiki syntax but not MoinMoin syntax

:information_source: The syntax conversion is not perfect
There are some inconsistencies, partially because the original files themselves didn’t have consistent MoinMoin syntax. Inter-page linking also does not work in the archive. However, it is not our intention for the archives to be a perfect reading experience, nor a functioning wiki. Above all, we wanted the archives to support people who want to find and migrate content.
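To give a feel for what a syntax conversion involves, here is a small sketch that handles three common differences between MoinMoin and MediaWiki markup: nested bullet lists, external links, and inline code. It is not the converter used for the archive, it covers only a fraction of the syntax, and the function name `moin_to_mediawiki` is hypothetical.

```python
import re


def moin_to_mediawiki(text):
    """Convert a few MoinMoin constructs to MediaWiki, one line at a time."""
    converted = []
    for line in text.splitlines():
        # Bullets: MoinMoin nests by indentation, MediaWiki by the number of asterisks.
        bullet = re.match(r"^( +)\* (.*)$", line)
        if bullet:
            line = "*" * len(bullet.group(1)) + " " + bullet.group(2)
        # External links: [[http://url|label]] becomes [http://url label].
        line = re.sub(r"\[\[(https?://[^\]|]+)\|([^\]]+)\]\]", r"[\1 \2]", line)
        # Inline code: {{{text}}} on a single line becomes <code>text</code>.
        line = re.sub(r"\{\{\{(.+?)\}\}\}", r"<code>\1</code>", line)
        converted.append(line)
    return "\n".join(converted)
```

Headings such as `== Title ==` pass through unchanged because both syntaxes write them the same way. Multi-line `{{{ ... }}}` blocks, tables, and macros are left untouched, and that kind of gap is where inconsistencies tend to appear.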

Acknowledgements

I want to thank @marek-suchanek for contributions and discussions about the archiving project, @rkratky for initial advice on different approaches to archiving content, @nickbellol for first highlighting the need to find an archiving solution, and Canonical’s IS team for providing the wiki backups.

Thank you. I haven’t fully absorbed the rest of your posting.

However, I caught this short reference:

I can only imagine that such a reduction of size implies that various space-consuming elements (i.e. snapshots that were relevant to discussions) have been “dropped”. My own sense is that dropping such images would create a void, and consequently confusion, in attempts to understand, when going thru discussions which, undoubtedly often, would have been geared to identifying what is broken, how to identify, where to identify, etc.

Does anyone else feel this way and, if so, is there any way to be more selective as to the topics or categories that retain such images, and which are of a nature for which the images are too “ephemeral” to have any retention value?

Just raising a point of concern for discussion.

:slight_smile:

Upon full absorption, folks will observe that…

  • There are two sections of the original post explaining what was removed.
  • The old wiki was archived, not deleted, and the archive is available on GitHub.

I think we’re all open to suggestions on how that might be accomplished. It’s very hard to predict the rate of change in software that we don’t create.

Any community contributor who wants to browse the old wiki archive for high-quality material to copy to the new wiki is welcome to. Be sure to keep a record of what you did for your Ubuntu Membership application.

Are you saying that for any retained conversation, which included image snapshots, all of those snapshots have been retained?

If I am following you correctly, are you concerned about images/screenshots not being preserved?

As I mentioned in the original post, tarballs are provided on the Releases page of the GitHub-based archive that have all images and attachments included. Both the plaintext archive and the Release page are linked in the post. If you want access to the images, you can download the tarballs.

As mentioned in the initial announcement of the wiki project back in November 2025, the old wikis are scheduled for deprecation in August 2026. Until that time, the wikis will remain publicly available. If you want to get something from the wikis themselves in the coming months, they remain live and accessible.

Six months in advance of that date, we have made a plaintext archive available on GitHub. That archive contains virtually all pages from the wiki, except for hundreds of stubs, spam and empty pages. We removed these, along with the images, and old versions of pages, because it’s not possible to make a >50GB GitHub repo, and such an archive would be difficult for most people to search and navigate.

We did this to support people who may want to find or migrate content in future, the vast majority of which is text-based. A large number of wiki pages do not include any images. We think that a lean archive that can be efficiently searched and copied is more useful, particularly for future migrations, than an unwieldy archive that includes all of the data. And, again, we do provide tarballs for those who need them.

I’m a little confused about your use of “conversation” and “discussion”. It seems to me that most of the old wiki pages that I have encountered are authored pages rather than discussions.

Wow, that archiving work must have been a real pain. Kudos for achieving such a nice result where it’s both really usable for quick search, and also provides all the needed data in the archives. :clinking_beer_mugs:

@skia — I can confirm that, yes, it was definitely a pain :grimacing:

Cheers!
