Open-source app releases first complete copy of English Wikipedia with images (sourceforge.net)
85 points by gnosygnu on Nov 26, 2013 | hide | past | favorite | 74 comments


Some people wonder what this is good for. For me, it would have been great when I worked on a training ship as the IT support. Students were not allowed access to the internet (nor could our meager satellite uplink have supported it), but for my second stint on the ship, I wanted to see if I could provide them with Wikipedia. So, I grabbed the dump, set up MediaWiki, and imported it...and waited...for days and days, and eventually the thing loaded, and it was great. But it would have been super nice to have an easy installer to handle all that. So, yes, there are use cases out there.


That's pretty impressive. I never had the patience to sit through a full MediaWiki import for en.wikipedia.org.

Just to be clear, XOWA isn't an installer for MediaWiki, but its own app. This allows it to avoid a dependency on the entire MediaWiki tool-chain (Apache, PHP, MySQL, MediaWiki). Unfortunately, this means that XOWA has to reproduce the same logic, which is quite a challenge...


It is indeed a challenge. The MediaWiki syntax is the weirdest mess I have ever had to parse. There is no spec, real-world usage deviates significantly from the help docs, and it's a Turing-complete language with heaps of backwards-compatibility hacks. So if you have something reasonably complete and correct, then kudos to you!


Thanks. The syntax was challenging, especially all the template syntax ("{{my_template|{{{argument1|defaultvalue|{{nested_template}}}}}}}"). Fortunately, the new Lua module should eventually replace the template syntax, which should make it easier for future parsers.
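For the curious, the nesting problem can be sketched with a simple depth-counting scanner. This is a toy illustration only, not XOWA's or MediaWiki's actual algorithm (`find_templates` is an invented name), and it deliberately ignores the `{{{...}}}` parameter ambiguity that makes real wikitext parsing so messy:

```python
def find_templates(wikitext):
    """Return top-level {{...}} template spans, handling nesting.

    A toy sketch: scan for '{{' and '}}' pairs and track depth so
    nested templates like {{a|{{b}}}} resolve to one outer span.
    Real wikitext also has {{{param}}} syntax, which this ignores.
    """
    spans, stack = [], []
    i = 0
    while i < len(wikitext) - 1:
        if wikitext[i:i+2] == "{{":
            stack.append(i)       # remember where this template opened
            i += 2
        elif wikitext[i:i+2] == "}}":
            if stack:
                start = stack.pop()
                if not stack:     # only record outermost spans
                    spans.append(wikitext[start:i+2])
            i += 2
        else:
            i += 1
    return spans
```

Even this toy version shows why a single regex pass can't do the job: matching braces requires tracking depth, which regular expressions can't express.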


The visual editor uses a new parser, Parsoid, which has been implemented separately in node.js (iirc). That may be the answer...


Yup. It also has its own DOM, rather than continuously adding to one string and repeatedly running regex's on it (which is what MediaWiki does today).

I was already pretty far along with my own parser before Parsoid was usable though. (and my parser has its own DOM / hooks)
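As a toy illustration of the tree approach (nothing like the real Parsoid or XOWA internals; `Node` and its fields are invented for this sketch), the difference is that each construct becomes a typed node you can walk and transform, instead of a substring you keep re-matching:

```python
class Node:
    """Minimal wikitext parse-tree node, as a contrast to the
    string+regex approach described above (illustrative only)."""

    def __init__(self, kind, children=None, text=""):
        self.kind = kind              # e.g. 'doc', 'template', 'text'
        self.children = children or []
        self.text = text

    def render(self):
        """Render the subtree back to plain text."""
        if self.kind == "text":
            return self.text
        return "".join(c.render() for c in self.children)
```

Hooks then become a matter of visiting nodes of a given kind, rather than hoping a regex doesn't match inside some other construct.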


MediaWiki is such an astoundingly fugly piece of software.


Wouldn't it be easier to include all the original tools in a packaged form instead of reproducing their logic?


Yes, this would be the ideal approach, but it can become quite complicated (because the tool-chain needs to be installed on different machines). In addition, the official XML importer (importDump.php) is not really up to the task (slow / sometimes buggy).

If you're interested in going this route, you can look at http://www.nongnu.org/wp-mirror/. This should build a local MediaWiki instance with one click. Keep in mind that it's a bit slow: it takes two days to build simple.wikipedia.org with images. In contrast, XOWA sets this up in about 30 min


I read the next comment after this and it started with '> Space required during initial..'

And I thought ah yes 'Outer Space' that could be another location an offline version could help.


> Space required during initial import: multiply the dump file by 8. For example, for English Wikipedia, the dump size is 10 GB. You should at least have 80 GB space free for the import process

> Space required when completed: multiply the dump file by 2.5. For example, for English Wikipedia, the dump size is 10 GB. When done, your English Wikipedia will be 25 GB.

Ouch, looks like I won't be trying this here any time soon.


It's only ~3 GiB once you leave out the articles about individual Pokemon.


Wikipedia doesn't have that. For most Pokémon, they have summary pages like this

http://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon_(252%E2%80...


You can try the low-space import. There are instructions in XOWA at home/wiki/Help:Options/Import. It takes longer (8-10 hours) but only needs 35 - 40 GB. (which is still a lot).

In the end, you're still going to need about 25 GB for English Wikipedia. If you want something smaller, you can try one of the other wikis (for example, Wiktionary, Wikiquote, Wikisource, etc.) Each of these is generally about 5 GB.

Hope this helps.


It's more the bandwidth required - even on a 10 megabit (optimistic) connection, a 10GB download takes quite a while.


Fair enough. It takes about 2 - 3 hours to download for me (standard US residential connection). I sometimes let it run overnight.


If you don't need the "with images" option, have a look at Aard Dictionary [1]; the precompiled English Wikipedia is around 8 GB at the moment. It also has an Android app.

[1]: http://aarddict.org


You could try the version without images from Kiwix. I keep it on my MacBook (and also at work) to save Internet trips and for when out and about with no access.

EDIT: Just read further down the comments and noticed the mention of Kiwix, apologies for the duplication.


See also: Kiwix

For one thing, it has an Android app, and it's easy to put the whole thing on your external SD card. It also provides an index for full-text search.

http://www.kiwix.org/wiki/Main_Page


Kiwix's Android app and the full-text search are both great features.

However, I'll point out that Kiwix has not updated English Wikipedia since January 2012. Also, XOWA works directly with the Wikimedia dumps (http://dumps.wikimedia.org/backup-index.html), so it (a) is always up to date and (b) can work on any wiki (Kiwix needs to release the ZIM file first).

Also, XOWA can run from an external SD card (including FAT32-formatted ones).


Another Android option is Fastwiki [1]. No images, but provides a conversion tool to convert native Wikimedia dumps. Also works with older Android versions.

[1] http://fastwiki.qwrite.info/en/index.html


My use case is exporting a private mediawiki for use at a prison.

Kiwix is a lot more polished than this, especially the user experience.

This looks interesting because it reproduces MediaWiki's rendering very well. The text-only dump imported pretty well once I figured it out.

The Kiwix ZIM files are a pain to prepare, and they use the public-facing API, which can be pretty unreliable, especially if your MediaWiki hosting is of poor quality. It also forces you to hand-curate what you export.

Using the direct XML dumps prepared server side is much cheaper, but then you lose the curation. The image situation is unclear here. There are some really confusing instructions about it, but I wasn't able to figure it out.

I'll keep using and recommending kiwix for now, but this could be promising.


Thanks for giving it a try. Kiwix is definitely more polished in UI, especially as it has been around for 5+ years. I'd like to think that though XOWA isn't as friendly UI wise, it offers a lot more power / options.

Regarding images: there is some assembly required, but I tried to make the instructions as simple as possible. If you look at http://xowa.sourceforge.net/setup_simplewiki.html, there should be two steps:

* Download the .7z file from archive.org: http://archive.org/details/Xowa_simplewiki_2013-10-30_images...

* Unzip the .7z file to your XOWA directory. If you're on Windows and have C:\xowa as your folder, you should get a file called C:\xowa\file\simple.wikipedia.org\fsdb.main\fsdb.abc.sqlite3 as well as many others.

enwiki is a little more difficult, but only in that it requires downloading more files.

Let me know if you run into other issues. I'm going off to work now, but I'll check again later.

EDIT: I forgot to add that if you set up ImageMagick and Inkscape (installation instructions are on XOWA's Main_Page), you can download images dynamically for each article (i.e.: you don't need to download the entire image dump first)


Thanks for your reply. I did see the things you mention regarding images, but the gap is that I'm exporting a private MediaWiki, not one of the well-known wikis that you have added explicit support for.

I tried tar'ing up my images directory from the server, and unpacking them in a few locations on the filesystem that looked like likely places, but that didn't work. The filesystem layout was kinda confusing with the "user" and "wiki" separation.

How would one prepare a similar image database for an unsupported wiki? I expect this is a custom thing you prepared as opposed to the xml text dump which is a standard mediawiki dump format.

As to the ImageMagick part, it doesn't work for an unsupported wiki. Also, it would be impractical for me to manually crawl my whole site triggering downloads of images, and even if I did, it is unclear how to package and deploy the result. The deployment needs to be completely offline because there is no Internet at the prison.

Overall, setting up one of the well-known wikis is probably pretty smooth, but a private wiki requires a lot of technical knowledge about implementation details, which makes this tool impractical for unskilled users. Right now, Kiwix deployment is close to ideal: I just need to instruct the unskilled user to replace the ZIM file.

There is one small deficiency in the Kiwix deployment: the automatic index files are user-specific unless prepared in advance and recorded in the library.xml file, so in practice I had to prepare a script to make sure the index and library were right. The actual deployment is "copy ZIM files to this dir, then double-click on this script".


Hey. I just happened to check this thread and saw your response.

To answer your question, yes: the image databases were prepared with a standard Wikimedia wiki in mind. These wikis have a standard file layout of wikipedia/wikidomain/thumb/hash0/hash01/name_of_file/thumbnail_file; for example: wikipedia/commons/thumb/9/97/The_Earth_seen_from_Apollo_17.jpg/270px-The_Earth_seen_from_Apollo_17.jpg.

If you're using a MediaWiki installation, your files should be laid out similarly. You can change the XOWA config file to explicitly specify this WMF layout. XOWA allows the user to work directly with the WMF tarballs, so this should work for you as well. You can look at this thread for another user's attempts: https://sourceforge.net/p/xowa/discussion/general_archived/t... If you have questions, feel free to ask / post.
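For anyone scripting against this layout: Wikimedia shards files by the MD5 of the underscore-normalized filename (first hex digit, then first two hex digits). A minimal sketch of reconstructing a thumbnail path from the filename alone — `wmf_thumb_path` and its parameters are my own naming, not part of XOWA or MediaWiki:

```python
import hashlib

def wmf_thumb_path(project, filename, width):
    """Build a WMF-style hashed thumbnail path.

    The shard directories come from the MD5 hex digest of the
    filename with spaces replaced by underscores: first hex digit,
    then first two hex digits.
    """
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "wikipedia/%s/thumb/%s/%s/%s/%dpx-%s" % (
        project, digest[0], digest[:2], name, width, name)
```

The same function works for a stock MediaWiki install with hashed uploads enabled, since it uses the same MD5 sharding scheme.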

The other alternative is that XOWA should have the ability to read from a non-Wikimedia directory. Another user asked for this for his own private wiki: https://sourceforge.net/p/xowa/tickets/159/. In this scenario, you'd have all your files in some root directory (C:\images) and XOWA would index the directory and look up files by filename. You would probably need ImageMagick and Inkscape installed, though.

Regarding your other point: I will probably centralize all the directories, instead of spreading them out between /wiki/, /file/, /user/. I had a reason for this layout, but it's causing confusion among a few users. You could always zip the files with relative paths, and instruct the users to unzip the zip. For example, the XOWA wikiquote package is one zip file: https://archive.org/details/Xowa_enwikiquote_2013-11-19_comp.... If you unzip it in the /xowa/ dir, it will automatically put all files into relevant folders

In the end, if you have a routine set up for Kiwix, you're probably best sticking with it. Keep in mind that XOWA does offer some other nice features that you may / may not need (editable wiki pages; Wikimedia Lua code). It also offers a lot of customization. For example, one of the users added MathJax to XOWA on his own. (He then proceeded to add a lot more: sorting / collapsing, Wikidata skin, redlinks, etc.)

Let me know if you're interested, and I'll see what I can do to help. Otherwise, thanks for the use case scenario. It's definitely something I'll consider supporting in the future!


Thanks for being so responsive. I'll take a look at your suggestions later, and I'll continue this thread on the mailing list.


The Kiwix Android app is great. Unfortunately (for me), it requires Android 3.0+. I spent some time trying to port it to 2.1 for an old large format device I have, but could only get it working on 2.2.


Since Wikipedia doesn't have ads (and I'm guessing gets no real benefit from number of clicks), maybe this could be a nice way of lightening their server load and reducing some of their costs.


Probably not. In your ordinary browsing, you're not likely to download anywhere near the entirety of wikipedia. Plus, spreading out your requests over time rather than downloading it all at once is more gentle.


Good observation, but I just want to point out that the full dumps are downloaded from different servers. They are even mirrored by other institutions. See https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_...


They could set up a torrent as well and almost assuredly people will seed it full time on servers.


There is a torrent distribution channel: see https://meta.wikimedia.org/wiki/Data_dump_torrents

However, it suffers from the number one torrent issue: torrents do not tolerate change. This means that:

- When an article changes, you need to generate a new torrent

- When a new torrent describes the archive, it needs to be downloaded from scratch by all peers, so that the maximum number of peers are available for a newcomer

I hope you'll understand that this is not the official way to distribute archives...


Yeah, but they're working on incremental updates, so this should make the "from scratch" part much easier. You can look at https://www.mediawiki.org/wiki/Incremental_dumps and http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-Aug... if you're interested.


I wouldn't mind having something like this for long flights without wifi.

Edit: Kiwix looks nice too, thanks!


As a thought experiment, assume you were downloading and printing on low acidic paper (or otherwise storing for the long term) the entire contents of Wikipedia for insurance against some near extinction event. Is there a way to allow future generations to correctly interpret the arbitrary text on a page into meaningful information with only using "the page" as a medium?

(I tried googling the topic, but I'm very lost as to even where I should start. The main question that I'm interested in could be summarized as, "How do we ensure our knowledge as a species/society is not lost in an unintelligible format?")


If you're interested, they update the calculation for printed pages here: https://en.wikipedia.org/wiki/Wikipedia:Size_in_volumes. I think the estimate is for 1.9 million pages

Also, there was another page along the lines of what you're thinking: https://en.wikipedia.org/wiki/Wikipedia:Terminal_Event_Manag.... Although it was an April Fool's day joke, it does give some idea of what's involved.


Well practically speaking you could just use images and then have the first book as a sort of pictionary with words and pictures that demonstrate them. Once they figure out enough words they can start to figure out other words that can't easily be explained by a single picture, and grammar can be worked out from there. Sort of like the Rosetta stone.

But this is useless to the post-apocalyptic hunter gatherer, civilization would have to be reestablished by then. And hopefully they at least understand the idea of writing words on paper and don't just worship it as a religious thing.

There has actually been work on doing that, attempting to mark radioactive waste dumps in a way that even someone from a completely different culture in a distant future could understand. It's really interesting and there is a PDF on it here: http://prod.sandia.gov/techlib/access-control.cgi/1992/92138...

If you are going to communicate with a completely alien civilization, that is one that can't even understand images, expressing information is even more difficult. My best guess is that you send a message with a really obvious pattern to it, then use that pattern as a basis for sending more information.

For example, send a ton of examples of simple code in a simple programming language, and their output. They can figure out what it means. Then encode your messages in the programming language somehow. Send a simulation of 3 dimensional space and little objects in it interacting, for example.

Somewhat related: http://lesswrong.com/lw/qk/that_alien_message/


A movie might work.

It could possibly be made future-proof by using a flipbook format, assuming we can find a suitably long-lived material to print on.


An animation of what? A flip book would be much more difficult and take far more space, and it doesn't contain much more information than a few pictures, let alone a book full of pictures.


I was thinking an actual film, which wouldn't be difficult.

I kept thinking about the "how to tell future people about radioactivity" example and I think it's hard to convey the actual effects, while it may be easy enough to convey that it's dangerous.

A movie would be much more apt at explaining what was there and what the consequences of irradiation are than a few stills.


Once civilization recovers enough they should be able to reconstruct the language from just the corpus, a la Linear B.

For less advanced future generations, given the lack of a universal language I think the best you can do is store copies in many languages and/or translation dictionaries, and hope that at least one of them survives. http://rosettaproject.org/disk/concept/ is an interesting take on this.


It would be interesting if it kept a version history between dump files. WP often deletes articles that it finds non-notable (these are excluded from dumps), and it would be nice to still retain the last version.

In an odd way, we're going back to the days when encyclopedias came on CDs. There was Encarta on a single disc; now we can have a lot more in around 80 GB ( http://xowa.sourceforge.net/requirements.html )


This is an interesting idea. When Wikimedia finalizes an incremental backup solution, it may be possible. They'll release a dump with incremental additions / updates / deletions. You would then have XOWA accept the additions / updates, but ignore all the deletions.

It would place more responsibility on the user to maintain their copy of the dump though.
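The keep-the-deletions idea could look something like this sketch. The (op, title, text) change format is invented here for illustration, since the real incremental dump format wasn't finalized at the time:

```python
def apply_changes(pages, changes, keep_deleted=True):
    """Apply a batch of dump changes to a local page store (a dict
    mapping title -> wikitext).

    Each change is a made-up (op, title, text) tuple where op is
    'add', 'update', or 'delete'. With keep_deleted=True, upstream
    deletions are ignored, so locally retained articles survive.
    """
    for op, title, text in changes:
        if op in ("add", "update"):
            pages[title] = text
        elif op == "delete" and not keep_deleted:
            pages.pop(title, None)
    return pages
```

The trade-off mentioned above shows up directly in the `keep_deleted` flag: the user's copy diverges from upstream a little more with every ignored deletion.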


That's true. The burden then shifts to the user. But in a way, that's also good because then you can choose which snapshot to follow.

It's a bit like maintaining your copy of an OS. You can stick to the "stable" branch or, if you're feeling adventurous, you can switch to "release". If you're really into the bleeding edge, you can go with the "nightly" build.

All-in-all, I really like this.

One concern I have is the possible increased bandwidth load for WP. Maybe you can include a small icon or notification to support it by donations. Couldn't hurt to have one there for yourself as well.


Source control is an interesting analogy. In the same vein, when a user syncs their version with the main branch, there will be hundreds of thousands of changes to review. It'll be pretty harrowing for anyone to figure out what to keep / reject. Just something to consider.

Anyway, thanks for the food for thought as well as your suggestion. I added donation links for archive.org and wikipedia tonight.


What's the main target market for this kind of thing?


Perhaps not the main market, but prisons.

Since March 2013, inmates of the Bellevue prison in Gorgier, Switzerland can request access to an uncensored offline copy of the French Wikipedia, based on the Kiwix software. Internet access is severely restricted for the prisoners, most of whom serve long-term sentences [1]

[1] http://meta.wikimedia.org/wiki/Wikimedia_Highlights,_June_20...


I personally use it for traveling. There are a few other applications as well:

- low-bandwidth availability, particularly in less-developed regions of the world

- censorship evasion

- security concerns. some users want to access Wikipedia without exposing their machine to the internet

There are probably a few others I'm missing....


"censorship evasion..."

Unfortunately, having your own copy of Wikipedia could also be used to enable censorship. For example, a fundamentalist school could have their own version of Wikipedia from which they've purged all articles about evolution, etc. Then they could configure their firewall to block the real Wikipedia.


Agreed. However, I think it would be less work for them to block access through firewall policy than to remove articles from XOWA.

By and large, for most private individuals, an offline app would allow them to evade censorship. I'd hope that this benefit outweighs the risk of the other's abuse.


A firewall would not hide the fact that censoring takes place. You would have to rewrite content to do that. That might be easier in batch, especially if you are going to use NLP to make the cut-up sentences grammatical.


Ahh.... That's pretty devious. I was thinking of blocking the entire article, not rewriting content. Still not worth the work IMHO, but who knows what censors would do.


One interesting case I can think of is for institutions which for some reason or another want complete control over article creation/deletion/modification but without the need for recreating the entire knowledge database from scratch.


Schools without internet access; unfortunately, there are still millions of those.


Machine Learning


People living in very remote areas, with no phone/net access like some outback parts of Australia.


Collectors.


Tech Camps like railscamps.org which often don't have internet access.


Zombie invasion fanatics, for when the zombies invade you still have an archive of knowledge.


This is only part of why my local file server has a recent wikipedia dump.


Accidental time travel?


Accidental time travel of my entire house, yes. I really should copy it over to my laptop, as that's a fair bit more likely to be useful if I end up in the past without a source of electricity.


This makes me wonder why we couldn't have a DVCS with encyclopaedia content in it. That would be easy to "pull" offline and update regularly and "push" back up with changes. It'd be easier to distribute content and version it as well. Oh and patch queues could be used to review and edit content.

A local HTTP service or desktop app, DVCS and indexer would do a fine job of this.


I think it's just not feasible/useful for most people. The latest copy of Wikipedia is 42GB, and that's not including images or earlier revisions.


To be honest, a lot of wikipedia is crap.

An abridged version would be a couple of gigabytes, perhaps, which isn't beyond the realm of possibility. That'd fit nicely on a smartphone/tablet and could be taken somewhere with less-than-adequate data connections (read: most places on this planet).


Good luck with getting the editors to agree which parts are crap.


You can get the page view logs. You can filter out most of the garbage that way.

http://dumps.wikimedia.org/other/pagecounts-raw/


Justin Bieber: 594,757 views in the last month.

Origin of birds: 8,506 in the last month.

Ogden L. Mills (secretary of the US Treasury under Herbert Hoover): 399 views in the last month.

What is in the public interest is not the same as what the public show an interest in. Page views won't necessarily help you filter Wikipedia...


With filtering, I was thinking about discarding everything with fewer than 10 views. I consider all of your examples relevant. Getting rid of pages like "Wikipeida_suxxxs" is the first step.
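The pagecounts-raw files are plain space-separated lines of `project title views bytes`, so a threshold filter is a short script. A sketch (`popular_titles` and its defaults are my own; real runs would aggregate many hourly files, which this skips):

```python
def popular_titles(pagecount_lines, project="en", min_views=10):
    """Filter pagecounts-raw lines down to titles above a view threshold.

    Each line has the form 'project title views bytes'. Keeps titles
    from one project with at least min_views views; malformed lines
    are skipped.
    """
    keep = []
    for line in pagecount_lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue                      # skip malformed lines
        proj, title, views, _bytes = parts
        if proj == project and int(views) >= min_views:
            keep.append(title)
    return keep
```

In practice you'd sum counts across many hourly files before applying the cutoff, since a single hour's snapshot is noisy.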


If those pages exist, it would be awesome if you could notify admins and we'll delete them.

https://en.wikipedia.org/wiki/Wikipedia:CSD

Just edit the page and {{db-nonsense}} {{db-test}} or {{db-vandalism}} as appropriate. :)


For a while I had a 300MB text-only Wikipedia on my (Symbian) phone. Found most of the stuff I looked up. For example, I remember browsing some cocktail recipes.


Text-only versions, even straight from dumps, have had some bad formatting and truncation problems over the years, can't wait to try this. And here's a use case: https://news.ycombinator.com/item?id=6676661


We demand e-ink version + "DON'T PANIC" in large, friendly letters on the cover.

Thanks. :)


http://kmkeen.com/tmp/wikireader1.jpg

Though it is a reflective LCD and text-only.



