Dev/Design/Data dump format
The Wikimedia data dump files are released in compressed form, as either bzip2 or gzip. Prior to v0.5.2, XOWA required the files to be uncompressed before it could read them. As of v0.5.2, the user can choose to read directly from either the compressed file or the uncompressed file.
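As a rough illustration of the two read paths (not XOWA's actual Java implementation), the Python sketch below opens a dump either directly from a .bz2 file or from an already uncompressed .xml file. The helper names open_dump and count_pages are hypothetical and exist only for this example.

```python
import bz2
from pathlib import Path


def open_dump(path):
    """Open a dump file for text reading, decompressing on the fly if needed.

    Hypothetical helper for illustration; not part of XOWA.
    """
    path = Path(path)
    if path.suffix == ".bz2":
        # bz2.open streams the compressed file; nothing extra is written to disk
        return bz2.open(path, "rt", encoding="utf-8")
    # Anything else is assumed to be plain, uncompressed text
    return open(path, "r", encoding="utf-8")


def count_pages(path):
    """Count lines containing a <page> start tag, streaming line by line."""
    with open_dump(path) as f:
        return sum(1 for line in f if "<page>" in line)
```

Both paths yield the same text; only the cost differs, since the compressed path pays for decompression on every read while the uncompressed path pays once in disk space.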
bzip2: disk space vs speed
Currently, reading from a bzip2 file is much slower than unzipping and reading from the xml file.[1]
For example, using a 10 GB English Wikipedia dump file:
- unzip takes 120 minutes and 40 GB of extra disk space. This process includes unzipping to .xml with 7-zip (40 min, 40 GB) and then importing the wiki (80 min)
- bzip2 takes 330 minutes and 0 GB of extra disk space. This process includes reading directly from the .bz2 file (250 min, 0 GB) and importing the wiki (80 min)
If you have the extra disk space, use the unzip route, which is nearly three times faster. If you are low on disk space, use the bzip2 route instead.
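The trade-off can be written out as a small calculation. The sketch below uses only the figures quoted for the 10 GB English Wikipedia dump; the names ROUTES, total_minutes and pick_route are hypothetical, and the selection rule (take the fastest route whose extra disk requirement fits) is an assumption made for illustration. It considers only the extra space for the .xml file, not the space for the imported wiki itself.

```python
# Figures quoted for a 10 GB English Wikipedia dump file.
ROUTES = {
    "unzip": {"read_min": 40, "import_min": 80, "extra_gb": 40},
    "bzip2": {"read_min": 250, "import_min": 80, "extra_gb": 0},
}


def total_minutes(route):
    """Total time for a route: reading (or unzipping) plus importing."""
    r = ROUTES[route]
    return r["read_min"] + r["import_min"]


def pick_route(free_gb):
    """Pick the fastest route whose extra disk requirement fits in free_gb."""
    fitting = [name for name, r in ROUTES.items() if r["extra_gb"] <= free_gb]
    return min(fitting, key=total_minutes)
```

With these numbers, total_minutes gives 120 for the unzip route and 330 for the bzip2 route, so the unzip route is chosen whenever at least 40 GB is free.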
bzip2: Application install (GUI)
By default, the application install uses the unzip route.
To change it to the bzip2 route:
- Go to Options/Import
- Change Custom wiki commands to wiki.download,wiki.import
- Note: the key step is to remove wiki.unzip after wiki.download
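The edit to the command list amounts to dropping one step. A minimal sketch follows, assuming the setting is a comma-separated string; the function name drop_unzip is hypothetical, and the starting value used in the usage note is inferred from the note above rather than taken from XOWA's documented defaults.

```python
def drop_unzip(commands):
    """Return the comma-separated command list with the wiki.unzip step removed."""
    # Split on commas and ignore stray whitespace or empty entries
    steps = [s.strip() for s in commands.split(",") if s.strip()]
    return ",".join(s for s in steps if s != "wiki.unzip")
```

For example, drop_unzip("wiki.download,wiki.unzip,wiki.import") returns "wiki.download,wiki.import", which is the value to enter in the Custom wiki commands field.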
Command-line install
The core_init build step now has an extra property: src_bz2_fil_. A sample invocation would be
.add('simple.wikipedia.org', 'core.init').src_bz2_fil_('/home/download/simplewiki-latest-pages-articles.bz2').owner
Note that XOWA can also auto-detect the appropriate file. For example, given the directory /xowa/wiki/simple.wikipedia.org/:
- If a .bz2 file is there, it will use it
- If a .xml file is there, it will use it
- If both a .bz2 file and a .xml file are there, it will use the .xml file (since reading the .xml is faster)
- If neither is there, it will fail
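The detection rules above can be sketched as a short function. This is an illustration in Python, not XOWA's Java code: the name detect_dump_file is hypothetical, and picking the first file alphabetically when several match is an assumption, since the rules do not say what happens in that case.

```python
from pathlib import Path


def detect_dump_file(wiki_dir):
    """Pick the dump file to import from wiki_dir, preferring .xml over .bz2."""
    wiki_dir = Path(wiki_dir)
    xml_files = sorted(wiki_dir.glob("*.xml"))
    bz2_files = sorted(wiki_dir.glob("*.bz2"))
    if xml_files:
        # An uncompressed file wins, even if a .bz2 is also present
        return xml_files[0]
    if bz2_files:
        return bz2_files[0]
    # Neither kind of file is present: fail, as the rules describe
    raise FileNotFoundError(f"no .xml or .bz2 dump found in {wiki_dir}")
```

Checking for .xml first encodes the speed preference directly: the slower .bz2 path is taken only when no uncompressed file exists.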
gzip
Currently, gzip is only used for the /category2/ system.
- For application setup, .gz is always used (there is no unzipping)
- For CLI, either .gz or .sql can be used. Note that usage follows the same rules as described above for .bz2 / .xml.
References
- [1] This seems to be a result of Java's lack of support for an unsigned byte data type, as well as the other performance advantages of a native C/C++ application (7-zip on Windows; bzip2 on Linux).