sbober/levitation-perl: Perl port of scy's levitation
This is a Perl port of scy's levitation. It reads MediaWiki dump files
revision by revision and writes a data stream to stdout suitable for
git fast-import.
The first 1000 pages of the German Wikipedia, with all their revisions
(about 390000), can be dumped in about 15 minutes on moderate
hardware.
Dependencies
------------
You need at least Perl 5.10, and the Perl interpreter has to be compiled
with thread support.
You also need a working C compiler for the inline SHA1 C function.
Currently this _must_ be gcc 4.3 callable as 'gcc-4.3'. This will be
fixed soon.
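The three requirements above can be checked before installing anything else. This is only a sketch: the function name check_levitation_env is made up for illustration and is not part of levitation-perl. It relies on Perl's standard Config module, whose useithreads entry is set when the interpreter was built with threads.

```shell
# Hypothetical helper, not shipped with levitation-perl.
# Prints one line per requirement and changes nothing on the system.
check_levitation_env() {
    # 'use 5.010' fails on any perl older than 5.10
    if perl -e 'use 5.010;' 2>/dev/null; then
        echo "perl version: ok"
    else
        echo "perl version: too old or perl missing"
    fi
    # $Config{useithreads} is only set for a threaded perl
    if perl -MConfig -e 'exit($Config{useithreads} ? 0 : 1)' 2>/dev/null; then
        echo "perl threads: ok"
    else
        echo "perl threads: missing"
    fi
    # the compiler has to be reachable under exactly this name
    if command -v gcc-4.3 >/dev/null 2>&1; then
        echo "gcc-4.3: ok"
    else
        echo "gcc-4.3: not found"
    fi
}
check_levitation_env
```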
You need the following modules and their dependencies from CPAN:
- Regexp::Common
- Inline
- JSON::XS
- Compress::Raw::Zlib
- Carp::Assert
- CDB_File
- XML::Bare >= 0.44
- Deep::Hash::Utils
Some Linux distributions already package the first five modules.
Under Debian / Ubuntu the following command should install them:
    sudo apt-get install libregexp-common-perl \
        libinline-perl libjson-xs-perl \
        libcompress-raw-zlib-perl libcarp-assert-perl
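To see which modules from the list are still missing, each one can be loaded with perl in turn. This is a sketch, and the function name check_modules is made up for illustration. It uses require so that a module's import routine is not run, and the standard VERSION method to test the XML::Bare minimum version.

```shell
# Hypothetical helper, not shipped with levitation-perl.
check_modules() {
    for mod in Regexp::Common Inline JSON::XS Compress::Raw::Zlib \
               Carp::Assert CDB_File XML::Bare Deep::Hash::Utils; do
        # require only loads the module, it does not call import
        if perl -e "require $mod;" 2>/dev/null; then
            echo "ok       $mod"
        else
            echo "MISSING  $mod"
        fi
    done
    # VERSION(0.44) dies if the installed XML::Bare is older than 0.44
    if perl -e 'require XML::Bare; XML::Bare->VERSION(0.44);' 2>/dev/null; then
        echo "version  XML::Bare >= 0.44"
    else
        echo "version  XML::Bare too old or missing"
    fi
}
check_modules
```

Anything reported as MISSING has to be installed from the distribution's packages or from CPAN.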
Usage
-----
First, initialize a git repository:
    cd /tmp
    mkdir blawiki
    cd blawiki
    git init
Then, "levitate". This is a three-step process:
    cat /path/to/blawiki-dump.xml | /path/to/levitation-perl/step1.pl
    LC_ALL=C sort rev-table.txt > rev-sorted.txt
    /path/to/levitation-perl/step2.pl | /path/to/levitation-perl/gfi.pl
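The LC_ALL=C prefix in the second step makes sort compare raw bytes instead of applying the collation rules of the current locale, so the result is the same on every machine. Whether step2.pl strictly depends on this ordering is an assumption here. A small illustration with made-up input rather than real rev-table.txt contents:

```shell
# Byte-wise ordering: uppercase ASCII letters sort before lowercase ones.
printf 'b\nA\na\nB\n' | LC_ALL=C sort
# prints A, B, a, b, one per line
```

Without the prefix, many locales would interleave upper and lower case instead.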
Alternatively, you can just change to an empty directory and call the
"levitate" helper script with a path to a dump as parameter (may be
7z, bz2, gz or xml):
    mkdir /tmp/blawiki
    cd /tmp/blawiki
    /path/to/levitation-perl/levitate /path/to/blawiki-dump...
Lots of progress information is printed to standard error, so it may be
best to redirect that to a file.
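Redirecting standard error works the same way as for any other command. The sketch below uses a made-up stand-in function, fake_levitate, instead of the real script so that it can be run anywhere. The log file name levitate.log is an arbitrary choice.

```shell
# Stand-in for the real tool (hypothetical): data on stdout, progress on stderr.
fake_levitate() {
    echo "progress: 1 page" >&2
    echo "data"
}

# 2> sends only standard error to the log file; stdout is left alone.
fake_levitate 2> levitate.log
```

With the helper script, the same redirection would be appended to the call shown above: /path/to/levitation-perl/levitate /path/to/blawiki-dump... 2> levitate.log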
Have fun.