Wikifier

Wikifier is a Perl script that transforms (X)HTML pages into wiki text.

Wikifier is based on regular expressions (for more information on writing regexes in Perl, see Appendix C). The advantage is that it handles ill-formed (X)HTML more easily than a parser-based solution. The downside is that nested constructs, such as lists inside lists, are more difficult to handle.
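The nesting problem can be seen in a minimal sketch. The sample markup and variable names below are made up for illustration and are not part of Wikifier; the pattern only imitates the non-greedy style of the list rules.

```perl
use strict;
use warnings;

# A nested list: the inner <ul> closes before the outer one does.
my $html = '<ul><li>a<ul><li>b</li></ul></li><li>c</li></ul>';

# A non-greedy match stops at the FIRST </ul>, which belongs to the
# inner list, so the captured "outer list" is cut short.
my ($captured) = $html =~ m{<ul>(.*?)</ul>}s;

print "$captured\n";
```

The capture ends inside the outer list (`<li>a<ul><li>b</li>`), which is why a purely regex-based approach needs extra care for nested lists.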

The program consists of two parts: Wikifier, the main regex execution engine, and WikifierRules.pm, a collection of regex rules that describe how to transform a given document.
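How the two parts work together can be sketched in a few lines. The two rules below are simplified stand-ins written for this example (the real rules in Appendix A are more permissive), but the [title, regex, replacement] layout and the `s///ieegm` substitution follow the listing.

```perl
use strict;
use warnings;

# Simplified, hypothetical rule table in the [title, regex, replacement]
# layout. The replacement is Perl source text, not a plain string.
my @rules = (
    [ 'Rewrite bold tags',   '<b>(.*?)</b>', q{"'''$1'''"} ],
    [ 'Rewrite italic tags', '<i>(.*?)</i>', q{"''$1''"}   ],
);

my $text = 'A <b>bold</b> and <I>italic</I> word.';

for my $rule (@rules) {
    my ($title, $regex, $replacement) = @{$rule};
    # /ee: the first evaluation yields the replacement source text, the
    # second evaluation runs it with $1 bound to the current match.
    $text =~ s/$regex/$replacement/ieegm;
}

print "$text\n";
```

This prints `A '''bold''' and ''italic'' word.`; the `/i` flag is what lets the upper-case `<I>` tag match.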

Get the Code

You can either retrieve the source code from http://n.ethz.ch/student/awuest/projects/wikifier/, or you can copy and paste it from Appendix A below.

Install

  1. Wikifier, the regex execution engine.
    1. Either download or copy the code and save it in a file called Wikifier.
    2. Make the program executable by issuing $ chmod u+x Wikifier.
  2. WikifierRules.pm, the file containing the rules.
    • Either download or copy the code and save it in a file called WikifierRules.pm, in the same directory as the Wikifier file.

Run

Options

Wikifier has several options:

  • -h prints usage information and exits
  • -v prints version information and exits
  • -l lists all available rules and exits
  • -f name of input file
  • -i comma separated list of rules to explicitly include
  • -e comma separated list of rules to explicitly exclude
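The way -i and -e combine can be sketched as follows. The helper name `apply_vector` is hypothetical; the logic mirrors the option handling in Appendix A, where exclusion overrides inclusion.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a 0/1 list saying which of $total rules run.
sub apply_vector {
    my ($total, $include, $exclude) = @_;
    # With -i, start with every rule off; otherwise start with every rule on.
    my @apply = ($include ? 0 : 1) x $total;
    if ($include) { $apply[$_ - 1] = 1 for split /,/, $include; }
    # Exclusions are applied last, so they override inclusions.
    if ($exclude) { $apply[$_ - 1] = 0 for split /,/, $exclude; }
    return @apply;
}

my $both    = join '', apply_vector(5, '1,2,3', '2');
my $exclude = join '', apply_vector(5, undef, '4');

print "$both\n$exclude\n";
```

With five rules, including 1,2,3 and excluding 2 gives `10100`, while excluding only rule 4 gives `11101`.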

Example Invocations

  • To apply all existing rules to a document called SomeDocument.html, execute
    $ ./Wikifier -f SomeDocument.html

    The transformed text is written to stdout. To store the output in a file, use simple shell redirection

    $ ./Wikifier -f SomeDocument.html > SomeDocument.wiki
  • To only apply rules 10 and 21, use
    $ ./Wikifier -f SomeDocument.html -i10,21
  • To apply all available rules except 10 and 21, use
    $ ./Wikifier -f SomeDocument.html -e10,21
  • To list all available rules, use
    $ ./Wikifier -l

Bugs and Improvements

If you find a bug or have a useful improvement (new functionality in the main engine, new regex rules, etc.), either change the code directly inside this article or send me an email. If you edit the code, please increase the version number of the file you edit by 1 after the dot and document your changes in the changelog in the file header.

Appendix A - Source Code

Wikifier

#!/usr/bin/perl -w

# Wikifier version 1.1 2005-11-29
# written by Andreas Wuest (awuest@gmail.com)
#
# Wikifier is a utility to transform XHTML compliant
# documents to wiki text.
#
#
# The latest version of Wikifier can be retrieved from
# http://n.ethz.ch/student/awuest/projects/wikifier/
#
#
# This script is free software, distributed under the GPL
# http://www.gnu.org/licenses/gpl.txt
#
#
# Changelog:
#
# 2005-11-29 - version 1.1
#       - regexes are now case-insensitive
#
# 2005-11-28 - version 1.0
# 	- initial release

# This file contains the main wikification engine. The
# rules on how the transformation should take place are
# contained within the WikifierRules.pm module file.

package main;

use strict;
use warnings;
use diagnostics;

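# Look for WikifierRules.pm next to this script; Perl 5.26 and later
# no longer search the current directory by default.
use FindBin;
use lib $FindBin::Bin;
use WikifierRules;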

use Getopt::Std;

use constant VERSION          => '1.1';
use constant YEAROFLASTCHANGE => '2005';


my $src_file;
my @apply_vec;
my $content;

($src_file, @apply_vec) = parse_opts();
$content = internalise_file($src_file);
$content = apply_rules(\@apply_vec, $content);

print STDERR "\n============= Wikified content =============\n";
print $content;

exit 0;


sub parse_opts {
    my $src_file = '';
    my @incl_rule;
    my @excl_rule;
    my @apply_vec;
    my %opts;

    getopts("hvlf:i:e:", \%opts);

    if ($opts{'h'}) {
	usage();
	exit 0;
    }

    if ($opts{'v'}) {
	version();
	exit 0;
    }

    if ($opts{'l'}) {
	list_rules();
	exit 0;
    }

    if ($opts{'f'}) {
	$src_file = $opts{'f'};
    } else {
	print STDERR "ERROR: no input file specified.\n\n";
	usage();
	exit 1;
    }

    if ($opts{i}) {
	@incl_rule = split(/,/, $opts{i});
	# exclude all rules
	init_apply_vector(\@apply_vec, scalar(@WikifierRules::subs_rules), 0);
        # include specific rules
	switch_rules(\@apply_vec, \@incl_rule, 1, scalar(@WikifierRules::subs_rules));
    } else {
	# include all rules
	init_apply_vector(\@apply_vec, scalar(@WikifierRules::subs_rules), 1);
    }
    # exclusion overrides inclusion
    if ($opts{e}) {
	@excl_rule = split(/,/, $opts{e});
        # exclude specific rules
	switch_rules(\@apply_vec, \@excl_rule, 0, scalar(@WikifierRules::subs_rules));
    }

    return ($src_file, @apply_vec);
}

sub init_apply_vector {
    my ($apply_vec, $no_entries, $switch) = @_;

    foreach my $i (0 .. $no_entries - 1) {
	    $apply_vec->[$i] = $switch;
    }
}

sub switch_rules {
    my ($apply_vec, $switch_vec, $switch, $max_entries) = @_;

    # include or exclude specific rules
    foreach my $i (@{$switch_vec}) {
	# rule identifiers are 1-based integers; reject anything else
	if ($i =~ /^[0-9]+$/ && $i >= 1 && $i <= $max_entries) {
	    $apply_vec->[$i - 1] = $switch;
	} else {
	    if ($switch) {
		print STDERR 'ERROR: inclusion';
	    } else {
		print STDERR 'ERROR: exclusion';
	    }
	    print STDERR " of rule ${i}. Rule does not exist. Aborting.\n";
	    exit 1;
	}
    }
}

sub internalise_file {
    my $src_file = shift;

    my $content  = '';

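    # abort with a clear message if the input file cannot be read
    open(FHANDLE, '<', $src_file)
	or die "ERROR: cannot open '$src_file': $!\n";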
    while (<FHANDLE>) {
	$content .= $_;
    }
    close(FHANDLE);

    return $content;
}

sub apply_rules {
    my ($apply_vec, $content) = @_;

    my $pad = count_digits(@{$apply_vec} + 1);

    foreach my $i (0 .. $#{$apply_vec}) {
	if ($apply_vec->[$i]) {
	    print STDERR 'Applying rule ' . sprintf('%0*d', $pad, $i + 1) . ' ("' . $WikifierRules::subs_rules[$i]->[0] . '"): replace "' . $WikifierRules::subs_rules[$i]->[1] . '" by "' . $WikifierRules::subs_rules[$i]->[2] . '"...';
	    # /ee evaluates the stored replacement string as Perl code for each match
	    $content =~ s/$WikifierRules::subs_rules[$i]->[1]/$WikifierRules::subs_rules[$i]->[2]/ieegm;
	    print STDERR " done.\n";
	}
    }

    return $content;
}

sub list_rules {
    my $pad = count_digits(@WikifierRules::subs_rules + 1);

    print 'Listing available rules (' . @WikifierRules::subs_rules . " total):\n";
    foreach my $i (0 .. $#WikifierRules::subs_rules) {
	print sprintf('%0*d', $pad, $i + 1) . ': ' . $WikifierRules::subs_rules[$i]->[0] . '. Replace "' . $WikifierRules::subs_rules[$i]->[1] . '" by "' . $WikifierRules::subs_rules[$i]->[2] . "\"\n";
    }
}

sub count_digits {
    my $number = shift;
    return $number =~ tr/0-9//;
}

sub usage {
    print STDERR "Wikifier [-h] [-v] [-l] -f inputfile [-i a,b,c,...] [-e a,b,c,...]\n";
    print STDERR "    -h prints this message and exits\n";
    print STDERR "    -v prints version information and exits\n";
    print STDERR "    -l lists all rules and exits\n";
    print STDERR "    -f name of file to wikify\n";
    print STDERR "    -i comma separated list of integer rule\n";
    print STDERR "       identifiers to include\n";
    print STDERR "    -e comma separated list of integer rule\n";
    print STDERR "       identifiers to exclude\n";
}

sub version {
    print "Wikifier version " . VERSION . ". Rule file version " . RULES_VERSION . "\n\n";
    print "Copyright (C) " . YEAROFLASTCHANGE . " Andreas Wuest (awuest\@gmail.com)\n\n";
    print "Wikifier comes with NO WARRANTY, to the extent permitted by law.\n";
    print "You may redistribute copies of Wikifier under the terms of\n";
    print "the GNU General Public License. For more information about these\n";
    print "matters, see http://www.gnu.org/licenses/gpl.txt.\n";
}

WikifierRules.pm

<nowiki>
# WikifierRules version 1.3 2005-12-01
# written by Andreas Wuest (awuest@gmail.com)
#
# WikifierRules is an accompanying module to Wikifier,
# a utility to transform HTML compliant documents
# to wiki text.
#
#
# The latest version of Wikifier can be retrieved from
# http://n.ethz.ch/student/awuest/projects/wikifier/
#
#
# This script is free software, distributed under the GPL
# http://www.gnu.org/licenses/gpl.txt
#
#
# Changelog:
#
# 2005-12-01 - version 1.3
#       - added "Rewrite italic tags" rule
#       - added "Rewrite bold tags" rule
#       - unified table environment removal rules
#       - simplified some rules
#       - improved list entry rewriting for malformed syntax
#       - added "Rewrite distinguished <code></code>" rule
#
# 2005-11-30 - version 1.2
#       - added "Add originaldocinfo block" rule
#
# 2005-11-29 - version 1.1
#       - "rewrite external links" rule now also accepts
#         line breaks in the link text
#       - "remove anchors" rule fixed to remove multiline anchors
#       - added "insert blank line before headings" rule
#       - extended heading rewriting
#       - improved overall tag recognition 
#
# 2005-11-28 - version 1.0
# 	- initial release

# This file contains the rules on how the HTML document
# should be transformed.

package WikifierRules;

use strict;
use warnings;
use diagnostics;

our (@EXPORT);
use Exporter qw(import);
@EXPORT = qw(RULES_VERSION RULES_YEAROFLASTCHANGE TAB_WIDTH @subs_rules replace_list_entries normalise_whitespace format_pre);

use constant RULES_VERSION          => '1.3';
use constant RULES_YEAROFLASTCHANGE => '2005';


use constant TAB_WIDTH => 4;

our @subs_rules = (
    # [
    #     'Rule title',
    #     'regexp', '"replacement"'
    # ]

    [
       'Normalise tabs (convert tabs to ' . TAB_WIDTH . ' spaces)',
       '\t', '" " x WikifierRules::TAB_WIDTH'
    ],

    [
        'Extract body',
	'(?:.|\s)*?<body>\s*((.|\s)*)\s*</body>(?:.|\s)*', '"$1"'
    ],

    [
        'Remove <p></p>',
	'<.?p(?:\s.*?)?>', '"\n"'
    ],

    [
        'Remove h1 headings',
	'<h1(?:\s.*?)?>.*?</h1>', '""'
    ],

    [
        'Rewrite headings (h2 -> =)',
	'<.?h2(?:\s.*?)?>', '"="'
    ],

    [
        'Rewrite headings (h3 -> ==)',
	'<.?h3(?:\s.*?)?>', '"=="'
    ],

    [
        'Rewrite headings (h4 -> ===)',
	'<.?h4(?:\s.*?)?>', '"==="'
    ],

    [
        'Rewrite headings (h5 -> ====)',
	'<.?h5(?:\s.*?)?>', '"===="'
    ],

    [
        'Remove section numbers in section headings',
	'=[0-9.]+?[ ]', '"="'
    ],

    [
        'Rewrite italic tags',
	'<i(?:\s.*?)?>\s*((.|\s)*?)\s*<\/i>', '"\'\'$1\'\'"'
    ],

    [
        'Rewrite bold tags',
	'<b(?:\s.*?)?>\s*((.|\s)*?)\s*<\/b>', '"\'\'\'$1\'\'\'"'
    ],

    [
        'Remove anchors',
	'<a name=.+?>(.|\s)*?</a>', '""'
    ],

    [
        'Rewrite image links',
	'<img\s.*?src="(.+?)".*?>', '"[[Image:$1]]"'
    ],

    [
        'Rewrite external links',
	'<a href="(.+?)">((?:.|\s)+?)<\/a>', '"[$1 " .  WikifierRules::normalise_whitespace($2) . "]"'
    ],

    [
        'Remove <blockquote></blockquote>',
	'<blockquote>\s*((.|\s)*?)\s*<\/blockquote>', '"$1"'
    ],

    [
        'Remove table environments',
	'<table(?:\s.*?)?>\s*?<tr>\s*?<td>\s*((.|\s)*?)\s*<\/td>\s*?<\/tr>\s*?<\/table>', '"$1"'
    ],

    [
        'Replace unordered (itemised) lists',
	'<ul(?:\s.*?)?>\s*((.|\s)*?)\s*<\/ul>', 'WikifierRules::replace_list_entries("*", $1)'
    ],

    [
        'Replace ordered lists',
	'<ol(?:\s.*?)?>\s*((.|\s)*?)\s*<\/ol>', 'WikifierRules::replace_list_entries("#", $1)'
    ],

    [
        'Remove leading whitespace',
        '^.*?(<pre(?:\s.*?)?>(?:(?:.|\s)*?)<\/pre>)|^[\t\f ]+(.*)$', '"$+"'
    ],

    [
        'Remove blank lines between headings and the first paragraph',
	'^(=.*?=)\n+', '"$1\n"'
    ],

    [
        'Remove <div></div>',
	'<div(?:\s.*?)?>\n*((.|\s)*?)\n*<\/div>', '"$1"'
    ],

    [
        'Remove trailing whitespace',
	'[\t\f ]+$', '""'
    ],

    [
        'Insert blank line before headings',
	'^(=.+?=)$', '"\n$1"'
    ],

    [
        'Remove multiple continuous blank lines',
	'^\n\n+', '"\n"'
    ],

    [
        'Remove <pre></pre>',
        '<pre(?:\s.*?)?>\n*((?:.|\s)*?)\n*<\/pre>', 'WikifierRules::format_pre($1)'
    ],

    [
        'Rewrite distinguished <code></code>',
        '\n\n<code(?:\s.*?)?>\n*((?:.|\s)*?)\n*<\/code>\n\n', ' $1'
    ],

    [
        'Add breadcrumbs',
        '\A((.|\s)*)\Z', '"

Document Tags and Contributors

Contributors to this page: Andreas Wuest, newacct
Last updated by: newacct,