
Migrating DokuWiki to WordPress

A few hours ago, I posted about limiting the list of “parent” pages in the WordPress “Page Attributes” meta box. In that post, I mentioned that I plan to automate the conversion of the DokuWiki entries in the existing dictionary into pages in this WordPress installation.

One of the first search results on migrating DokuWiki entries to WordPress pages is Beau Lebens’ blog entry.

I modified the code that he posted on his website to fit my situation, in which I could simply hard-code the page ID of the parent (upper-level) page.

What lives at https://attyv.com/law-dict/ is now powered by WordPress. The list of entries starting with ‘A’ can be found at https://attyv.com/law-dict/a/, the list of entries starting with ‘B’ is at https://attyv.com/law-dict/b/, and so on.

Note that the Isles Dictionary of Philippine Law will perpetually be a work in progress; I am not following a predetermined plan for adding entries to it.

In DokuWiki, each index page produced its list using the following markup (note that it uses the catlist plugin):

====== Terms starting with letter 'Z' ======

The following terms start with letter 'Z'. Click on one of the links below to see the definition:

<catlist -noHead -sortAscending -sortByTitle -columns:1 -excludeOnName -exclupage:"^[a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y]" -exclupage:"^z$">
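The two exclusion patterns above can be checked outside the wiki. The following is a minimal sketch, assuming catlist applies each `-exclupage` value as a regular expression to the page name; the helper name `stays_in_z_list` and the sample page names are hypothetical. Note that inside a character class the `|` is a literal character, so the first pattern behaves like `^[a-y]` (plus a leading `|`).

```php
<?php
// Hypothetical helper: returns true when a page name survives both
// exclusion patterns used on the 'Z' index page.
function stays_in_z_list( string $name ): bool {
    $exclusions = array(
        // Names starting with a to y (or a literal "|", since "|" is not
        // an alternation operator inside a character class).
        '/^[a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y]/',
        // The index page itself, which is named exactly "z".
        '/^z$/',
    );

    foreach ( $exclusions as $pattern ) {
        if ( preg_match( $pattern, $name ) ) {
            return false;
        }
    }
    return true;
}

var_dump( stays_in_z_list( 'zoning' ) ); // bool(true)
var_dump( stays_in_z_list( 'z' ) );      // bool(false)
var_dump( stays_in_z_list( 'alibi' ) );  // bool(false)
```

Only names that start with ‘z’ and are longer than the bare index name remain in the list.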

Here’s the code that converts the DokuWiki entries starting with ‘Z’ into WordPress pages:

<?php
/**
* Super rough and tumble WordPress import script for Dokuwiki.
* Based on a very old DW install that was using the default theme. Probably won't work for anything else.
* You will want to change some things if your wiki is installed anywhere other than /wiki/.
* Also check out the wp_insert_post() stuff to see if you want to change it.
*/
 
require 'wp-load.php';
require_once ABSPATH . 'wp-admin/includes/post.php';
 
// List of Index URLs (one for each namespace is required)
// These will be crawled, all pages will be listed out, then crawled and imported
$indexes = array(
    'https://attyv.com/law-dict/z',
);

$page_parent = 2329; // Hard-coded ID of the WordPress page that becomes the parent of every imported page
 
$author = 1; // The user_ID of the author to create pages as
 
function dokuwiki_link_fix( $matches ) {
    return '<a href="/' . str_replace( '_', '-', $matches[1] ) . '" class="wikilink1"';
}
 
$imported_urls = array(); // Stuff we've already processed
 
$created = 0;
foreach ( $indexes as $index ) {
    echo "Crawling $index for page links...<br/>";
    $i = file_get_contents( $index );
 
    if ( !$i )
        die( "Could not download $index\n" );
 
    // Get index page and parse it for links
    //preg_match( '!<ul class="idx">(.*)</ul>!sUi', $i, $matches );
    if ( !preg_match( '!<ul style="-webkit-column-count:(.*)</ul>!sUi', $i, $matches ) )
        die( "Could not find the list of entries in $index\n" );
    preg_match_all( '!<a href="([^"]+)" class="wikilink1"!i', $matches[0], $matches );
 
    // Build the scheme and host part, used to turn root-relative links into full URLs
    $bits = parse_url( $index );
    $base = $bits['scheme'] . '://' . $bits['host'];
 
    // Now we have a list of root-relative URLs, lets start grabbing them
    foreach ( $matches[1] as $slug ) {
        $url = $page = $raw = '';
 
        if ( in_array( $slug, $imported_urls ) )
            continue;
        $imported_urls[] = $slug; // Even if it fails, we've tried once, don't bother again
 
        // The full URL we're getting
        $url = $base . $slug;
        echo "  Importing content from $url...<br/>";
 
        // Get it
        $raw = file_get_contents( $url );
        if ( !$raw )
            continue;
 
        // Parse it -- dokuwiki conveniently HTML-comments where it's outputting content for us
        preg_match( '#<!-- wikipage start -->(.*)<!-- wikipage stop -->#sUi', $raw, $matches );
        if ( !$matches )
            continue;
 
        $page = $matches[1];
 
        // Need to clean things up a bit:
        // Remove the table of contents
        $page = preg_replace( '#<div class="toc">.*</div>\s*</div>#sUi', '', $page );
 
        // Strip out the Edit buttons/forms
        $page = preg_replace( '#<div class="secedit">.*</div></form></div>#sUi', '', $page );
 
        // Fix internal links by making them root-relative
        $page = preg_replace_callback(
            '#<a href="/wiki/([^"]+)" class="wikilink1"#si',
            'dokuwiki_link_fix',
            $page
        );
 
        // Grab a page title -- first h1 or convert the slug
        if ( preg_match( '#<h1.*</h1>#sUi', $page, $matches ) ) {
            $page_title = strip_tags( $matches[0] );
            $page = str_replace( $matches[0], '', $page ); // Strip it out of the page, since it'll be rendered separately
        }
        elseif ( preg_match( '#<h2.*</h2>#sUi', $page, $matches ) ) {
            $page_title = strip_tags( $matches[0] );
            $page = str_replace( $matches[0], '', $page ); // Strip it out of the page, since it'll be rendered separately
        } else {
            $page_title = str_replace( '/law-dict/', '', $slug );
            $page_title = ucwords( str_replace( '_', ' ', $page_title ) );
        }
        
        // Get last modified from raw content; fall back to the current time if it is missing
        if ( preg_match( '#Last modified: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})#i', $raw, $matches ) )
            $last_modified = $matches[1];
        else
            $last_modified = current_time( 'mysql' );
 
        // Strip the dictionary prefix and switch underscores to hyphens for the WordPress slug
        $slug = str_replace( '/law-dict/', '', $slug );
        $slug = str_replace( '_', '-', $slug );

 
        $post = array(
            'post_status'   => 'publish',
            'post_type'     => 'page',
            'post_author'   => $author,
            'post_parent'   => $page_parent,
            'post_content'  => $page,
            'post_title'    => $page_title,
            'post_modified' => $last_modified,
            'post_name'     => $slug, // Already hyphenated above
        );
 
        // wp_insert_post() returns 0 on failure, so only count pages that were actually created
        if ( wp_insert_post( $post ) )
            $created++;
    }
}
 
echo "\nDone! Created $created pages in WordPress, based on your Dokuwiki install.\n";
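The title and slug handling in the script can be tried without a WordPress installation. The following is a minimal sketch of that logic as standalone functions. The function names, the `$prefix` parameter, the sample slug `/law-dict/zoning_ordinance`, and the sample HTML are all hypothetical, and the `trim()` call is an addition that the script above does not make.

```php
<?php
// Hypothetical helper: fallback title built from a DokuWiki link,
// e.g. "/law-dict/zoning_ordinance" becomes "Zoning Ordinance".
function title_from_slug( string $slug, string $prefix = '/law-dict/' ): string {
    $name = str_replace( $prefix, '', $slug );
    return ucwords( str_replace( '_', ' ', $name ) );
}

// Hypothetical helper: WordPress-style slug (post_name) from the same link.
function post_name_from_slug( string $slug, string $prefix = '/law-dict/' ): string {
    return str_replace( '_', '-', str_replace( $prefix, '', $slug ) );
}

// Hypothetical helper: text of the first <h1>, else the first <h2>, else null.
function title_from_html( string $html ): ?string {
    foreach ( array( 'h1', 'h2' ) as $tag ) {
        // The U modifier makes .* lazy, so the match stops at the first closing tag.
        if ( preg_match( "#<$tag.*</$tag>#sUi", $html, $m ) ) {
            return trim( strip_tags( $m[0] ) ); // trim() is an assumption, not in the script
        }
    }
    return null;
}

$slug = '/law-dict/zoning_ordinance';
$html = '<h1 id="zoning_ordinance">Zoning Ordinance</h1><p>Sample definition.</p>';

echo title_from_slug( $slug ), "\n";     // Zoning Ordinance
echo post_name_from_slug( $slug ), "\n"; // zoning-ordinance
echo title_from_html( $html ), "\n";     // Zoning Ordinance
```

Keeping these steps in small functions makes it possible to check a handful of entries before running the import against the live site.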
