Using PHP DOMDocument in a WordPress content filter, instead of regexps

It's been said that when you solve a software coding problem by adding a regexp (regular expression), you now have two problems. Basically, regular expressions are a cool idea that's really hard to get right, and then really hard to maintain, because it's easy to forget why you concocted that specific regular expression. It's better to avoid regexps, for code maintainability if nothing else, and find other ways to manipulate text. That's especially true for changing HTML or URL strings: both have such stringent formatting rules that it's better to use an HTML or URL parser to construct a data object.

At the moment I'm creating a WordPress plugin for manipulating external links in content, for example to add rel=nofollow or an icon to a link (see https://github.com/robogeek/wp-nofollow).

That means I've been reviewing both WordPress plugins and Drupal modules with similar functionality, to see how others have solved these same problems. Most use regular expressions (PHP's preg_* functions, such as preg_match) to match text, and PHP's str_replace to make changes.

I can think of several potential bugs with this. For example, str_replace replaces every occurrence of the search string, so if the same text appears twice in the content, won't it also mangle the copy you didn't mean to touch?
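To make that concern concrete, here is a minimal sketch. The content string and the naive replacement are hypothetical, made up for illustration rather than taken from any particular plugin. The same URL appears both in the href and as the visible link text, and str_replace changes both.

```php
<?php
// Hypothetical content: the same URL is both the href and the visible text.
$content = '<p>Visit <a href="http://example.com">http://example.com</a></p>';

// Naive approach: tack a rel attribute on by replacing the URL string.
// str_replace() has no idea which occurrence is the attribute value.
$changed = str_replace(
    'http://example.com',
    'http://example.com" rel="nofollow',
    $content
);

// The attribute is added, but the visible link text is damaged as well.
echo $changed, "\n";
```

A regular expression can narrow the match, but at that point you're hand-writing a partial HTML parser.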

The improved technique I'm recommending is to use PHP's DOMDocument class instead. You use that class to parse the $content variable, and then you have all the DOM API calls you'd want for manipulating the text. Kudos to the https://github.com/whyte624/wordpress-favicon-links/ plugin for teaching me this trick.

The outline of your processing filter goes like so:

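function xyzzy_links_the_content($content)
{
    try {
        // The first argument is the XML version string; '1.0' is the default.
        $html = new DOMDocument('1.0', 'UTF-8');
        // The meta tag tells the parser the content is UTF-8.
        // The @ silences warnings about markup the parser considers malformed.
        @$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $content);

        // ... process the $html DOM object

        return $html->saveHTML();
    } catch (Exception $e) {
        // If anything goes wrong, fall back to the unmodified content.
        return $content;
    }
}
add_filter('the_content', 'xyzzy_links_the_content');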

With this you have a properly parsed DOM object and you don't have to worry about the encoding of anything. You're manipulating objects, and then when you're done it's serialized back to HTML.
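One caveat on the serializing step: called with no argument, saveHTML() writes out the whole document, including the doctype and the html, head and body wrappers that loadHTML adds around a fragment. A content filter probably wants only the fragment back. A minimal sketch of one way to get that, assuming a hypothetical helper named xyzzy_body_html (not part of the plugins mentioned above), is to serialize each child of the body element on its own:

```php
<?php
// Hypothetical helper: return only the markup inside <body>,
// without the doctype/html/head wrappers.
function xyzzy_body_html(DOMDocument $html)
{
    $body = $html->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes just that node and its descendants.
        $out .= $html->saveHTML($child);
    }
    return $out;
}

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$html = new DOMDocument('1.0', 'UTF-8');
$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'
    . '<p>Hello <a href="http://example.com/">world</a></p>');

$fragment = xyzzy_body_html($html);
echo $fragment, "\n";
```

In the filter, the return statement would then call this helper instead of the bare saveHTML().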

If your processing needs to inspect all "a" tags:

        foreach ($html->getElementsByTagName('a') as $a) {
            // ... process each link
        }

If you want to add an attribute to a specific link, such as target=_blank:

     $a->setAttribute('target', '_blank');
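Putting the loop and setAttribute together, here is a rough sketch of marking only external links. The helper xyzzy_is_external and the host name mysite.example are assumptions for illustration. In a real plugin the site host would more likely come from something like parse_url(home_url(), PHP_URL_HOST). The check uses PHP's parse_url rather than a regexp to pull the host out of each href.

```php
<?php
// Hypothetical helper: a link counts as external when it has a host
// and that host differs from the site's own host.
function xyzzy_is_external($href, $site_host)
{
    $host = parse_url($href, PHP_URL_HOST);
    // Relative URLs such as "/about" have no host component.
    return is_string($host) && strcasecmp($host, $site_host) !== 0;
}

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$html = new DOMDocument('1.0', 'UTF-8');
$html->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'
    . '<p><a href="/about">About</a> <a href="http://other.example/">Other</a></p>');

foreach ($html->getElementsByTagName('a') as $a) {
    if (xyzzy_is_external($a->getAttribute('href'), 'mysite.example')) {
        $a->setAttribute('rel', 'nofollow');
        $a->setAttribute('target', '_blank');
    }
}

$links = $html->getElementsByTagName('a');
$first_rel = $links->item(0)->getAttribute('rel');   // internal link
$second_rel = $links->item(1)->getAttribute('rel');  // external link
echo "first: [$first_rel] second: [$second_rel]\n";
```

Because the attributes are set on DOM nodes, it doesn't matter how the original tag was quoted or spaced, or whether the URL also appears elsewhere in the text.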

Basically with a DOM object you're free to make any HTML manipulation you want.

To learn more about DOMDocument - http://php.net/manual/en/class.domdocument.php