So I’m working through the slow, iterative process of helping my friend Mark convert his site to WordPress. The whole thing is a mission of mercy. After hearing him groan plaintively for over a year about the drudgery of changing the design of his site by hand, I decided that it was high time to stage an intervention.
See, his website is completely Web 1.0. I mean it’s totally, irrefutably, undeniably a monument to the pinnacle of web design technology circa 1997. I mean it’s got everything. Tables used as layout elements; <big> tags; <font> tags; inline styles applied on spans of text without any kind of selectors; repeated non-breaking spaces used for horizontal spacing; repeated <br> tags used for vertical spacing; tables within tables within tables (there’s a design element on the left border that is made to look like a film strip, but it’s actually a highly nested set of tables); 80% of the page weight is used for stylistic, non-semantic markup. Yeah, it’s got it all.
I will give the guy his due, though. When he started the site so many years ago, 1) he knew little about HTML, so 2) when he finally found a template he liked (most likely a template supplied by whatever editor he used), he touched it up here and there to personalize it, 3) saved the template, and then 4) used that template for every page on his site. Meaning he’d open the template file, scroll down into the markup, insert his page title in the right place, scroll some more, insert the date, scroll down some more, and begin writing his content. Then he’d save it to a new HTML file, upload it to his site with FTP, and then link it by hand into his index pages. Biff, bam, boom.
That kind of workflow was the state of the art in the mid to late nineties. It was par for the course for those of us who, at the time, had neither the knowledge nor the access to make dynamically generated websites. It was all hand-work with text editors, or for those lucky enough, a pricey WYSIWYG site editor like PageMill. But for most of us, our only salvation was that our websites were marvelously sparse and simple. At my worst, with my redesign of “The Farm”, a site I maintained while in North Carolina, I had a site with fewer than 50 pages; ridiculously easy to maintain.
But how about 866 pages?
This is the number of unique HTML files Mark has to maintain. In order for him to, at the least, convert his site to a place where he can simplify the markup and rely on external stylesheets to completely change the appearance (a la CSS Zen Garden), he’d have to individually open 866 HTML pages, find the content in that tag soup, copy the text, open the new template, paste the text, save that to a new file, and link it into his new site. Screw that.
And that’s where I stepped in. Coming fresh from the kill on my own site’s conversion, I took this as a chance to help a brutha out. Granted, my entries in my old Sojournal engine were just rows in a database instead of hardcoded HTML, which made it ridiculously trivial to write the conversion scripts that go from a DB dump to a WordPress import file. Still, with that tooling already written, converting Mark’s site to something more modern (like WordPress) was already half-done. I just needed to get all the data together somehow.
Our original plan was to build a generic template text file into which he could paste each page’s data. I’d then write the scripting to iterate through all those saved files and cull the data from them into a CSV, which would then be used as input to my CSV-to-WordPress import file conversion tool. And then we’d import that into a working WordPress install. Biff…Bam…
Boom. I didn’t count on there being 866 files for Mark to pore through, doing the copy-paste, copy-paste dance. That’s not a homework assignment; that’s punishment. That’s insanity. So after doing 25 pages, he told me as much. And I considered it, and decided that yes indeed it was insanity. So we changed our tactic.
I took a moment to investigate a random sample of his pages, digging deep into the markup to find any similarities. What I found was that, quite remarkably, Mark stuck to the same template for the entirety of his site’s existence with very few exceptions. That became very helpful indeed, because that meant his site, and most of its 866 pages, were easily scrapable, meaning I could write scripting to open each page, navigate to where the data is supposed to be, pull it, clean it up, and dump it to a CSV for me automatically. Boom. Boom. Boom!
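The loop described above (open each page, pull the fields, dump a row to a CSV) can be sketched roughly like this. It is only an illustration, not the actual script: the method name `pages_to_csv`, the column list, and the idea of passing the extraction step in as a block are all assumptions made for the example. Pages that don’t match the template are collected separately so they can be checked by hand.

```ruby
require "csv"

# pages:   hash of filename => raw HTML string (in practice, read from disk).
# columns: CSV column order; an assumption based on the fields this post lists.
# The block receives one page's HTML and returns a hash of extracted fields,
# or nil when the page does not follow the expected template.
def pages_to_csv(pages, columns: %w[filename date title content])
  skipped = []
  csv = CSV.generate do |out|
    out << columns
    pages.each do |filename, html|
      fields = yield(html)
      if fields.nil?
        skipped << filename # corner case: review this page by hand
        next
      end
      row = fields.merge("filename" => filename)
      out << columns.map { |name| row[name] }
    end
  end
  [csv, skipped]
end

# Usage with a stand-in extractor (a real one would walk the parsed markup).
pages = {
  "trip.html" => "<big>My Trip</big>",
  "odd.html"  => "<h1>Different layout</h1>",
}
csv, skipped = pages_to_csv(pages) do |html|
  match = html.match(%r{<big>(.*?)</big>})
  match && { "title" => match[1] }
end
puts csv
puts "Needs a manual look: #{skipped.join(', ')}"
```

Keeping the extraction step separate from the loop makes it easy to rerun the whole page set while tweaking only the part that understands the template.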
So far, this has been the part of the whole project that’s taken me the most time. Web scraping is a common practice; it’s the easiest way to gather data from websites that don’t offer it up in more convenient forms. This is my first web scraping project ever, and learning the concepts and the tools to do it took time. But now I know how to do it, and dammit, now I wish I’d done it years ago. So incredibly useful.
My language of choice this time around was Ruby; there’s a library (a gem) for Ruby called Nokogiri which makes quick work of grabbing page content. You can parse HTML markup and search page nodes either by XPath selectors or by CSS selectors. You can manipulate nodes, change their attributes, move them around, remove them, add them back, give them content; you name it, Nokogiri can do it. And this became my Swiss Army knife.
After building the basic scripting, I spent a week or so doing test runs on the entire pageset, taking note of the weird corner cases where either things weren’t where I expected, or where the markup varied just enough that the cleanup and normalization code didn’t do its job right. But finally, last weekend, I was satisfied. I produced a final run on Mark’s entire site with each page broken out into its pertinent bits: filename, date posted, title, content, and image thumbnail filename. I also generated some extra data such as permalinks. I even added code to convert all of the embedded links between pages to their future WordPress locations in an effort to maintain internal consistency (I’m proud of this part). Finally, I converted that CSV into a spreadsheet for easier manipulation in OpenOffice, fixed some obvious problems (missing titles, no page content, etc.), color-coded where Mark needed to provide more data (like categories, tags, and some empty titles), and sent him the worksheet.
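The link conversion is the kind of thing that’s easier to show than describe. What follows is a simplified sketch of the idea rather than the code I actually ran: it assumes a lookup table from old filenames to new permalinks, and both the permalink format and the helper name `rewrite_internal_links` are hypothetical. External links are left alone, and `#fragment` anchors are carried over.

```ruby
# old_to_new: hash mapping old page filenames to their (hypothetical)
# future WordPress permalinks.
def rewrite_internal_links(html, old_to_new)
  html.gsub(/href=(["'])([^"']+)\1/i) do
    whole  = $&
    quote  = $1
    target = $2

    # Leave absolute URLs (http:, mailto:, //host/...) untouched.
    next whole if target.match?(%r{\A(?:[a-z][a-z0-9+.-]*:|//)}i)

    path, fragment = target.split("#", 2)
    new_path = old_to_new[File.basename(path)]
    next whole unless new_path # unknown target: keep the original link

    suffix = fragment ? "#" + fragment : ""
    "href=#{quote}#{new_path}#{suffix}#{quote}"
  end
end

permalinks = { "trip.html" => "/1999/04/my-trip/" }
page = '<a href="trip.html#photos">photos</a> and <a href="http://example.com/trip.html">elsewhere</a>'
puts rewrite_internal_links(page, permalinks)
```

A regular expression is good enough for a sketch like this; on messy real-world markup, walking the `a` nodes of a parsed document would be the sturdier choice.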
The ball’s now in his court, so I’m taking some time off. The project’s about 65% done. Soon he’ll be done with his homework and I’ll work on importing it. Hopefully we can get everything up and running in short order. Then, he can take as long as he wants on mangling his WordPress templates while redesigning his site.
That’s his job.