Monday, December 2, 2013

National Library of the Netherlands - Migrating multilingual xHTML from a shoe-box to Drupal 7

In fact I am a very lazy person. That's why I studied theoretical physics. In school I simply compared the number and thickness of the books for the different studies, and concluded that physics must be the one where this laziness would prosper most.
Feynman diagrams
The thing is, as soon as you understand a part of physics, you can stop reading. Not so with history, not so with all other subjects. They require hard mental labour.
(I still have to find some apology for studying philosophy later on, so any suggestions are welcome...)

Hail Laziness
Laziness is one of the least respected human inclinations. If there were no lazy people, no one would have invented the car, the combine, the phone et cetera. We would all be working our asses off all day in the fields, pushing little grains out of their stalks one by one. Books would have no indexes. Skyscrapers no elevators. Chairs no legs.

Royal Library of the Netherlands
One fine day (early 2012) we got something like the following request from the Royal Library.

On the left we've got twenty shoeboxes of HTML, and on the right we have a brand new core structure for the website in Drupal 7: content types, Display Suite, Views and all.
Could you please import the information from the shoeboxes into the Drupal 7 system? Thankyouverymuch and Yesterday would be fine.

Being Lazy, we simply said: Yes, of course we can. There is nothing we can't do.
And then imagine the soundtrack of Jaws in the background...

Delusions of simplicity
We said so because the concept in itself is very simple.

      Here we have content A, B and C. In booklet X.
      There, on the other side, is a new booklet Y.
      Now copy A, B and C to booklet Y.

      X( A B C )         -->        Y( A B C )

What the hell were you thinking!
The source was antediluvian xHTML, more or less templated. From it we had to sift images, downloads and all other digital objects, parse the content of all these differently structured pages, and then load everything into the completely differently structured Drupal 7 system.

Book of Hours of Simon de Varie. Paris, 1455;  Tours, c. 1455.
Vellum, 99 leaves, 116 x 85 mm. - 74 G 37a, fol. 1v-2r
We're talking repeatable field collections inside repeatable field collections, migrating source images to different caching formats, programmatically generating and configuring view modes, and showing them via Display Suite and the Media module, which at the time was still in the development phase. And apart from that, development was still going on on the target Drupal site.
Oh Yeah, and did I mention the multilingual part? English and Dutch.

So it's a project with a big red rubber stamped warning on the cover:

      Send project manager to 
      high-altitude training camp
      before start!

Or, stated differently, in common ICT jargon: Blood, Sweat and Tears ahead.

Analysis: Dance like a butterfly...
The approach we used was the web development equivalent of dance like a butterfly and sting like a bee. There is no way to foresee all the problems in this kind of large project, which converts between two totally different systems. In the quotation we made, which had to be fixed price, we tried to include some bleeding control, but we knew beforehand that we were entering the Swamp part of our Portfolio.
It proved especially difficult, for instance, to emulate the maintenance features of the old system, which was an XML-based editor. There was no way we could migrate these features to the new D7 setup. We were down to analysing every feature and approaching it from a totally different angle in D7.
Another unexpectedly tricky part was recalculating all internal links in the site. We had to iterate over all pages approximately twenty times before we had all link types covered. Some were Linkit links, some entity references, some link fields, et cetera.
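For the curious, one pass of that link recalculation can be sketched roughly like this. It is a minimal illustration and not the code we actually ran: the function name rewrite_internal_links, the path map and the node paths are all made up for the example, and it only handles plain double-quoted href attributes in body text, not Linkit, entity reference or link fields.

```php
<?php
/**
 * Rewrites internal href values in an HTML fragment using a path map.
 *
 * Hypothetical helper for illustration only. $path_map maps old site
 * paths to new Drupal paths (for example '/en/old.html' => 'node/12').
 * Site-internal links that are not in the map are left untouched and
 * collected in $unresolved, so the next iteration knows what is missing.
 */
function rewrite_internal_links($html, array $path_map, array &$unresolved) {
  return preg_replace_callback(
    '/href="([^"]*)"/i',
    function ($m) use ($path_map, &$unresolved) {
      $old = $m[1];
      if (isset($path_map[$old])) {
        // Known page: point the link at its new Drupal path.
        return 'href="/' . $path_map[$old] . '"';
      }
      // Unknown path starting with a slash: record it for the next round.
      if (strpos($old, '/') === 0) {
        $unresolved[] = $old;
      }
      // External or unknown links are returned unchanged.
      return $m[0];
    },
    $html
  );
}
```

Run it over every migrated body, look at what ends up in $unresolved, extend the map, and run it again. That loop is more or less where the twenty iterations went.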
Don't get me started...
For the sake of your mental health I will not elaborate on the technical details (anyone interested can contact me directly), but it's a lot like this:
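    // Save the raw image data as a managed file; rename on filename clashes.
    $file2 = file_save_data($image2, 'public://' . $filename2, FILE_EXISTS_RENAME);
    // Attach the file object (cast to an array) to the image field at delta 1.
    $field_collection_item->field_afbeelding = array(LANGUAGE_NONE => array('1' => (array) $file2));

    // Copy the alt and title texts from the parsed source image object.
    $field_collection_item->field_alttekst[LANGUAGE_NONE][0]['value'] = $animageobject->alt;
    $field_collection_item->field_titletekst[LANGUAGE_NONE][0]['value'] = $animageobject->title;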

and then that, for days on end...
(And as for the bleeding part: Whoever thought up the idea of littering PHP code with all these dollar signs?)

5% inspiration, 99% iteration
I think we did about ten migrations. And after every migration the Royal Library checked the new site and discovered some new features or parts that weren't specified but certainly needed to be migrated as well. The Math was never right!

Never again!
And when you are finally finished, and the site is migrated and the Royal Library is happy, for the first three days you say to yourself: Never again. Behind every mountain there was another mountain, behind every tree there was a madman with an axe, behind every solution there appeared a brand new grinning unexpected problem produced by the very same solution, with five other problems stacked up its sleeve.
But on the fourth day you realise how much fun we had, how much we learnt about the internal workings of Drupal.

Geek sucks
I sometimes mind the Geeky part of my work, days of trying to get a certain piece of code right, minimal pixel adjustment feuds with the designers, that stuff. But when the subject matter itself is something like the Royal Library content, or digital heritage projects in general, I might just possibly be able to live with that.