Unraveling the Extract, Transform, Load (ETL) Data Migration Process: Taking a Deep Dive on Load
Episode 112 • 24th January 2024 • Tag1 Team Talks | The Tag1 Consulting Podcast • Tag1 Consulting, Inc.

Shownotes

Join us in this lively episode of Tag1 Team Talks, where our seasoned experts, Mike Ryan (co-creator of Migrate) and Benji Fisher (maintainer of Migrate), and host Janez Urevc unpack the final "Load" segment of the ETL (Extract, Transform, Load) process crucial for Drupal migrations. As they dive into the nitty-gritty, they shed light on the core mechanics of loading data into Drupal, bringing to the fore the remarkable pluggability of Drupal's migration system. Mike Ryan reveals how diverse destination plugins can turn this system into a migration powerhouse.

At the heart of our discussion, the essence of performance optimization isn't lost as our experts explain the magic behind handling one entity at a time during migration. The conversations are peppered with insights that make the daunting migration task feel like a breeze.

We also feature upcoming engaging talks aimed at easing the community's transition from Drupal 7 to Drupal 10. Dive in! This will be an enlightening ride filled with hearty laughs and profound takeaways!

Transcripts

Speaker:

Welcome to Tag1 Team Talks, brought to you by Tag1 Consulting.

Speaker:

With Drupal 7 and Drupal 9 rapidly approaching end of life, we are

Speaker:

hearing people talk about migrating and upgrading more than ever before.

Speaker:

And anyone who's ever been involved with a large-scale migration,

Speaker:

migrating a large site or application from one technology stack to another,

Speaker:

will tell you that it's complex, time consuming, and it demands expertise.

Speaker:

That's why we are bringing you this series of talks, diving deep into

Speaker:

the world of Drupal migrations, and who better to guide us than Tag1's

Speaker:

very own Drupal migration experts.

Speaker:

From the masterminds and maintainers of Drupal's migration tooling, to the

Speaker:

individuals behind the most groundbreaking Drupal migrations, we've got an all

Speaker:

star lineup who'll cover everything you need to know about every aspect

Speaker:

of migrating large scale applications.

Speaker:

This team talk is part of a three-part series about the ETL (Extract,

Speaker:

Transform, and Load) process, which is used by many enterprise migration

Speaker:

systems, Drupal's Migrate included.

Speaker:

In today's episode, we're going to talk about how to use Drupal's

Speaker:

Migrate system to Load data into the destination system, which is usually

Speaker:

Drupal, but not necessarily always.

Speaker:

Be sure to stick around to the end because we're, uh, also going to announce

Speaker:

the next few talks in our series.

Speaker:

Let's dive in.

Speaker:

I'm Janez Urevc, senior engineer here at Tag1 and a long time contributor to

Speaker:

Drupal, and I'm joined today by well known top contributors to Drupal, Benji

Speaker:

Fisher, one of the five current Drupal Migrate core subsystem maintainers,

Speaker:

and Mike Ryan, co creator of Migrate.

Speaker:

Welcome, and thank you for joining me.

Speaker:

In case you didn't already watch or listen to the previous two episodes in

Speaker:

this series about E, Extract, and T, Transform, I'd suggest that you do so.

Speaker:

In the first episode, we, among other things, provided a high level

Speaker:

overview of what ETL stands for.

Speaker:

So, we're not going to cover this today, and we will dive directly

Speaker:

into today's topic, which is...

Speaker:

L, Load.

Speaker:

So Benji, could you tell me and our audience, of course, what

Speaker:

is being done as part of the Load process and how specifically

Speaker:

is that done in Drupal Migrate?

Speaker:

Sure, so almost all of the time, um, the load phase is

Speaker:

going to be creating entities.

Speaker:

Um, Drupal is structured with, um, you know, a fairly consistent content model.

Speaker:

So we can have configuration entities and we can have content entities.

Speaker:

So content entities are things like taxonomy terms, nodes, users.

Speaker:

Um, and, and we'll be creating each of those in a separate migration.

Speaker:

Um, each migration should have just a, a single destination type, so

Speaker:

you'll want separate migrations in the same project, one, one for each

Speaker:

type of entity you're creating.

Speaker:

Um, the configuration entities are often settings.

Speaker:

Um, but, uh, but blocks, for example, each block is a configuration entity,

Speaker:

and you could have one, one migration in your project to create block entities.

Speaker:

Um, but there, uh, there are other things. Those are the most common ones,

Speaker:

but, um, the entire migration system is pluggable, the, um, in the load phase

Speaker:

we have destination plugins, and there are alternatives to, um, to entities,

Speaker:

you might be creating a custom database table, um, and there's, uh, I think one

Speaker:

example we'll be talking about later where we're actually migrating into

Speaker:

the Drupal state system, which, uh, if you drill down a little bit, turns out

Speaker:

to be the key value table in Drupal.

Speaker:

One would think that, you know, Drupal Migrate would only need one destination

Speaker:

for its migrations, which is Drupal.

Speaker:

Um, but Mike, when you were designing Migrate, you decided

Speaker:

to make it pluggable in general, but also to make destinations pluggable.

Speaker:

Um, can you talk about the reasoning, like why, why you left the door open

Speaker:

to, to run migrations that, that store data into anything, basically?

Speaker:

Well, um, well, the main thing was that at the time we originally developed Migrate,

Speaker:

which was Drupal 6, uh, basically every module that managed content in Drupal had

Speaker:

its own database table, its own schema.

Speaker:

There was no general purpose entity system.

Speaker:

So basically each type of data in Drupal needed its own, uh, destination plugin.

Speaker:

Um, and.

Speaker:

So, um, it was just natural to use the same sort of plugin system

Speaker:

as we're using for the extractor.

Speaker:

Um, and of course, once you've got that flexibility, you can start

Speaker:

to think of other ways to use it.

Speaker:

For example, you can have a CSV, uh, destination plugin to export

Speaker:

data using the migration system.

Speaker:

If you want to pull, um, pull Drupal data out into, uh, a format importable

Speaker:

into something else, you can write a migration that extracts your Drupal data,

Speaker:

transforms it into the, um, proper, uh, format, and then a loader that dumps it.
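
The export idea Mike describes can be sketched outside Drupal. The following is a minimal Python model, not Drupal's PHP plugin API: `Destination`, `CsvDestination`, `run_migration` and `to_export_row` are hypothetical names chosen for illustration. It only shows that the pipeline stays the same when the load step is swapped for something that writes CSV.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Protocol

Row = Dict[str, str]


class Destination(Protocol):
    """Anything with import_row() can act as the load step (hypothetical interface)."""

    def import_row(self, row: Row) -> None: ...


class CsvDestination:
    """Hypothetical destination that writes each transformed row to a CSV stream."""

    def __init__(self, stream: io.TextIOBase, fieldnames: list) -> None:
        self._writer = csv.DictWriter(stream, fieldnames=fieldnames)
        self._writer.writeheader()

    def import_row(self, row: Row) -> None:
        self._writer.writerow(row)


def run_migration(
    source: Iterable[Row],
    transform: Callable[[Row], Row],
    destination: Destination,
) -> int:
    """Extract, transform and load one row at a time; return the row count."""
    processed = 0
    for raw in source:
        destination.import_row(transform(raw))
        processed += 1
    return processed


def to_export_row(raw: Row) -> Row:
    """Transform step: rename the ID column and trim the title."""
    return {"id": raw["nid"], "title": raw["title"].strip()}


# Made-up source rows standing in for extracted Drupal data.
source_rows = [{"nid": "1", "title": " Hello "}, {"nid": "2", "title": "World"}]
output = io.StringIO()
processed = run_migration(source_rows, to_export_row, CsvDestination(output, ["id", "title"]))
csv_lines = output.getvalue().splitlines()
```

Replacing `CsvDestination` with any other object that has an `import_row` method leaves `run_migration` untouched, which is the point of making the load step pluggable.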

Speaker:

And even, even inside Drupal now we have, as far as I know, different classes

Speaker:

for different entity types, right?

Speaker:

Like different destination classes.

Speaker:

For nodes, for taxonomy terms.

Speaker:

All right.

Speaker:

There is a, you know, they're built on a general, um, entity

Speaker:

destination, but many, um, there is, of course, a big difference between

Speaker:

content entities and configuration entities, but also even among those.

Speaker:

A number of the, um, of entity types need a little bit of special handling.

Speaker:

Uh, for example, users, you have to deal with, uh, passwords.

Speaker:

So the person writing the migration doesn't necessarily know whether

Speaker:

there's any special processing.

Speaker:

They know they're creating

Speaker:

entities of type user.

Speaker:

So they say content entity, colon user.

Speaker:

And if there is special processing, there'll be a special, um, load

Speaker:

class to, to manage that, and if not, it'll just fall back to the

Speaker:

generic content entity of type user.
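
The fallback Benji describes can be modelled in a few lines. This is a hedged Python sketch, not Drupal's actual plugin manager: the `entity:user` style of ID follows Drupal's destination plugin naming, while `EntityDestination`, `UserDestination`, `SPECIAL_HANDLERS`, `resolve_destination` and the `password_needs_rehash` flag are all hypothetical names invented for illustration.

```python
from typing import Dict, Type


class EntityDestination:
    """Generic destination: no special processing, values pass through unchanged."""

    def __init__(self, entity_type: str) -> None:
        self.entity_type = entity_type

    def prepare(self, row: dict) -> dict:
        return dict(row)


class UserDestination(EntityDestination):
    """Specialised destination: users need extra care around passwords."""

    def prepare(self, row: dict) -> dict:
        prepared = super().prepare(row)
        # Illustrative only: flag rows that carry a password for special handling.
        prepared["password_needs_rehash"] = "pass" in row
        return prepared


# Entity types that have a dedicated class; everything else uses the generic one.
SPECIAL_HANDLERS: Dict[str, Type[EntityDestination]] = {"user": UserDestination}


def resolve_destination(plugin_id: str) -> EntityDestination:
    """Map an ID such as 'entity:user' to a specialised class, else fall back."""
    base, _, entity_type = plugin_id.partition(":")
    if base != "entity" or not entity_type:
        raise ValueError(f"Unsupported destination plugin ID: {plugin_id!r}")
    handler = SPECIAL_HANDLERS.get(entity_type, EntityDestination)
    return handler(entity_type)


user_dest = resolve_destination("entity:user")
term_dest = resolve_destination("entity:taxonomy_term")
```

The person writing the migration only supplies the ID; whether a specialised class exists behind it is decided by the lookup.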

Speaker:

Right.

Speaker:

And, um, we, we haven't touched on that so far, but perhaps we should: most

Speaker:

of your migration logic, such as it is, will be implemented in simple YAML files,

Speaker:

basically migrations are configuration.

Speaker:

Um, if the existing plugins serve the needs of your particular, uh, migration

Speaker:

application, then basically you just write a bunch of YAML, and maybe in

Speaker:

serious migrations, you often need to write an occasional, um, transformer

Speaker:

in PHP, but most of your work is just YAML, so it's very readable.

Speaker:

Very simple to put together.

Speaker:

That isn't really part of the load talk.

Speaker:

That should, but we should cover it somewhere.

Speaker:

Um, when we were preparing for this episode, you also mentioned that, um, like

Speaker:

the fact that we are, when we are storing entities, we do one entity at a time,

Speaker:

and there are very good reasons for that.

Speaker:

Can you also talk about that a little bit?

Speaker:

Well, the whole pipeline, the, uh, handles one entity or one

Speaker:

logical piece of data at a time.

Speaker:

And, um.

Speaker:

There are multiple reasons for that.

Speaker:

One big one is to handle references between, uh, entities, let's say, if you

Speaker:

have a link from one node to another or a link from a node to a taxonomy term.

Speaker:

And historically these, um, the unique identifiers for Drupal entities have been

Speaker:

serial, uh, fields, um, serial numbers.

Speaker:

And when you're creating new entities on your new system, it's, you're,

Speaker:

you're going to end up with new numbers.

Speaker:

If you really, really insist on it and work hard at it,

Speaker:

you may be able to preserve your IDs, but this is, it's not recommended.

Speaker:

It's much simpler to, um, rewrite the references, and to rewrite the references

Speaker:

you can't do it in bulk because you won't know the new reference number.

Speaker:

The pipeline does it one at a time so that you migrate one entity, it gets its

Speaker:

new number, we keep track of the mapping from its old ID to its new ID, and then

Speaker:

when it's time to migrate the reference,

Speaker:

we can fill in the new ID and everything is still pointing

Speaker:

where it's supposed to be.
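
The bookkeeping described here can be sketched as follows. It is a simplified Python model under stated assumptions, not the Migrate API: `IdMap`, `Destination`, `migrate` and the `term_ref` field are hypothetical names, and the starting serial numbers are arbitrary.

```python
from itertools import count
from typing import Dict, Iterable, Optional


class IdMap:
    """Remembers which new ID each old ID was migrated to."""

    def __init__(self) -> None:
        self._old_to_new: Dict[int, int] = {}

    def save(self, old_id: int, new_id: int) -> None:
        self._old_to_new[old_id] = new_id

    def lookup(self, old_id: int) -> Optional[int]:
        return self._old_to_new.get(old_id)


class Destination:
    """Stand-in for the new site: every saved entity gets a fresh serial ID."""

    def __init__(self, start: int) -> None:
        self._serial = count(start)
        self.stored: Dict[int, dict] = {}

    def save(self, values: dict) -> int:
        new_id = next(self._serial)
        self.stored[new_id] = values
        return new_id


def migrate(rows: Iterable[dict], destination: Destination, id_map: IdMap,
            ref_map: Optional[IdMap] = None) -> None:
    """Migrate one row at a time, rewriting references via an earlier ID map."""
    for row in rows:
        values = {"title": row["title"]}
        if ref_map is not None and "term_ref" in row:
            # The referenced term was migrated earlier, so its new ID is known.
            values["term_ref"] = ref_map.lookup(row["term_ref"])
        new_id = destination.save(values)
        id_map.save(row["id"], new_id)


term_map, node_map = IdMap(), IdMap()
terms, nodes = Destination(start=100), Destination(start=500)

# Terms first: old IDs 7 and 9 become 100 and 101 on the new site.
migrate([{"id": 7, "title": "News"}, {"id": 9, "title": "Blog"}], terms, term_map)
# Then nodes: the reference to old term 9 is rewritten to the new term ID.
migrate([{"id": 42, "title": "Hello", "term_ref": 9}], nodes, node_map, ref_map=term_map)
```

Because each new ID only exists after its entity has been saved, the mapping has to be filled in row by row, which is why the reference cannot be rewritten in bulk up front.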

Speaker:

Um, you know, we will talk, we talked more about this in the

Speaker:

Transform talk before this.

Speaker:

And another, I'm sorry.

Speaker:

Yeah, go ahead Benji.

Speaker:

Yeah, another reason to do it one entity at a time is that we want to leverage

Speaker:

the other APIs that Drupal provides.

Speaker:

The Entity API does not give us a method for creating 10 entities at a time.

Speaker:

It gives us methods for creating one entity at a time.

Speaker:

So just to manage the complexity of the Migrate API in

Speaker:

Drupal core, that's a second reason for, for doing it one at a time.

Speaker:

Now, in particular cases, um, if you've got a huge number of things

Speaker:

that you're creating and you're, uh, you know, that your migration

Speaker:

is going to take hours or days.

Speaker:

Um, and you know that this particular part of your project isn't going to

Speaker:

require the sort of references that Mike was talking about, then on a particular

Speaker:

project, it might make sense to have, um, some, some custom code, a custom

Speaker:

destination plugin, for example, that does batch things, 10 or 100 at a time.

Speaker:

Um, but that's not going to go into Drupal core,

Speaker:

because it won't always work and it would be a lot of added complexity.
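
For the project-specific case Benji mentions, a batching destination might look roughly like this. It is a Python sketch with hypothetical names (`BatchingDestination`, `write_batch`), not code from Drupal core or any contrib module, and it assumes the rows need no cross-references, as he notes.

```python
from typing import Callable, List


class BatchingDestination:
    """Buffers rows and hands them to a bulk writer in fixed-size batches."""

    def __init__(self, write_batch: Callable[[List[dict]], None], batch_size: int = 10) -> None:
        self._write_batch = write_batch
        self._batch_size = batch_size
        self._buffer: List[dict] = []

    def import_row(self, row: dict) -> None:
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write whatever is buffered; must be called once more at the end."""
        if self._buffer:
            self._write_batch(list(self._buffer))
            self._buffer.clear()


batch_sizes: List[int] = []
destination = BatchingDestination(lambda batch: batch_sizes.append(len(batch)), batch_size=10)

for number in range(25):
    destination.import_row({"id": number})
destination.flush()  # the last, partial batch
```

The trade-off is visible in the sketch: a row has no destination ID until its batch is flushed, so anything that needs the new ID immediately (such as reference rewriting) no longer works.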

Speaker:

And, uh, one other reason to deal with one entity at a time is, uh, memory.

Speaker:

If you've got a lot of data, you don't want to deal with the whole batch of data

Speaker:

at once, and we are, uh, the, the migrate system is very performance conscious.

Speaker:

It's got some built in memory, um, uh, sort of, uh, I'm not sure what you

Speaker:

would call it, but it, it would, it will recognize if you're running low on memory

Speaker:

and, um, do some purging of internal caches and so on as needed to keep going.

Speaker:

And if you're using, um, Drush to run your migrations as you should, it

Speaker:

can, if necessary, um, respawn a new process, fresh process, if the, uh, if

Speaker:

it's unable to reclaim enough memory.
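
The behaviour Mike outlines (check memory, purge caches, hand over to a fresh process if that is not enough) can be modelled like this. It is a deterministic Python simulation, not how Migrate or Drush are implemented: `run_with_memory_guard`, `FakeProcess`, `run_all`, the limit of 100 units and the 0.85 threshold are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def run_with_memory_guard(
    rows: Sequence[int],
    process: Callable[[int], None],
    memory_usage: Callable[[], float],
    reclaim: Callable[[], None],
    limit: float,
    threshold: float = 0.85,
) -> Tuple[int, bool]:
    """Process rows one at a time; return (rows done, whether all rows finished)."""
    done = 0
    for row in rows:
        process(row)
        done += 1
        if memory_usage() > limit * threshold:
            reclaim()  # purge internal caches first
            if memory_usage() > limit * threshold:
                return done, False  # still too high: stop so a fresh process can resume
    return done, True


class FakeProcess:
    """Simulated worker: part of its memory is reclaimable cache, part is not."""

    def __init__(self) -> None:
        self.cache = 0
        self.leak = 0

    def handle(self, row: int) -> None:
        self.cache += 30
        self.leak += 10

    def usage(self) -> float:
        return self.cache + self.leak

    def reclaim(self) -> None:
        self.cache = 0


def run_all(rows: Sequence[int]) -> List[int]:
    """Keep starting fresh workers until every row has been processed."""
    remaining = list(rows)
    passes: List[int] = []
    while remaining:
        worker = FakeProcess()  # stands in for a newly spawned process
        done, _finished = run_with_memory_guard(
            remaining, worker.handle, worker.usage, worker.reclaim, limit=100
        )
        passes.append(done)
        remaining = remaining[done:]
    return passes


passes = run_all(range(12))
```

In this simulation the first worker gets through nine rows before purging its cache stops helping, and a second worker finishes the remaining three.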

Speaker:

So Migrate will do that automatically, I mean Drush and Migrate will do

Speaker:

that automatically behind the scenes without developer initiating it?

Speaker:

Yes.

Speaker:

That's great.

Speaker:

We are planning to have a talk on performance and I'm sure that we will

Speaker:

talk about these sorts of things, uh, in detail in that episode.

Speaker:

Um, Benji already mentioned core versus contrib.

Speaker:

So what, what do we have in core?

Speaker:

In terms of, uh, the load step and which interesting other things

Speaker:

could we find in contrib space?

Speaker:

So Core has, uh, support for migrating from Drupal 6 or Drupal 7 into modern

Speaker:

Drupal, and so mostly that means entities.

Speaker:

So nodes, taxonomy terms, users,

Speaker:

blocks, um,

Speaker:

and, and then, um, in contrib space, um, the sort of the, the most esoteric

Speaker:

example I know of is a module called Commerce QuickBooks WebConnect, which

Speaker:

uses SOAP to, um, import data from QuickBooks into Drupal and to export

Speaker:

data from Drupal to QuickBooks.

Speaker:

And Lucas Hedding is one of the maintainers of that module

Speaker:

and it's, it's, it's lightly used and I don't think there's a

Speaker:

Drupal 10 compatible version yet.

Speaker:

Um, but I used it on a recent project.

Speaker:

And, and looked at it and it's, uh, it's very interesting in the way it

Speaker:

uses migrate to export data from Drupal.

Speaker:

And I, I think Mike, uh, suggested how this works earlier, but it, it goes

Speaker:

through the, um, the orders one at a time and the, the load plugin it,

Speaker:

it uses, or, or the, the destination plugin it uses for the load stage,

Speaker:

um, exports data about a commerce order into the Drupal State system.

Speaker:

Um, and then, um, a, it, it, it cuts off the migration after processing one row

Speaker:

and then another part of the module takes over and extracts the data from the state

Speaker:

system and generates a SOAP response,

Speaker:

which then gets batched somehow.

Speaker:

So, um, so as Mike said, you, you, you could be exporting

Speaker:

to a CSV file or something.

Speaker:

In this case, we're exporting to the state system.

Speaker:

And then other parts of the module use that to get the data into QuickBooks.

Speaker:

Um, less esoteric than that.

Speaker:

Um, the only other sort of general purpose, uh, destination plugin I

Speaker:

know is in the Migrate Plus module.

Speaker:

There's, uh, an explicit, uh, destination plugin for a custom SQL table.

Speaker:

So if you have custom database tables in your project that you

Speaker:

need to migrate, um, you can use the, uh, SQL table plugin from

Speaker:

uh, Migrate Plus, and that doesn't use the entity system.

Speaker:

It just writes directly to the SQL table.
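
A destination that bypasses the entity layer and writes straight to a table can be sketched with Python's built-in `sqlite3` module. This is an illustration of the idea only, not the Migrate Plus plugin: `TableDestination`, the `legacy_ratings` table and its columns are hypothetical.

```python
import sqlite3
from typing import List


class TableDestination:
    """Writes each row directly into one SQL table, keyed by an ID column."""

    def __init__(self, conn: sqlite3.Connection, table: str, id_field: str,
                 fields: List[str]) -> None:
        self._conn = conn
        self._columns = [id_field] + list(fields)
        placeholders = ", ".join("?" for _ in self._columns)
        # Identifiers come from trusted configuration in this sketch;
        # row values are always bound as parameters.
        self._sql = (
            f"INSERT OR REPLACE INTO {table} "
            f"({', '.join(self._columns)}) VALUES ({placeholders})"
        )

    def import_row(self, row: dict) -> None:
        self._conn.execute(self._sql, [row[column] for column in self._columns])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE legacy_ratings (id INTEGER PRIMARY KEY, score INTEGER, label TEXT)"
)

destination = TableDestination(conn, "legacy_ratings", "id", ["score", "label"])
for row in [
    {"id": 1, "score": 3, "label": "ok"},
    {"id": 2, "score": 5, "label": "great"},
    {"id": 1, "score": 4, "label": "good"},  # re-importing ID 1 updates it
]:
    destination.import_row(row)
conn.commit()

stored = conn.execute(
    "SELECT id, score, label FROM legacy_ratings ORDER BY id"
).fetchall()
```

Keying the write on the ID column means that running the same migration again updates existing rows instead of duplicating them.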

Speaker:

Um, I want to go back to the QuickBooks a little, a little bit, because I find

Speaker:

this approach of migrating into the state system and then doing SOAP requests,

Speaker:

um, very interesting.

Speaker:

Do you, do you know why it was designed this way or should we get Lucas on,

Speaker:

on Team Talks to explain that to us?

Speaker:

So I, I've never asked him about it.

Speaker:

Um, and he, he wasn't the original author of the module, but, uh,

Speaker:

but he, he did some work on it.

Speaker:

Um, but I'm, I'm pretty sure that the, uh, the reason they decided to

Speaker:

use the migrate API for that is that it gives a way of tracking, um, the

Speaker:

original entity ID and the exported ID.

Speaker:

So Drupal has, um, as Mike mentioned, sequential IDs for each order.

Speaker:

QuickBooks has its own way of keeping track of the orders.

Speaker:

And, uh, the Migrate API provides a system for keeping track of which Drupal

Speaker:

ID corresponds to which QuickBooks ID.

Speaker:

And that's, uh, that's one, one of the, um, reasons for using, um, the

Speaker:

Migrate API: if you do need to keep track of, uh, of old and new IDs,

Speaker:

then that's one argument for using the Migrate API rather than just some sort

Speaker:

of custom code for, for exporting data.

Speaker:

That's a great point.

Speaker:

I didn't think about it.

Speaker:

Um, so I remember when, uh, we were still like the Drupal community

Speaker:

was still developing Drupal 8.

Speaker:

Um, it's been,

Speaker:

you know, quite a long process and, um, also the discussion about including

Speaker:

Migrate in core, um, happened at that time, and then eventually the decision

Speaker:

that we will use it to migrate from Drupal 7 to 8. Um, I remember that,

Speaker:

Um, back in those days, MongoDB, like the company behind the Mongo database, um,

Speaker:

wanted to be like a first-class citizen for Drupal, like providing out-of-the-box

Speaker:

support to run your, um, your Drupal site on Mongo instead of MySQL.

Speaker:

And I've been involved with Mongo quite a lot at that time, because I

Speaker:

was working at Examiner and Examiner was using, uh, MongoDB for Drupal 7.

Speaker:

But in Drupal 7, you still had to use, uh, a MySQL database

Speaker:

next to it.

Speaker:

So you had two databases, two sets of, uh, of backups and all that.

Speaker:

Um, so they wanted to provide the, the ability for Mongo to be the sole

Speaker:

database for Drupal 8 and, and on.

Speaker:

And I remember that, uh, Chx was working on that.

Speaker:

Um, and he was really excited about Migrate because he realized that

Speaker:

if we would be using Migrate as a standard to migrate from D7 to D8,

Speaker:

you would basically just swap the destination plugin and instead of loading

Speaker:

into MySQL, you would load in MongoDB.

Speaker:

Um, but then, then MongoDB company lost interest and, and, and stopped

Speaker:

funding Chx to do that work.

Speaker:

So that work was never completed.

Speaker:

Um, it was like in a very early alpha stage.

Speaker:

And the module is, is still on D.o.

Speaker:

And I'm not sure what state it is in at the moment, but, um, A, it would

Speaker:

have been very cool to have this possibility and, um, B again, proves

Speaker:

that, um, having the Load part of the Migrate pluggable is very useful.

Speaker:

Uh, do you two have any other, like,

Speaker:

unusual or interesting cases related to the L part, uh, that you've seen in the past

Speaker:

or maybe any ideas how it could be used, but you've not seen it used that way yet.

Speaker:

I almost always create entities.

Speaker:

I don't think I have any other examples of clever uses of the Load stage.

Speaker:

It is a little exotic, you know, beyond,

Speaker:

you know, if you want to export some CSVs for some reason.

Speaker:

Yeah, it could be used for exporting, like similar to how WordPress exports

Speaker:

a whole site as XML, you could use it.

Speaker:

Yeah.

Speaker:

Although views export is probably easier for most of those cases.

Speaker:

That's true.

Speaker:

So I think that that's it for the L part.

Speaker:

Um, this is also the end of the last, uh, episode in our ETL mini series.

Speaker:

Uh, but we have some great team talks lined up.

Speaker:

Uh, our goal is to put out one per week over the next few months to

Speaker:

support the community in the migration process from Drupal 7 to Drupal 10.

Speaker:

Um, and as part of that, we're planning to talk about performance, which is

Speaker:

something we care deeply about at Tag1.

Speaker:

Um, and of course it applies to migrations as well, especially if

Speaker:

you're handling really large data sets.

Speaker:

Um, a full data migration can easily take over 12 hours or even multiple days.

Speaker:

Um, and we'll do a handful of talks on this topic, including how

Speaker:

to profile and tune a migration.

Speaker:

We'll also do a talk on incremental migrations,

Speaker:

where you can include or exclude things, uh, and run a migration on a subset

Speaker:

of data to make it perform better.

Speaker:

And every project owner wants their migration to be a success.

Speaker:

We will dedicate an episode to discuss the most important factors for a

Speaker:

successful Drupal 7 to 10 migration in order to help you successfully

Speaker:

navigate your migration project.

Speaker:

And other topics that we are planning to cover include porting custom code

Speaker:

from Drupal 7 to Drupal 10, uh, the future of migrate tooling, how to

Speaker:

port the theme and, uh, so much more.

Speaker:

We, we hope that you'll tune in and enjoy our upcoming team talks.

Speaker:

A huge thank you to the Tag1 Team.

Speaker:

Thank you, Benji Fisher and Mike Ryan.

Speaker:

Um, make sure that you check out the other segments in this series.

Speaker:

There will be links to them in the show notes, along with all the

Speaker:

other links that we mentioned today.

Speaker:

If you like this talk, please remember to upvote, subscribe, and share it.

Speaker:

Uh, you can check our past talks at tag1.com/ttt.

Speaker:

That's three Ts for Tag1 Team Talks.

Speaker:

As always, we'd love to hear your feedback and any topic suggestions.

Speaker:

You can write to us at ttt@tag1.com.

Speaker:

A big thank you to both of our guests and to everyone who tuned in.

Speaker:

Thank you for joining us.
