Unraveling the ETL Data Migration Process - Understanding Transform - Tag1 Team Talks

Speaker: 00:00:03

Welcome to Tag1 Team Talks, brought to you by Tag1 Consulting.

Speaker: 00:00:07

With Drupal 7 rapidly approaching end of life and Drupal 9 already end of life, we are

Speaker: 00:00:12

hearing people talk about migrating and upgrading more than ever before.

Speaker: 00:00:17

And anyone who's ever been involved with a large-scale migration, migrating

Speaker: 00:00:22

a large site or application from one technology stack to another will

Speaker: 00:00:27

tell you that it's complex, time consuming, and it demands expertise.

Speaker: 00:00:32

That's why we're bringing you this series of talks,

Speaker: 00:00:35

diving deep into the world of Drupal migrations.

Speaker: 00:00:38

And who better to guide us than Tag1's very own Drupal migration experts.

Speaker: 00:00:44

From the masterminds and maintainers of Drupal's migration tooling to the

Speaker: 00:00:49

individuals behind the most groundbreaking Drupal migrations, we've got an all

Speaker: 00:00:54

star lineup who'll cover everything you need to know about every aspect

Speaker: 00:00:59

of migrating large scale applications.

Speaker: 00:01:03

This Team Talk is part of a three-part series about the ETL (extract,

Speaker: 00:01:09

transform, and load) process, which is used by many enterprise migration

Speaker: 00:01:14

systems, Drupal's Migrate included.

Speaker: 00:01:17

In today's episode, we're going to talk about how to use Drupal's Migrate

Speaker: 00:01:21

system to transform the data before loading it into Drupal's database.

Speaker: 00:01:25

Be sure to stick around to the end because we are going to announce

Speaker: 00:01:29

the next few talks in our series.

Speaker: 00:01:32

Let's dive in.

Speaker: 00:01:34

I'm Janez Urevc, senior engineer here at Tag1, and a

Speaker: 00:01:37

longtime contributor to Drupal.

Speaker: 00:01:39

I'm joined today by two well-known top contributors to Drupal: Benji

Speaker: 00:01:43

Fisher, one of the five current Drupal Migrate core system maintainers.

Speaker: 00:01:47

And Mike Ryan, co-creator of Migrate.

Speaker: 00:01:50

Welcome.

Speaker: 00:01:51

Thank you both for joining me.

Speaker: 00:01:53

Thanks for having us.

Speaker: 00:01:55

We're glad to have you.

Speaker: 00:01:58

Before we dive in, I would just like to mention that in case you didn't

Speaker: 00:02:01

already watch or listen to the previous episode in this series about

Speaker: 00:02:05

E, Extract, we'd suggest that you do.

Speaker: 00:02:08

In that episode, we, among other things, provided a high-level overview

Speaker: 00:02:12

of what ETL stands for, so we'll not repeat that in this episode.

Speaker: 00:02:18

Now, finally, let's dive into today's topic, which is T, Transform.

Speaker: 00:02:24

Mike, could you tell us what is being done as part of the transform phase

Speaker: 00:02:28

in general and how Drupal does it?

Speaker: 00:02:32

Is it similar to how other enterprise systems do it or are

Speaker: 00:02:36

there any specialties to it?

Speaker: 00:02:40

One difference from classic ETL is that classic ETL usually works in bulk.

Speaker: 00:02:45

You extract all your data into a big blob.

Speaker: 00:02:49

Then you run it through a transformer, which transforms everything.

Speaker: 00:02:53

And then you run it into a loader, which does a bulk load.

Speaker: 00:02:57

Our approach is to run through the data one logical row at a time.

Speaker: 00:03:04

We say row because most often we're dealing with databases as our sources.

Speaker: 00:03:10

But

Speaker: 00:03:12

technically, it could be any kind of source, such as a web service or a CSV file.

Speaker: 00:03:19

Basically, we use the Drupal plugin system.

Speaker: 00:03:25

For a given pipeline, each field that is being run through

Speaker: 00:03:31

the pipeline can go through any number of transformers. Because

Speaker: 00:03:36

they're plugins, it's very flexible.

Speaker: 00:03:39

There's a YAML format you can use to write your migrations, which

Speaker: 00:03:44

specifies for each field, what plugins it's going to transform with

Speaker: 00:03:49

and whatever configuration you add.

Speaker: 00:03:52

So it takes the output of the source plugin, and all the source plugins,

Speaker: 00:03:56

regardless of the source (database, CSV, et cetera),

Speaker: 00:04:00

produce a common data structure, which feeds into the transformer.

Speaker: 00:04:05

And the transform pipeline will take one row from that.

Speaker: 00:04:10

And it will go through each piece of the transform step,

Speaker: 00:04:16

which we call process in Drupal, and apply the transformers.

Speaker: 00:04:28

And each transformer can actually take multiple fields from the

Speaker: 00:04:34

source row, or it can take none.

Speaker: 00:04:37

You might use a processor that simply sets a constant value.

Speaker: 00:04:43

And the transformers can be very flexible and for the most part,

Speaker: 00:04:48

they're not very Drupal dependent.

Speaker: 00:04:52

You'll do a lot of string manipulation, for example.
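
A minimal Python sketch of the row-at-a-time idea described above. It is an illustration only, not Drupal's Migrate API (which is PHP plugins configured in YAML); the helper names `get`, `constant`, `apply`, and `transform`, and the sample field names, are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]
Step = Callable[[Any, Row], Any]  # (current value, whole source row) -> new value


def get(source_field: str) -> Step:
    """Start a chain by reading one field from the source row."""
    return lambda _value, row: row[source_field]


def constant(value: Any) -> Step:
    """Ignore the source row entirely and always return the same value."""
    return lambda _value, _row: value


def apply(func: Callable[[Any], Any]) -> Step:
    """Wrap a plain one-argument function as a pipeline step."""
    return lambda value, _row: func(value)


def transform(rows: Iterable[Row], process: Dict[str, List[Step]]) -> Iterator[Row]:
    """Handle one source row at a time; each destination field has its own chain."""
    for row in rows:
        destination: Row = {}
        for dest_field, steps in process.items():
            value = None
            for step in steps:
                value = step(value, row)  # output of one step feeds the next
            destination[dest_field] = value
        yield destination


# Hypothetical source rows and per-field process definition.
source_rows = [{"headline": "  Hello world  ", "uid": 7}]
process = {
    "title": [get("headline"), apply(str.strip)],  # string manipulation
    "author": [get("uid")],                        # straight copy
    "type": [constant("article")],                 # uses no source field at all
}
results = list(transform(source_rows, process))
print(results)
```

Because `transform` is a generator, no more than one row is held in flight at a time, which is the contrast with the bulk approach mentioned earlier.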

Speaker: 00:04:57

Let's see, I'm not sure what else there is to say about the general process.

Speaker: 00:05:02

But

Speaker: 00:05:03

One thing I'd like to add at this point is that sometimes we cheat.

Speaker: 00:05:08

We don't strictly follow the ETL paradigm.

Speaker: 00:05:11

But we take a peek at the database.

Speaker: 00:05:13

So for example, we might look for an existing taxonomy term that

Speaker: 00:05:17

has the name dessert, or we might check how the editor is configured.

Speaker: 00:05:22

So that's one way in which it's very Drupal specific and doesn't

Speaker: 00:05:26

strictly follow the ETL paradigm.

Speaker: 00:05:30

You have complete flexibility.

Speaker: 00:05:32

You can do anything you want.

Speaker: 00:05:35

Good or bad.

Speaker: 00:05:36

Like always.

Speaker: 00:05:40

And maybe while we're mentioning bad things to do in these processors,

Speaker: 00:05:46

it should be noted that this pipeline is run for each source row

Speaker: 00:05:52

in your data, and when dealing with multiple value fields, it might run

Speaker: 00:05:59

several times for one source row.

Speaker: 00:06:01

So the processing pipeline is a key place to watch performance.

Speaker: 00:06:09

One slow processor will kill the overall migration process.

Speaker: 00:06:16

Makes sense.

Speaker: 00:06:17

Because it could be run a lot of times and that adds up, right?

Speaker: 00:06:21

Oh, thousands, millions.

Speaker: 00:06:24

Yeah.

Speaker: 00:06:26

And it can take days as we discussed before.

Speaker: 00:06:29

So Benji, I heard you state in the past that the transform stage is the most

Speaker: 00:06:35

interesting part of the migration and I know for a fact that you are probably

Speaker: 00:06:41

the most excited about it in the whole

Speaker: 00:06:45

ETL Migrate world.

Speaker: 00:06:48

Why is that?

Speaker: 00:06:50

Yeah, you're right.

Speaker: 00:06:51

And this is something that I decided when I first started working on migrations.

Speaker: 00:06:55

And by the way, I am the most junior member of the current maintainers,

Speaker: 00:07:01

and I have a lot less experience than most of them, or than Mike, so I defer

Speaker: 00:07:06

to Mike on questions of experience and performance in large scale migrations.

Speaker: 00:07:13

But I do have some pretty strong opinions about the transform

Speaker: 00:07:16

stage, the process plugins.

Speaker: 00:07:19

The first reason that it's the most interesting is that any migration

Speaker: 00:07:23

project will be broken up into a bunch of different migrations.

Speaker: 00:07:28

And each one of those migrations will have a single source

Speaker: 00:07:31

and a single destination.

Speaker: 00:07:33

But any one migration has many fields.

Speaker: 00:07:37

So if you have a migration for your article nodes, it'll have a body field.

Speaker: 00:07:42

It'll have a couple of timestamps.

Speaker: 00:07:44

It might have taxonomy and images and so on.

Speaker: 00:07:46

Each one of those fields is going to have at least one process plugin.

Speaker: 00:07:52

One transformer, as Mike described them, and some fields will

Speaker: 00:07:56

have several transformations.

Speaker: 00:08:00

In that sense, it's where the most variety is.

Speaker: 00:08:04

Each migration, again, has one source plugin, one destination plugin,

Speaker: 00:08:10

but can have many transformation plugins or process plugins.

Speaker: 00:08:16

The second thing is that the transform stage, the process

Speaker: 00:08:21

plugins, is where you have the most opportunity for reusing your code.

Speaker: 00:08:28

So if you look at the source plugin, it has to understand whatever cruft

Speaker: 00:08:32

is involved in your source data, the site you're migrating from.

Speaker: 00:08:38

And the only time you're going to be able to reuse a source plugin is

Speaker: 00:08:42

if you have the same type of source.

Speaker: 00:08:45

So once you've written a source plugin for a WordPress XML file, you can reuse that.

Speaker: 00:08:53

And once you've written a source plugin for Drupal 6 or

Speaker: 00:08:58

Drupal 7, you can reuse that.

Speaker: 00:09:00

With the destination plugin, you're almost always migrating into Drupal entities.

Speaker: 00:09:05

They could be taxonomy terms or nodes; menu links are entities too,

Speaker: 00:09:11

and the core migration system already understands the destination.

Speaker: 00:09:15

So that's already done.

Speaker: 00:09:17

But getting from one to the other is, in my opinion, the interesting

Speaker: 00:09:23

part and the part that has the most opportunity for reusing code.

Speaker: 00:09:28

So that's why I think that the transform stage is the most interesting.

Speaker: 00:09:33

Yeah, for most migrations, you'll find that for the extract

Speaker: 00:09:42

and the load phases you simply need to use

Speaker: 00:09:46

core plugins and some configuration.

Speaker: 00:09:50

You don't usually need to do very much PHP coding.

Speaker: 00:09:54

It's the process plugins where you're most likely to need to write your own

Speaker: 00:10:00

plugins, write your own application logic, because that's where

Speaker: 00:10:07

you're transmogrifying your data

Speaker: 00:10:09

into the new system.

Speaker: 00:10:11

Although there are some people who prefer to do it all in the source plugin.

Speaker: 00:10:15

They'll just write all their custom PHP there and prepare everything

Speaker: 00:10:21

so that it's ready to be imported.

Speaker: 00:10:23

And again, I don't like that approach because you can't reuse

Speaker: 00:10:27

the code if you do it that way.

Speaker: 00:10:31

Yeah, it makes it way harder to reuse it.

Speaker: 00:10:34

It's also against the ETL paradigm, because then you're

Speaker: 00:10:39

basically throwing away this separation of different phases that

Speaker: 00:10:44

we're trying to introduce here.

Speaker: 00:10:49

So to be a little bit more concrete, what would be the most common

Speaker: 00:10:54

transform operations in a migration?

Speaker: 00:10:57

Like what would we do in transform process plugins?

Speaker: 00:11:03

Yeah, so by far the most common is just a straight copy.

Speaker: 00:11:07

You have a text field, and you pass it over to the new text field,

Speaker: 00:11:13

which often has the same field name.

Speaker: 00:11:16

Sometimes you decide to change that as part of your site redesign.

Speaker: 00:11:20

That's the most common.

Speaker: 00:11:23

And that's almost not using a transform plugin at all.

Speaker: 00:11:28

It's technically using the get plugin, but it's not doing any transformation.

Speaker: 00:11:34

Another common thing is that your source has a comma separated list of values,

Speaker: 00:11:41

and you split that into pieces, and you convert each word into a taxonomy term ID.
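
A minimal Python sketch of that split-then-convert chain. It only illustrates the idea and is not Drupal's plugin API; the `TERM_IDS` table and the helper names are hypothetical, and skipping unknown names is an assumption made for brevity.

```python
from typing import Dict, List

# Hypothetical lookup table: term name -> term ID on the destination site.
TERM_IDS: Dict[str, int] = {"dessert": 11, "vegan": 12, "quick": 13}


def explode(value: str, delimiter: str = ",") -> List[str]:
    """Split a delimited string, trimming whitespace and dropping empty pieces."""
    return [piece.strip() for piece in value.split(delimiter) if piece.strip()]


def names_to_term_ids(names: List[str], term_ids: Dict[str, int]) -> List[int]:
    """Convert each known name to its ID; unknown names are skipped in this sketch."""
    return [term_ids[name.lower()] for name in names if name.lower() in term_ids]


# One source value becomes a multi-value list of term references.
tags = names_to_term_ids(explode("Dessert, quick,, unknown"), TERM_IDS)
```

A real migration would have to decide what to do with names that have no matching term: skip them, fail the row, or create the term on the fly.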

Speaker: 00:11:48

So that's something that comes up pretty commonly.

Speaker: 00:11:51

Another really important one: since Drupal deals with structured

Speaker: 00:11:56

data, you might have references to other nodes, other taxonomy terms

Speaker: 00:12:03

identified by their entity IDs.

Speaker: 00:12:07

And if those entity IDs are changing as they often do in a complex

Speaker: 00:12:12

migration, then you have to translate the old entity ID, the ID on the

Speaker: 00:12:17

source system, to the new entity ID.

Speaker: 00:12:20

And that's possible because the migration system keeps track

Speaker: 00:12:25

of the old and new entity IDs.

Speaker: 00:12:28

So that's a really important one.
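
A minimal Python sketch of translating references through a record of old and new IDs. The `IdMap` class is hypothetical and stands in for the bookkeeping the migration system does; it is not Drupal's actual implementation.

```python
from typing import Dict, Optional


class IdMap:
    """Remembers which destination ID each source ID was migrated to."""

    def __init__(self) -> None:
        self._map: Dict[int, int] = {}

    def record(self, source_id: int, destination_id: int) -> None:
        """Called once per migrated item, right after it is saved."""
        self._map[source_id] = destination_id

    def lookup(self, source_id: int) -> Optional[int]:
        """Return the new ID, or None if that source item was never migrated."""
        return self._map.get(source_id)


article_map = IdMap()
article_map.record(123, 456)  # old node 123 became new node 456

# A later row refers to the old ID; translate it before saving.
old_row = {"nid": 124, "related_article": 123}
new_row = {"related_article": article_map.lookup(old_row["related_article"])}
```

The order of migrations matters with this approach: the referenced items have to be migrated first, or the lookup has nothing to return.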

Speaker: 00:12:32

Another thing you might want to do is make your site

Speaker: 00:12:37

better as you're transforming it.

Speaker: 00:12:39

So if you see that people are consistently using the CSS classes font-bold,

Speaker: 00:12:45

size-large, and color-red, well, you can replace that with mytheme-warning.

Speaker: 00:12:53

And suddenly your CSS markup is a lot more semantic and a lot

Speaker: 00:12:59

easier to maintain in the long run.

Speaker: 00:13:03

Another common one is to convert date formats, like maybe they're in year

Speaker: 00:13:07

month day format, and you want to convert it to a timestamp or vice versa.
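
A minimal Python sketch of that date conversion in both directions. Treating the source dates as UTC is an assumption; a real migration would need to know the source site's time zone.

```python
from datetime import datetime, timezone


def ymd_to_timestamp(value: str) -> int:
    """Parse 'YYYY-MM-DD' as midnight UTC and return a Unix timestamp."""
    parsed = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def timestamp_to_ymd(value: int) -> str:
    """Format a Unix timestamp as 'YYYY-MM-DD' in UTC."""
    return datetime.fromtimestamp(value, tz=timezone.utc).strftime("%Y-%m-%d")


created = ymd_to_timestamp("1970-01-02")  # one day after the Unix epoch
round_trip = timestamp_to_ymd(ymd_to_timestamp("2023-11-01"))
```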

Speaker: 00:13:11

And then there are a whole bunch of utility operations.

Speaker: 00:13:15

And you wouldn't think of these as the things you want to do to your data, but

Speaker: 00:13:19

they're the things that end up getting used in the middle of the process.

Speaker: 00:13:23

So flatten an array, combine several arrays into one, filter out empty

Speaker: 00:13:28

values, or apply a callback function.
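
A minimal Python sketch of those utility operations. The function names are hypothetical and only mirror the operations listed; they are not the names or signatures of Drupal's process plugins.

```python
from typing import Any, Iterable, List


def flatten(value: Iterable[Any]) -> List[Any]:
    """Recursively flatten nested lists and tuples into one flat list."""
    flat: List[Any] = []
    for item in value:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def merge(*arrays: Iterable[Any]) -> List[Any]:
    """Combine several arrays into one, preserving order."""
    merged: List[Any] = []
    for array in arrays:
        merged.extend(array)
    return merged


def skip_empty(values: Iterable[Any]) -> List[Any]:
    """Drop None, empty strings, and empty lists; keep 0 and False."""
    return [v for v in values if v not in (None, "", [])]


flat = flatten([1, [2, [3, 4]], (5,)])
combined = merge(["a"], ["b", "c"])
cleaned = skip_empty(["a", "", None, 0, "b"])
trimmed = [str.strip(v) for v in ["  x ", "y  "]]  # applying a callback to each value
```

None of these is interesting on its own; they are the glue between the steps that do the real work.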

Speaker: 00:13:31

So those I think are the most commonly used process plugins.

Speaker: 00:13:36

Mike, am I leaving anything out?

Speaker: 00:13:38

Yeah, I think that those are the key ones.

Speaker: 00:13:43

I'm looking now at the list of all the ones that are in core and

Speaker: 00:13:50

maybe we might want to highlight a few other interesting ones

Speaker: 00:13:54

here.

Speaker: 00:13:55

While you're looking at it, I just wanted to comment.

Speaker: 00:13:58

The callback one is an interesting one because it almost

Speaker: 00:14:01

lets you cheat a little bit.

Speaker: 00:14:03

If you need to introduce your custom logic, but you don't want to

Speaker: 00:14:09

create a plugin and go through all that,

Speaker: 00:14:13

you can always use the existing callback process plugin,

Speaker: 00:14:18

and then just create a function in PHP, which will be called.

Speaker: 00:14:23

Or use a basic PHP function.

Speaker: 00:14:30

You don't need to wrap trim in a plugin.

Speaker: 00:14:34

You simply use callback, specify trim as the callback, and boom, you've got it.

Speaker: 00:14:40

Maybe this is a good time to mention that our show notes include some

Speaker: 00:14:43

links to the documentation where we list all of the plugins that are in

Speaker: 00:14:47

core, and those will be available on our pages after we publish this talk.

Speaker: 00:14:58

One of the interesting ones is static map.

Speaker: 00:15:01

This is...

Speaker: 00:15:04

Basically, it's like translating enums. That is, if the source field contains a

Speaker: 00:15:14

finite list of distinct strings and those need to be different on the Drupal side,

Speaker: 00:15:21

you use a static map plugin, which says change this string to that string, and

Speaker: 00:15:28

that's very handy in a lot of cases.

Speaker: 00:15:31

Or if you're dealing with NFL team names and the Redskins are now the

Speaker: 00:15:35

Commanders, you can say that this finite list of names has changed and

Speaker: 00:15:40

anything else you pass through unchanged.
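
A minimal Python sketch of a static map with pass-through. The function and its `pass_through` and `default` parameters are hypothetical illustrations, not the configuration keys of Drupal's plugin.

```python
from typing import Any, Dict

# The finite list of values that changed; everything else is left alone.
TEAM_RENAMES: Dict[str, str] = {
    "Redskins": "Commanders",
    "Cleveland Indians": "Cleveland Guardians",
}


def static_map(value: Any, mapping: Dict[Any, Any],
               pass_through: bool = True, default: Any = None) -> Any:
    """Translate value via mapping; unmapped values pass through or get a default."""
    if value in mapping:
        return mapping[value]
    return value if pass_through else default


renamed = static_map("Redskins", TEAM_RENAMES)
unchanged = static_map("Patriots", TEAM_RENAMES)
strict = static_map("Patriots", TEAM_RENAMES, pass_through=False, default="unknown")
```

Pass-through suits the renaming case; for a true enum translation, the strict form is safer because an unexpected source value shows up as a visible default instead of slipping through.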

Speaker: 00:15:43

Yeah.

Speaker: 00:15:46

And the Cleveland Guardians in baseball.

Speaker: 00:15:51

I'm seeing sub process, and that is one of the more complicated ones that allows

Speaker: 00:15:58

you to do some really complicated things.

Speaker: 00:16:04

It applies when a field consists of a list, an array, and it allows you to basically

Speaker: 00:16:12

have a sub process pipeline for the pieces of this source field.

Speaker: 00:16:20

And this is very complicated and technical.

Speaker: 00:16:25

I'm not going to go through it right now because...

Speaker: 00:16:28

I always have to relearn it

Speaker: 00:16:29

when I need to use it.

Speaker: 00:16:32

Will a subprocess use the same set of plugins as the main migration?

Speaker: 00:16:38

With source and transform and all those things?

Speaker: 00:16:43

Oh, no.

Speaker: 00:16:44

The source, rather than being a row from your source plugin,

Speaker: 00:16:50

is the contents of the field, the extracted field.

Speaker: 00:16:56

So it's used on fields, which themselves have structure.

Speaker: 00:17:01

But it does use or have access to all the same process plugins that

Speaker: 00:17:06

the general transform stage has.

Speaker: 00:17:09

Which obviously is immensely powerful, then.
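
A minimal Python sketch of the sub-process idea: a field that itself has structure gets its own small pipeline, run once per item. The `sub_process` function, the field names, and the `FILE_ID_MAP` table are hypothetical, not Drupal's implementation.

```python
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]

# Hypothetical record of old file IDs -> new file IDs.
FILE_ID_MAP: Dict[int, int] = {10: 210, 11: 211}


def sub_process(items: List[Item],
                process: Dict[str, Callable[[Item], Any]]) -> List[Item]:
    """Run the same per-field process on each item; the item plays the role of the row."""
    return [{dest: step(item) for dest, step in process.items()} for item in items]


# One source row whose 'images' field is a list of structured values.
source_row = {
    "title": "Trip report",
    "images": [{"fid": 10, "caption": " Sunset "}, {"fid": 11, "caption": "Beach"}],
}

image_process = {
    "target_id": lambda item: FILE_ID_MAP[item["fid"]],  # translate the reference
    "alt": lambda item: item["caption"].strip(),         # ordinary string cleanup
}

images = sub_process(source_row["images"], image_process)
```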

Speaker: 00:17:14

It is.

Speaker: 00:17:14

It is.

Speaker: 00:17:14

In theory, instead of

Speaker: 00:17:17

migrating your taxonomy terms... that's a bad example, because we've got a

Speaker: 00:17:21

shortcut for that, but user accounts, in theory, instead of doing them

Speaker: 00:17:28

in a separate migration from your main content migration, you could

Speaker: 00:17:34

do them dynamically within a sub process within your content migration.

Speaker: 00:17:39

We do not recommend that. Like I said, the process plugins are very

Speaker: 00:17:44

powerful and you can cut yourself.

Speaker: 00:17:48

There is a plugin for copying files. If you're going

Speaker: 00:17:53

from one system to another, say an old version of Drupal to a new one,

Speaker: 00:17:58

you want your images, your videos, your documents to come across too.

Speaker: 00:18:04

And the file copy plugin is very flexible because it gives you

Speaker: 00:18:09

a few different options for doing that and for doing that performantly.

Speaker: 00:18:15

For example, it could simply copy it into Drupal's public files directory.

Speaker: 00:18:23

And it can keep track.

Speaker: 00:18:28

You can set a flag on it

Speaker: 00:18:30

so that if the file already exists at the destination, you don't overwrite it.

Speaker: 00:18:34

And that's great for your performance when you're rerunning migrations,

Speaker: 00:18:39

especially during development.
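
A minimal Python sketch of the skip-if-it-already-exists behavior. The `copy_file` function and its `overwrite` flag are hypothetical, not the real plugin's option names, and the temporary directory stands in for the old and new sites' file storage.

```python
import shutil
import tempfile
from pathlib import Path


def copy_file(source: Path, destination: Path, overwrite: bool = False) -> bool:
    """Copy source to destination; return True only if bytes were actually copied."""
    if destination.exists() and not overwrite:
        return False  # already there, e.g. from an earlier run of the migration
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "old-site" / "photo.jpg"
    src.parent.mkdir(parents=True)
    src.write_bytes(b"fake image bytes")
    dst = Path(tmp) / "new-site" / "files" / "photo.jpg"

    first_run = copy_file(src, dst)   # does the expensive copy
    second_run = copy_file(src, dst)  # cheap existence check only
```

On a rerun, every file costs one existence check instead of a full copy, which is where the time savings during development come from.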

Speaker: 00:18:41

You can also,

Speaker: 00:18:44

now I'm trying to remember the other, but of course you could use it to copy

Speaker: 00:18:51

directly into the files directory.

Speaker: 00:18:53

You could copy it to an S3 bucket

Speaker: 00:18:57

or some other storage service.

Speaker: 00:19:02

Or I think that I had a use case in the past where we needed to copy files from

Speaker: 00:19:09

another website into our local file system as part of the migration.

Speaker: 00:19:15

And I think that file copy was used for that as well, which

Speaker: 00:19:18

is obviously terribly slow.

Speaker: 00:19:23

And a little tidbit is that the file copy relies on the download

Speaker: 00:19:28

plugin if the source is remote.

Speaker: 00:19:31

And that uses Guzzle in a way that's slightly different from anywhere else

Speaker: 00:19:36

it's used in Drupal Core and caused some interesting test failures years ago.

Speaker: 00:19:42

Yes, sometimes you have to be clever to make things

Speaker: 00:19:46

work and work performantly.

Speaker: 00:19:50

Yes.

Speaker: 00:19:51

I also remember...

Speaker: 00:19:53

During one of the migrations I was working on, we used file copy to copy

Speaker: 00:19:59

files straight from NFS, probably to another site's public files

Speaker: 00:20:09

folder, and we were actually copying files, and that slowed the migration a lot.

Speaker: 00:20:15

And then we figured out that it's better to rsync before running the migration,

Speaker: 00:20:20

and then this check for whether the file exists kicks in and you don't need to

Speaker: 00:20:25

copy, you just find it there and that sped up the migration significantly.

Speaker: 00:20:32

But we're getting into performance considerations now, which is

Speaker: 00:20:35

another talk we will be doing in the future. What about contrib?

Speaker: 00:20:40

What kind of interesting process plugins can we find in contrib

Speaker: 00:20:46

that are not part of core?

Speaker: 00:20:49

Oh, so many.

Speaker: 00:20:51

So many, I need to jog my memory here and take a look.

Speaker: 00:20:59

So the Migrate Plus module is an add on to the core migration system, and

Speaker: 00:21:07

it has a number of interesting ones.

Speaker: 00:21:10

There are

Speaker: 00:21:11

several plugins for manipulating and scanning

Speaker: 00:21:20

a DOM, a document object model.

Speaker: 00:21:23

So you can scan your HTML or XML and find and extract the span with the

Speaker: 00:21:34

TextBold class that's underneath a P, for example, if you need to

Speaker: 00:21:41

manipulate that piece of your content.

Speaker: 00:21:47

There are entity lookup and entity generate, which make it easy

Speaker: 00:21:54

to find a matching entity.

Speaker: 00:21:58

That's not necessarily one that you migrated and that you can find via

Speaker: 00:22:02

the map tables that migration provides.

Speaker: 00:22:06

But if the system you're migrating into has, maybe, a

Speaker: 00:22:09

taxonomy there you want to hook up to,

Speaker: 00:22:12

you can use entity lookup

Speaker: 00:22:14

to find a matching term in that vocabulary and link to it.

Speaker: 00:22:20

And you can also use entity generate, which does the same thing, but also

Speaker: 00:22:24

if it doesn't find the matching term, it will create it for you.
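
A minimal Python sketch of the difference between looking up and generating. The in-memory `Vocabulary` class is hypothetical and stands in for the destination site's storage; it is not the Migrate Plus implementation.

```python
from typing import Dict, Optional


class Vocabulary:
    """A tiny in-memory stand-in for a taxonomy vocabulary on the destination site."""

    def __init__(self) -> None:
        self._terms: Dict[str, int] = {}  # term name -> term ID
        self._next_id = 1

    def lookup(self, name: str) -> Optional[int]:
        """Find an existing term by name; never creates anything."""
        return self._terms.get(name)

    def generate(self, name: str) -> int:
        """Find an existing term by name, creating it if it is missing."""
        existing = self.lookup(name)
        if existing is not None:
            return existing
        term_id = self._next_id
        self._next_id += 1
        self._terms[name] = term_id
        return term_id


tags = Vocabulary()
missing = tags.lookup("dessert")     # nothing there yet
created = tags.generate("dessert")   # creates the term
reused = tags.generate("dessert")    # finds the one just created
```

Both operations peek at the destination during the transform step, the same small departure from strict ETL that was mentioned earlier in the talk.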

Speaker: 00:22:30

Let's see, there's file blob.

Speaker: 00:22:32

If you've got file data in a database blob, you can convert

Speaker: 00:22:38

that to a real file with that.

Speaker: 00:22:42

File blob reminded me of the beginning of my career, which predates

Speaker: 00:22:46

Drupal, where I had experience with a proprietary CMS that was really into

Speaker: 00:22:51

storing all files in the database.

Speaker: 00:22:54

That was fun.

Speaker: 00:22:57

Yeah.

Speaker: 00:22:58

So those, and those are the ones that pop out to me immediately.

Speaker: 00:23:03

Besides Migrate Plus, which is a grab bag of several different plugins, there

Speaker: 00:23:10

are several other contributed modules

Speaker: 00:23:13

that have plugins of all sorts.

Speaker: 00:23:17

And before you go writing your own plugins, take a look through the contrib

Speaker: 00:23:22

modules that are available on drupal.org,

Speaker: 00:23:24

and you might find someone that's already solved your problem.

Speaker: 00:23:28

Maybe they've got a SOAP plugin.

Speaker: 00:23:31

Actually, I know they do because I wrote it, but whatever your scenario

Speaker: 00:23:37

is, assume you are not that unique

Speaker: 00:23:42

until you prove you are.

Speaker: 00:23:44

And Migrate has been around for years, like probably more

Speaker: 00:23:50

than a decade at this point.

Speaker: 00:23:52

And it's migrated many enterprise,

Speaker: 00:23:58

large-scale applications.

Speaker: 00:24:01

So I'd be almost convinced that if there is a use case, it has

Speaker: 00:24:07

probably already been done.

Speaker: 00:24:11

Yeah, it started out as a contributed module.

Speaker: 00:24:13

Mike and Moshe Weitzman developed it in Drupal 5 or Drupal 6?

Speaker: 00:24:20

6.

Speaker: 00:24:20

Okay.

Speaker: 00:24:20

It was 6.

Speaker: 00:24:22

I think we may have started trying it for 5 and we jumped ahead to 6 because 6 was

Speaker: 00:24:28

more conducive to what we were doing.

Speaker: 00:24:32

Yeah.

Speaker: 00:24:33

My first experience with it was the Drupal 7 version.

Speaker: 00:24:36

Yeah.

Speaker: 00:24:37

The first big project, the first client you would know, was The Economist,

Speaker: 00:24:44

Economist.com,

Speaker: 00:24:45

way back in the day.

Speaker: 00:24:49

Which I think also sponsored a lot of initial Migrate module work, right?

Speaker: 00:24:53

Yes.

Speaker: 00:24:54

Yes.

Speaker: 00:24:56

After Economist, it was Examiner.com,

Speaker: 00:24:59

which, all of them, sponsored

Speaker: 00:25:04

a lot of D7 work, right?

Speaker: 00:25:06

Yes, they sponsored most of our port to D7.

Speaker: 00:25:11

Martha Stewart Living was about that time too.

Speaker: 00:25:16

Speaking about history, do you have any anecdotes or any interesting

Speaker: 00:25:22

or unusual process related use cases that you've experienced in the past?

Speaker: 00:25:32

Oh, boy, they're all jumbled together.

Speaker: 00:25:38

One I don't want to remember is the time our client thought we had created

Speaker: 00:25:44

a major security breach because our development migration suddenly started

Speaker: 00:25:52

sending out emails to all their customers.

Speaker: 00:25:56

And this was the migration, and this is something to watch out for.

Speaker: 00:26:00

The migration system actually explicitly disables the mail system while

Speaker: 00:26:07

running, which we thought was safe, but what happened was a module was enabled

Speaker: 00:26:13

which, during entity creation (which happens during migration), queues emails

Speaker: 00:26:20

to be sent. It was fine for a while because this was a development system

Speaker: 00:26:26

that no one saw, but one little ping on port 88 to that system caused cron to run

Speaker: 00:26:38

(it was using the lazy cron, or whatever you call it), and boom, those emails

Speaker: 00:26:46

started going out and caused quite a stir.

Speaker: 00:26:51

So yes, this is something you do need to be careful about:

Speaker: 00:26:57

whatever effects the ultimate website might have beyond itself.

Speaker: 00:27:04

Be careful that you control them within your development and testing system.

Speaker: 00:27:12

Which is good advice in general, not just for...

Speaker: 00:27:14

Yes.

Speaker: 00:27:15

And this is where I find DDEV, which is one of the projects that we are

Speaker: 00:27:21

really excited about at Tag1 really useful because I believe that DDEV will

Speaker: 00:27:27

reconfigure your development environment to redirect emails into this...

Speaker: 00:27:34

MailHog.

Speaker: 00:27:34

MailHog.

Speaker: 00:27:37

It basically redirects everything in there and it stays just in memory even, I think,

Speaker: 00:27:44

so if you are inside DDEV with regards to mails, you can be pretty sure that

Speaker: 00:27:51

no matter what's going on, you're safe.

Speaker: 00:27:54

And it's handy too for testing your outgoing emails, testing

Speaker: 00:27:59

the formatting or whatever.

Speaker: 00:28:01

Yeah, exactly.

Speaker: 00:28:02

That's probably the use case it was created for.

Speaker: 00:28:04

Yeah.

Speaker: 00:28:05

But as a side effect, it also provides a layer

Speaker: 00:28:09

of security and peace of mind.

Speaker: 00:28:12

There are two points here.

Speaker: 00:28:13

I want to make sure we don't lose track of them.

Speaker: 00:28:16

The first is that while you're developing, no emails will be sent out.

Speaker: 00:28:20

But the second one is equally important.

Speaker: 00:28:22

You have to look at the emails that did get captured by MailHog

Speaker: 00:28:27

because those are the ones that will be sent out in real life when

Speaker: 00:28:30

you're on production and not local.

Speaker: 00:28:33

Yeah, make sure your tokens are being substituted, all that stuff.

Speaker: 00:28:38

Yes.

Speaker: 00:28:38

Benji, what about you?

Speaker: 00:28:39

Do you have any interesting stories from the past?

Speaker: 00:28:43

Yeah.

Speaker: 00:28:43

And, I'm really flattered that when Mike was going through the list of plugins in

Speaker: 00:28:48

Migrate Plus the first ones he singled out were the DOM processing plugins,

Speaker: 00:28:52

because that was one of my contributions.

Speaker: 00:28:55

And let me call out, this was on a project for Pegasystems, and I was

Speaker: 00:29:01

working for Isovera at the time with Marco Villegas, and we developed the

Speaker: 00:29:07

first DOM processing plugins, and both Isovera and Pega were supportive of

Speaker: 00:29:12

contributing that back to Migrate Plus. And I guess the original problem I

Speaker: 00:29:17

was trying to solve is that, as I said earlier, the node IDs were changing.

Speaker: 00:29:24

So if we have separate entity reference fields, Drupal could already handle that

Speaker: 00:29:29

just by using the migration lookup plugin.

Speaker: 00:29:31

And you could say that the next article used to be node 123;

Speaker: 00:29:35

in the migrated system,

Speaker: 00:29:37

it's node 456.

Speaker: 00:29:38

You can do that translation, but what if you have a text field and

Speaker: 00:29:42

inside that text field, there's an anchor link and the anchor href

Speaker: 00:29:48

goes to node/123.

Speaker: 00:29:51

How do you translate that to node/456?

Speaker: 00:29:54

And part of the answer was to use proper DOM processing.

Speaker: 00:29:59

And so, I guess I had the idea on an earlier project

Speaker: 00:30:04

and I didn't get to make it happen until it was actually needed here.

Speaker: 00:30:09

Everyone knows that you shouldn't be processing HTML with regular expressions.

Speaker: 00:30:15

But people do it anyway.

Speaker: 00:30:16

Yes.

Speaker: 00:30:19

People do it anyway because it's the tool they know, and it's convenient.

Speaker: 00:30:24

And so the first step was to introduce some process plugins to make it

Speaker: 00:30:29

easy to do proper DOM processing so that you have less overhead in

Speaker: 00:30:36

creating that DOM document object and the XPath object and so forth.

Speaker: 00:30:41

Once you've eliminated the overhead, it is often both simpler and more reliable to

Speaker: 00:30:49

do the proper HTML processing rather than to do things with regular expressions.
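
A minimal Python sketch of rewriting node links by walking the document tree and changing only anchor `href` attributes. It uses the standard library's `xml.etree.ElementTree` and therefore assumes a well-formed fragment, which real-world HTML often is not; the function name and `NODE_ID_MAP` table are hypothetical, and this is not the Migrate Plus DOM plugins.

```python
import xml.etree.ElementTree as ET
from typing import Dict

NODE_ID_MAP: Dict[int, int] = {123: 456}  # hypothetical old nid -> new nid
PREFIX = "/node/"


def rewrite_node_links(fragment: str, id_map: Dict[int, int]) -> str:
    """Rewrite <a href="/node/OLD"> to the new node ID, leaving everything else alone."""
    root = ET.fromstring(f"<div>{fragment}</div>")  # wrapper gives a single root
    for anchor in root.iter("a"):
        href = anchor.get("href", "")
        old_id = href[len(PREFIX):]
        if href.startswith(PREFIX) and old_id.isdecimal() and int(old_id) in id_map:
            anchor.set("href", f"{PREFIX}{id_map[int(old_id)]}")
    # Serialize the children only, dropping the temporary wrapper.
    children = "".join(ET.tostring(child, encoding="unicode") for child in root)
    return (root.text or "") + children


body = ('<p>See <a href="/node/123">the next article</a> '
        'or <a href="/node/999">this one</a>.</p>')
migrated_body = rewrite_node_links(body, NODE_ID_MAP)
```

Because only attributes on anchor elements are touched, text that merely looks like a link (inside a code sample, say) is left alone, which is the kind of reliability a pattern match over raw markup cannot easily give.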

Speaker: 00:30:57

In fact, if you look at the Search API module, I just ran into a case where

Speaker: 00:31:04

it's not only simpler and more reliable, it's also more performant to do DOM processing,

Speaker: 00:31:10

and there's an open issue on the Search API module that handles that.

Speaker: 00:31:16

So anyway, that was my original purpose for putting the DOM processing

Speaker: 00:31:21

plugins in on that project with Pega.

Speaker: 00:31:24

Since then, some other people have done work on that.

Speaker: 00:31:29

There's a contrib module that builds on the DOM processing plugins, and it

Speaker: 00:31:34

handles the case where you have the Media module on your Drupal 7

Speaker: 00:31:40

site and you want to migrate that to the core Media module in Drupal 10.

Speaker: 00:31:46

It understands the tokens that the Drupal 7 Media module used and

Speaker: 00:31:52

handles transforming your text fields.

Speaker: 00:31:55

I had a really complicated project where we not only had to migrate

Speaker: 00:32:05

the site from Drupal 7 to Drupal 8 (at that point, I think it was),

Speaker: 00:32:11

we also had to import some really complicated XML documents into the site.

Speaker: 00:32:18

And that project gave me a real appreciation for the expressive

Speaker: 00:32:23

power of XPath, because that was the only way to manage these

Speaker: 00:32:27

really complicated XML structures.
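
To give a feel for what XPath buys you, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The document structure (`manual`, `chapter`, `section`, the `status` attribute) is invented for illustration and is not the XML from the project described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; element and attribute names are assumptions.
DOC = """<manual>
  <chapter id="c1" status="published">
    <title>Install</title>
    <section><title>Requirements</title></section>
    <section><title>Download</title></section>
  </chapter>
  <chapter id="c2" status="draft">
    <title>Upgrade</title>
    <section><title>Backups</title></section>
  </chapter>
</manual>"""


def published_section_titles(root):
    """One path expression selects section titles of published chapters only."""
    path = "./chapter[@status='published']/section/title"
    return [title.text for title in root.findall(path)]


def chapter_rows(root):
    """Turn each chapter into a flat row, the shape a migration source wants."""
    return [
        {"id": chapter.get("id"), "title": chapter.findtext("title")}
        for chapter in root.findall("./chapter")
    ]


root = ET.fromstring(DOC)
print(published_section_titles(root))
print(chapter_rows(root))
```

The filtering and the descent through nested elements both live in the path expression, so there are no hand-written loops over child nodes.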

Speaker: 00:32:32

And luckily we already had the DOM processing plugins available.

Speaker: 00:32:37

Another complicated project I had was one where we had these HTML text fields, and

Speaker: 00:32:47

each text field had just some image tags.

Speaker: 00:32:52

So basic HTML markup, and we wanted to download the files from those image

Speaker: 00:32:58

tags and save them as files and create media entities out of them and then

Speaker: 00:33:04

just insert the media references

Speaker: 00:33:07

into the text field.
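
The text-rewriting half of that idea can be sketched as follows, in Python rather than the custom PHP plugin used on the project. It assumes the files have already been downloaded and the media entities created, so that a lookup from image URL to media UUID exists; `MediaEmbedRewriter`, `embed_media`, and the sample UUID are hypothetical, and the `drupal-media` embed markup is an assumption about what the destination text format expects.

```python
from html import escape
from html.parser import HTMLParser


class MediaEmbedRewriter(HTMLParser):
    """Re-emits HTML, swapping known <img> tags for media embed tags."""

    def __init__(self, uuid_for_src):
        # Keep entities such as &amp; untouched so they round-trip.
        super().__init__(convert_charrefs=False)
        self.uuid_for_src = uuid_for_src
        self.out = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "img" and src in self.uuid_for_src:
            uuid = escape(self.uuid_for_src[src], quote=True)
            self.out.append(
                '<drupal-media data-entity-type="media" '
                f'data-entity-uuid="{uuid}"></drupal-media>'
            )
        else:
            # Unknown images and all other tags pass through unchanged.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: emit once, without a separate end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def embed_media(html, uuid_for_src):
    rewriter = MediaEmbedRewriter(uuid_for_src)
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)


print(embed_media(
    '<p>Before</p><img src="/files/a.png" alt="A"><p>After &amp; done</p>',
    {"/files/a.png": "1111-aaaa"},  # made-up UUID
))
```

This sketch drops HTML comments and does not download anything; the side effect of creating files and media entities during the transform is exactly the part that steps outside the ETL paradigm.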

Speaker: 00:33:09

I did that with a custom PHP plugin.

Speaker: 00:33:13

And I do want to point out that this is a thing where you can

Speaker: 00:33:17

shoot yourself in the foot.

Speaker: 00:33:18

It's not following the ETL paradigm.

Speaker: 00:33:21

You don't have a row creating each one of those media items.

Speaker: 00:33:26

And it does have certain disadvantages because it breaks the ETL paradigm,

Speaker: 00:33:31

but it is a practical way to handle that sort of situation.

Speaker: 00:33:38

Another weird one I had was

Speaker: 00:33:41

a single HTML page. The source site was Drupal 7, and you would think, looking

Speaker: 00:33:49

at this page, oh, this page is a view.

Speaker: 00:33:51

It's listing the person content type.

Speaker: 00:33:55

But in fact, it was just a basic page and all the markup was

Speaker: 00:33:59

just there in the body field.

Speaker: 00:34:01

And we wanted to pick it apart and create person nodes and then

Speaker: 00:34:05

create a view of the person nodes.

Speaker: 00:34:07

And so luckily the markup was consistent.

Speaker: 00:34:10

It always started with an H3 tag.

Speaker: 00:34:12

It had a title and that was immediately followed by an image tag.

Speaker: 00:34:17

So there was that consistency that I could take advantage of.

Speaker: 00:34:21

I extracted the title into a text field.

Speaker: 00:34:24

I extracted the image, created a file media entity, and then just

Speaker: 00:34:33

stripped those from the body field,

Speaker: 00:34:36

and let an actual view in Drupal (I think it was Drupal 9 at that point)

Speaker: 00:34:42

put the pieces back together to make something like the original source site.
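
A minimal Python sketch of that kind of splitting, assuming the pattern described above (each person starts with an H3 title that is followed by an image). The class `PersonExtractor`, the function `extract_people`, and the sample markup are hypothetical and are not the code used on that project.

```python
from html.parser import HTMLParser


class PersonExtractor(HTMLParser):
    """Splits a body field into one record per <h3> heading."""

    def __init__(self):
        super().__init__()
        self.people = []
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            # Every H3 starts a new person record.
            self._in_h3 = True
            self.people.append({"title": "", "image": None, "body": ""})
        elif tag == "img" and self.people and self.people[-1]["image"] is None:
            # The first image after the heading becomes the person's image.
            self.people[-1]["image"] = dict(attrs).get("src")
        elif self.people:
            # Everything else stays in the remaining body markup.
            self.people[-1]["body"] += self.get_starttag_text()

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif self.people and tag != "img":
            self.people[-1]["body"] += f"</{tag}>"

    def handle_data(self, data):
        if not self.people:
            return  # ignore anything before the first heading
        key = "title" if self._in_h3 else "body"
        self.people[-1][key] += data


def extract_people(html):
    parser = PersonExtractor()
    parser.feed(html)
    parser.close()
    return [
        {"title": p["title"].strip(), "image": p["image"], "body": p["body"].strip()}
        for p in parser.people
    ]


BODY = (
    '<h3>Ada Lovelace</h3><img src="/files/ada.jpg"><p>Mathematician.</p>'
    '<h3>Alan Turing</h3><img src="/files/alan.jpg"><p>Computer scientist.</p>'
)
print(extract_people(BODY))
```

Each record would then feed one person node, with the title and image stripped out of what remains in the body.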

Speaker: 00:34:47

And the last one that I noted down was again for Pega and we

Speaker: 00:34:54

were importing documentation from an external XML-based system.

Speaker: 00:35:01

So this wasn't a site migration, this was a recurring migration: someone

Speaker: 00:35:05

was writing the documentation in this external system, and we had to import

Speaker: 00:35:10

it into the Drupal site and make it look like it fit the rest of the site.

Speaker: 00:35:17

And that's where I did the sort of thing I talked about before, where you look for

Speaker: 00:35:21

some consistent pattern of CSS classes and say, okay, we're going to replace

Speaker: 00:35:28

that with something more semantic.

Speaker: 00:35:31

And again, this used the DOM plugins.

Speaker: 00:35:35

And it also peeked at the current database, the destination database, to

Speaker: 00:35:40

see how the editor module was configured so that we could pick and choose

Speaker: 00:35:46

the CSS classes that the current site editors would naturally

Speaker: 00:35:52

be adding through the user interface.

Speaker: 00:35:55

And we added those same CSS classes programmatically through the migration.
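
The class-replacement step might look roughly like the Python sketch below. It assumes the incoming fragment is well-formed XML, and it hardcodes both the class mapping and the set of allowed classes; on the real project the allowed classes were read from the destination site's editor configuration. All names and class values here are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source-system classes to destination classes.
CLASS_MAP = {"warn-box": "alert-warning", "tip-box": "alert-info"}

# Hypothetical stand-in for the classes the destination editor config permits.
ALLOWED = {"alert-warning"}


def remap_classes(fragment, class_map, allowed):
    """Replace source classes with mapped ones, keeping only allowed classes."""
    root = ET.fromstring(fragment)
    for element in root.iter():
        classes = element.get("class")
        if classes is None:
            continue
        kept = []
        for name in classes.split():
            mapped = class_map.get(name)
            if mapped in allowed and mapped not in kept:
                kept.append(mapped)
        if kept:
            element.set("class", " ".join(kept))
        else:
            # Nothing survived the mapping, so drop the attribute entirely.
            del element.attrib["class"]
    return ET.tostring(root, encoding="unicode")


SOURCE = (
    '<div><p class="warn-box x1">Back up first.</p>'
    '<p class="tip-box">Use Drush.</p></div>'
)
print(remap_classes(SOURCE, CLASS_MAP, ALLOWED))
```

Filtering against the allowed set keeps migrated content consistent with what editors can produce through the user interface.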

Speaker: 00:36:01

So those are some of the more complicated cases I've had in the transform stage.

Speaker: 00:36:08

Very nice.

Speaker: 00:36:10

I guess that this brings us to the end of today's episode, unless

Speaker: 00:36:15

you have anything more to add.

Speaker: 00:36:19

I'm good.

Speaker: 00:36:22

We have some great talks coming up.

Speaker: 00:36:25

Our goal is to put out one per week over the next few months to support the

Speaker: 00:36:30

community in the migration process.

Speaker: 00:36:33

Performance is something we care deeply about at Tag1, and we did touch on

Speaker: 00:36:37

performance in today's episode a little bit because it applies to migrations.

Speaker: 00:36:43

When you're handling really large data sets, a full data migration

Speaker: 00:36:47

can take 12 hours or even days.

Speaker: 00:36:52

We'll do a handful of talks on this topic, including how to profile

Speaker: 00:36:56

and tune a migration, and a talk about incremental migrations.

Speaker: 00:37:02

Every project owner wants their migration to be a success.

Speaker: 00:37:06

So we will dedicate an episode to discuss the most important

Speaker: 00:37:08

factors for a successful Drupal 7 to Drupal 10 migration.

Speaker: 00:37:14

Other topics include porting custom code, the future of migrate tooling,

Speaker: 00:37:19

how to port a theme, and so much more.

Speaker: 00:37:23

We hope that you'll tune in and enjoy our upcoming team talks.

Speaker: 00:37:30

A huge thank you to the Tag1 team, Benji Fisher and Mike Ryan.

Speaker: 00:37:34

Thank you for joining me.

Speaker: 00:37:36

Make sure that you check out the other segments in this series.

Speaker: 00:37:41

There will be links to them in the show notes, along with links to the

Speaker: 00:37:47

modules and documentation and other things that we mentioned today.

Speaker: 00:37:51

If you like this talk, please remember to upvote, subscribe and share it.

Speaker: 00:37:56

Check out our past talks at Tag1.com/ttt.

Speaker: 00:38:01

That's three T's for Tag1 Team Talks.

Speaker: 00:38:04

As always, we'd love your feedback and any topic suggestions.

Speaker: 00:38:10

You can always write to us at ttt@tag1.com.

Speaker: 00:38:15

Again, that's three T's for Tag1 Team Talks.

Speaker: 00:38:18

One more time, big thank you to our guests and everybody who tuned in.

Speaker: 00:38:23

Thanks for joining us.

Speaker: 00:38:25

Thanks.

Speaker: 00:38:25

Bye.
