#254 Easing Into a Data Mesh Journey - Ocean Spray's Pre-Data Mesh Preparations - Interview w/ Paul Cavacas
Episode 254 • 24th September 2023 • Data Mesh Radio • Data as a Product Podcast Network
Duration: 01:06:45


Shownotes

Please Rate and Review us on your podcast app of choice!

Get involved with Data Mesh Understanding's free community roundtables and introductions: https://landing.datameshunderstanding.com/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

Episode list and links to all available episode transcripts here.

Provided as a free resource by Data Mesh Understanding. Get in touch with Scott on LinkedIn if you want to chat data mesh.

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Paul's LinkedIn: https://www.linkedin.com/in/paul-cavacas-32a36158/

In this episode, Scott interviewed Paul Cavacas, Senior Manager of Data and Analytics at Ocean Spray.

Quick note before jumping in: Ocean Spray is just at the beginning of their journey - in their pre-implementation phase - and there hasn't been a lot of internal resistance yet. That might make a few people jealous 😅 but there are a lot of interesting things Paul is doing to ensure that they are ready to decentralize what makes sense to decentralize at the right time. There is a lot to be gained from not rushing in. Also, apologies that Scott's audio is a bit weird - he had yet to build his makeshift sound studio in the Netherlands.

Some key takeaways/thoughts from Paul's point of view:

  1. As many have stated, asking the data team - especially one person - to become an expert on many different areas of the business just to complete data work for a project won't scale. At best it creates incredibly concentrated tribal knowledge. Use this point to drive buy-in for decentralizing data ownership.
  2. Having someone who knows your internal IT application landscape well can really help in choosing which initial teams to start with for a data mesh implementation. That person already has good relationships and a deep understanding of your operational plane, so you can pick good problem areas and partners.
  3. Similarly, build your early buy-in momentum with people who are more likely to be excited to participate in a data mesh implementation. You don't need to convince the most difficult teams to participate at the start.
  4. Central ownership isn't necessarily bad until things stop scaling. Having that central ownership means less flexibility and agility to react quickly to market changes or opportunities but also less cognitive load on teams. It's a trade-off.
  5. Many of your domains really won't understand data ownership. Find ways to slowly transition them into understanding what ownership entails, e.g. starting with documentation and SLOs. What data are they sharing and why does it matter? This isn't going to happen overnight.
  6. If you aren't building overly complex data products, look to find people within the domain that are somewhat technically savvy - especially if they want to advance their careers - and start to prepare them for data ownership. Those might be your data product developers or data product managers. Scott note: Brian McMillan talked about a plan to do that in episode #26.
  7. ?Controversial?: Even if you aren't looking to move fast with your data mesh implementation, look to put at least a proto platform in place so people can test out ideas if they want to move more quickly.
  8. Setting up your testing and data contract framework will probably be a big challenge. Talk to people about what they need and iterate towards it. It will be difficult to figure out upfront and you'll likely begin with something sub-optimal. Be prepared and go forward anyway :)
  9. Focus more on finding the "right" partners - those willing to really partner and learn and try - rather than the "right" domain that has the most valuable data or use case. This comes up in many episodes.
  10. If you are looking for the right partners, there are good signals to watch for, such as who presents their internal results with more advanced data graphics - they are likely a domain that really cares about the data and wants to dig in more.
  11. Consider how much need there is to decentralize certain aspects and whether there is enough capacity and/or work for a domain. Those might sound like the same question, but if there isn't a ton of data work, it's a lot of effort for a domain to learn to own data for such a small amount of it. Scott note: Yushin Son at JPMC mentioned in panel episode #233 that they are creating a central team for small domains that can't justify a full data product team.
  12. When creating a satisfaction scoring system for data products, consider what the goal of the satisfaction score is and how long that feedback stays useful. When a data product is in beta, harsh but useful scores are valuable, but a 2 on data quality before v1.0 might not be relevant once it is actually a released product. It's okay to wipe past scores clean.
  13. ?Controversial?: When looking at your first data products, you don't want to lift and shift existing data assets to your platform. But, you can go after existing use cases as your early products if you are providing the same thing in a better/easier format.
  14. Don't let perfect get in the way of good/progress. It's okay to put in solutions that won't work in the long run as long as you acknowledge they will need to be replaced and give yourself the room to replace them. It's taking on technical debt through a conscious choice and not locking yourself in.
  15. ?Controversial?: When you are getting started, make sure to work with a few different areas to prevent yourself from delivering a solution, even at the mesh level, that is overly targeted to one domain's needs.


Paul started off with a bit about why they are headed down the data mesh path. For a large internal project, Paul had to become an expert on so many aspects of the company, and that's just not scalable in the long term - or if he's on vacation. So he's started to decentralize the data capabilities - slowly - as teams understand what data they will likely need to own in the long run. And he's not playing data ownership 'hot potato'; he's making sure they are prepared in the right ways.


At Ocean Spray, Paul shared that until recently everything tech - including data - was very centrally owned. In some areas, the IT team possibly knew more about the business processes than even the business people in those domains. So the company is going through all of their software and applications to decide how that should look in the future. Data mesh plays well into that rethink because central ownership scales until it doesn't and limits flexibility.


As there isn't a rushed timeline, Paul has been able to put together a complete idea of what the data mesh roadmap will look like. But he also understands that it could look completely different as he learns more and starts actually implementing different aspects. There are some existing data sets/assets out there that could pretty easily become data products in the right environment, so that is where they are targeting first. They are working with the teams to transfer some part of the ownership, especially around documentation of use cases and SLOs. The central team is pairing with those domains to take existing data assets, decompose them into data products, and help people get on the path to real ownership.
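
To make that documentation-and-SLO starting point concrete, here is a minimal sketch (in Python, purely illustrative - the fields, team names, and thresholds are all assumptions, not Ocean Spray's actual tooling) of a machine-readable descriptor a domain might fill in as a first step toward ownership:

```python
from dataclasses import dataclass

@dataclass
class DataProductDescriptor:
    """Hypothetical descriptor a domain fills in as a first ownership step."""
    name: str
    owner_team: str               # the domain taking on ownership
    use_cases: list[str]          # documented consumer use cases
    freshness_slo_hours: int      # e.g. data should be no older than N hours
    completeness_slo_pct: float   # e.g. at least this % of expected rows present

# Example entry - all values are invented for illustration
sales_orders = DataProductDescriptor(
    name="sales_orders",
    owner_team="sales-ops",
    use_cases=["weekly revenue reporting", "demand forecasting inputs"],
    freshness_slo_hours=24,
    completeness_slo_pct=99.5,
)
print(sales_orders.owner_team)  # -> sales-ops
```

Even a lightweight artifact like this forces the "what data are they sharing and why does it matter?" conversation without requiring the domain to run any pipelines yet.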


Paul recommends what Brian McMillan talked about in depth in episode #26: finding people within domains that are at least somewhat tech savvy and want to advance their careers. Work with them to get them more and more up to speed. Ownership is not something that gets transferred in a day - treat it with more respect than that. So that's finding receptive people inside the receptive domains. Yes, it won't always be easy but why make the buy-in complicated at the start if you don't need to?


Right now, Paul is building out some of the technical underpinnings of the platform they plan to build. If there are teams that want to move more quickly, they can start to test things out now, as long as those teams understand things aren't fully automated and they may have to change what they build once the company fully moves to data products. One big piece he is anticipating is the need for testing and data contract mechanisms. But exactly how to do that is still a challenge and will be learned along the way. He's anticipating a workable but not perfect solution to start. Build to useful and then improve.
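
For a sense of what a first-pass data contract check could look like, here is a minimal sketch assuming a contract expressed as field names, expected types, and nullability. The field names and rules are invented for illustration - this is not Paul's actual framework:

```python
# Minimal contract: field name, expected type, and nullability.
# All field names and rules are invented for illustration.
CONTRACT = {
    "order_id": {"type": str, "nullable": False},
    "quantity": {"type": int, "nullable": False},
    "ship_date": {"type": str, "nullable": True},
}

def validate(rows: list[dict]) -> list[str]:
    """Return human-readable contract violations for a batch of rows."""
    errors = []
    for i, row in enumerate(rows):
        for field, spec in CONTRACT.items():
            value = row.get(field)
            if value is None:
                if not spec["nullable"]:
                    errors.append(f"row {i}: {field} is required")
            elif not isinstance(value, spec["type"]):
                errors.append(f"row {i}: {field} should be {spec['type'].__name__}")
    return errors

print(validate([{"order_id": "A1", "quantity": "3"}]))
# -> ['row 0: quantity should be int']
```

Something this simple is exactly the "workable but not perfect" starting point - it catches the obvious breakages while the real framework gets figured out along the way.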


Paul circled back on the idea of finding the right partners over the right use cases/domains. Having engaged and excited partners - ones who know partnering with you can level up their own data capabilities and drive value for them too - will make your early journey far easier than going for the most "valuable" data. You are also likely to get better feedback because they are bought in to collaborating with you! To find those partners, potentially look at how teams present their results internally. If they are presenting with lots of advanced figures and almost a flair around data, that is a great sign.


How much data ownership/work gets decentralized and when is a key remaining question for Paul. He's aware that he'll have to test what works and iterate as he learns but there are plenty of domains that are too small to justify them learning a ton about how to own data when there just isn't that much data/data work to deal with. There will be a shared ownership model between the central team and the domains. Scott note: this works up to a certain scale and in certain types of organizations. Shared ownership in a very large organization rarely works that well for all that long - too much political infighting and challenges but it's an interesting pattern for smaller orgs that seems to be working well.


Paul's plan for assessing the quality of data products is to create a rubric scoring system - asking people to rate them across multiple dimensions like usability, data quality, SLA compliance, etc. The scores, or how they are measured, may change over time. At the start of a data product's life, when it's still in beta, those scores can be invaluable for iterating towards value, but consider throwing the historical scores out once it hits v1.0. Feedback has a useful shelf life depending on what you are trying to achieve, and bad historical scores could hinder the success of a data product that is now very high quality and valuable.
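
As a rough sketch of how such a rubric could be computed - the dimensions, weights, and 1-5 scale are assumptions, not Paul's actual rubric - note how wiping the beta-era ratings at v1.0 changes the score:

```python
from statistics import mean

# Hypothetical rubric dimensions and weights - not Paul's actual rubric.
WEIGHTS = {"usability": 0.3, "data_quality": 0.5, "sla_compliance": 0.2}

def satisfaction_score(ratings: list[dict]) -> float:
    """Weighted average of 1-5 ratings across the rubric dimensions."""
    return round(
        sum(w * mean(r[dim] for r in ratings) for dim, w in WEIGHTS.items()), 2
    )

# Harsh-but-useful beta feedback drives iteration...
beta_ratings = [
    {"usability": 2, "data_quality": 2, "sla_compliance": 3},
    {"usability": 3, "data_quality": 2, "sla_compliance": 4},
]
print(satisfaction_score(beta_ratings))  # -> 2.45

# ...but at the v1.0 release, wipe the slate and score only new ratings.
v1_ratings = [{"usability": 4, "data_quality": 5, "sla_compliance": 5}]
print(satisfaction_score(v1_ratings))    # -> 4.7
```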


For Ocean Spray, their first few data products are going to be source-aligned, combining a lot of important sales information. That way, people who want raw data can still get at it, and then more and more views/data products can be built on top of those for other users. There is still the scalable/productized underlying production of the raw data, plus more fit-for-purpose outputs for the different users.


Paul is not letting perfect get in the way of progress. Data contracts have to get to a place where we aren't locked onto schemas as something that can never change. But no one has really come out with a better solution yet, so that's what he's doing to start. It's better than nothing so go with it while you figure out better ways.
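
One common way to avoid being locked onto frozen schemas - offered here as a generic pattern, not something Paul specified - is to allow additive contract changes while flagging removals and type changes as breaking. A minimal sketch, with invented field names:

```python
# Compare two contract versions: additive changes pass, while removals
# and type changes are flagged as breaking. Field names are illustrative.
def breaking_changes(old: dict, new: dict) -> list[str]:
    issues = []
    for field, ftype in old.items():
        if field not in new:
            issues.append(f"removed field: {field}")
        elif new[field] != ftype:
            issues.append(f"type change: {field} {ftype} -> {new[field]}")
    return issues  # fields only in `new` are additive and allowed

v1 = {"order_id": "string", "quantity": "int"}
v2 = {"order_id": "string", "quantity": "float", "region": "string"}
print(breaking_changes(v1, v2))  # -> ['type change: quantity int -> float']
```

A check like this lets schemas evolve (new fields land freely) while still protecting consumers from silent breakage - better than nothing while the industry figures out better ways.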


Paul finished with a bit of advice around working with a few domains at the start of your journey. That way, you can take the learnings and understand the needs from multiple domains to abstract to a better solution for the organization rather than one overly tied to one domain's needs. Scott note: people seem pretty 50/50 split on working with one domain or 2-3 at the start of your journey. It's an interesting question.


Other unique factors of Ocean Spray:

The corporate structure is a co-op of growers so there isn't some massive pressure to grow at all costs.


Domains have been able to get access to other domains' data relatively easily for a long, long time. It hasn't been cleaned and prepared for them but there is an existing culture of sharing.


They are moving more and more to 3rd party applications rather than custom-built, which means data isn't necessarily in an easy-to-consume format by default. (maybe not all that unique?)


Because many domains are quite small, the central team will likely still own most if not all of the data work for those domains.


Learn more about Data Mesh Understanding: https://datameshunderstanding.com/about

Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/


All music used in this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
