Data Mesh Radio Patreon - get access to interviews well before they are released
Episode list and links to all available episode transcripts (most interviews from #32 on) here
Provided as a free resource by DataStax AstraDB; George Trujillo's contact info: email (firstname.lastname@example.org) and LinkedIn
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Debra's LinkedIn: https://www.linkedin.com/in/privacyguru/
Debra's Shifting Privacy Left Podcast: https://shiftingprivacyleft.com/
Katharine's LinkedIn: https://www.linkedin.com/in/katharinejarmul/
Katharine's book: https://www.oreilly.com/library/view/practical-data-privacy/9781098129453/
Samia's LinkedIn: https://www.linkedin.com/in/samia-rahman-b7b65216/
Quick acronyms to know: PETs - privacy enhancing technologies; SMEs - subject matter experts
Scott note/warning: there is some nerding out about how awesome it could be if some advanced privacy approaches and PETs were implemented at a broad scale across the industry to protect individuals' privacy. It's pretty early days, so be careful about getting your hopes up :)
In this episode, guest host Debra Farber, privacy expert and host of the Shifting Privacy Left podcast, facilitated a discussion with Katharine Jarmul, the author of the upcoming book Practical Data Privacy and Principal Data Scientist at Thoughtworks (guest of episode #157), and Samia Rahman, Director of Data and AI Strategy and Architecture at life sciences company Seagen (guest of episode #67).
Scott note: given this is a newer area, I wanted to share my takeaways rather than trying to reflect the nuance of the panelists' views. This will be the standard for panels going forward.
Scott's Top Takeaways:
- Privacy has been dominated by risk compliance historically but it's starting to move past the defensive governance aspects. Privacy has mostly been at the tail-end of the development cycle across systems and data but is starting to shift left across the board much like aspects of data mesh.
- Regarding data mesh, given how many additional aspects of development we are asking domains to own, is it fair to ask them to own privacy as well? How can we train people to understand when to use privacy enhancing technology and then make it easy to implement those decisions, much like other aspects of the self-serve platform?
- Privacy tech is emerging and maturing at a very significant rate. What was once a pipe dream or was prohibitively expensive is much closer to being available to the masses. Much like data mesh in general couldn't really have been done before cloud-native tooling/technologies started to mature, privacy is in a similar wave forward. If you want to hear more about specific tech, there's some talk in this episode and Katharine's interview (episode #157).
- With the explosion in upcoming privacy-focused legislation across the world - much of which is at least slightly different from each other - we will see a large increase in the need for organizations to do privacy well/better. Shifting left is really the only way to do this scalably or we'll potentially see organizations _stop_ doing a lot of currently valuable data work because the cost of privacy and risk compliance becomes prohibitive. Upcoming legislation may be the thing pushing privacy forward more than anything else.
- Privacy is extra crucial when dealing with data leaving the organization, whether in partnership or for those selling their data on a marketplace. Mozhgan Tavakolifard, PhD (episode #154) talked about this: many companies are opting to merely package and sell insights because they can't track usage deep into those partner or data purchaser systems. It's a bit like what Zhamak discussed in Zhamak's Corner 18 (episode #195): right now, once data goes into the data science area of an organization, all governance and visibility gets hazy at best. Will cross-org data mesh help address that? Probably, but it's 5+ years away.
- Paraphrasing Debra: At the end of the day, privacy is not about compliance. Privacy is about respecting the humans behind the data, not just the data itself. Protecting the data itself is about risk to the organization; that's compliance. We need to encourage a mindset of "how do I respect these humans' choices and their desires in the context of collecting this data?" That's essential.
- If you don't do privacy well, there are risks to the company of course. But a big one is that people will still look for - and usually find - ways to get access to sensitive data. People will seek out the value. If you make data easily accessible with the right privacy levels, you can unlock many high-value new use cases in compliant and low-risk ways. Organizations should start to look at the rewards of doing privacy well, not only the risks of doing it poorly.
Other Important Takeaways (many touch on similar points from different aspects):
- New ways of doing privacy are going to mean "measurable, quantifiable, verifiable, and auditable tools and capabilities."
- We need to think of privacy like any other tooling in the development lifecycle. It's about providing abstractions so the domain experts can make the easy/right calls and having a central support structure for when things are more tricky.
- Privacy isn't just about the data; it's not simply a metadata-like concept where people try to add privacy back to data at rest. Many of the important aspects of privacy are about how the data flows through systems, and about privacy in those flows and in each of the systems, not just the place where the data ends up stored.
- In data mesh, the self-service platform will need - currently needs? - to provide privacy-as-code capabilities so people can easily build data products with privacy built in instead of added at the end of the process. We can't expose the tech, that's far too complex. How do we provide good abstractions to make this easy and thus scalable?
- We need an ability to almost have a privacy capability as an ingest mechanism - point at a data source and say "we need this anonymized" and it's not a super custom build by the producers. We're just at the start of developing those types of capabilities, but we need to make it so it's not all on the producer and consumers can consume with privacy on demand.
- We need policies as code or other easily digestible forms of policies - and compliance - and need to train our people well on what privacy means, why it's important, when to apply, etc.
- We need tooling to help with federated privacy because otherwise, there is too much privacy context/knowledge AND technology to learn and it won't be scalable. It appears there are some tools emerging but it's still seemingly early days.
- Anonymization is often pretty easy to overcome if you just add additional datasets. This is especially a risk in sharing data with other organizations. Anonymization isn't a wand you wave and all your privacy risks get taken care of.
- How can we still derive value from anonymized datasets? It's often much harder, so will companies do the ethical aspects of privacy or only the required ones? We need better, easier PETs to make it easier to still extract value from anonymized data.
- How do we balance enough privacy training and not hit information overload? It's hard to get people to learn what's necessary because privacy is such a big topic. We need global and domain-level policies that can again be actually digestible.
- Can we measure time to compliance, time to privacy, time to 'doing the right thing' ethically? That would be best to understand where we need to improve but we're probably just at the start of that. This is an interesting fitness function area.
- Subject matter experts (SMEs) have so much specific knowledge that you need to leverage them to discover privacy risks and privacy rewards too. Much like any aspect of governance, trying to have the central team make decisions just isn't scaling so we need to make the people in the domain capable enough to handle privacy.
- Privacy rewards: in many organizations, there are very high value sets of data that cannot be leveraged for specific use cases due to privacy and other compliance restrictions. Getting to a place where we can easily leverage that high risk but high value data will potentially unlock large amounts of business value.
- There are lots of instances of teams finding those high value data sets and using shadow IT to get at them. If that's the only way people can get access, many will completely skirt any compliance and privacy. So getting to a place where they have access, but it's according to policy and tracked, is crucial to lower organizational risk, both compliance-wise and ethics-wise.
- From Katharine: "But if you build easier ways to get access and safer, more responsible, more ethical ways to get access, then you have a win-win situation and people are not going to find shortcuts."
- Companies are starting to loosen the shackles on data and focus on maintaining privacy but also enabling innovation around privacy-sensitive data. That's a great mindset but there are still many questions on how to do that specifically.
- Too many, especially in blockchain, conflate privacy and confidentiality. Keeping something confidential is a security aim. So if you focus only on confidentiality, you can't actually use the data you have - no one is allowed access, it's on lockdown.
- We need to get far better at risk modeling for privacy. What are the potential harms to the humans? We need to move beyond thinking only about what data someone might get if there is a breach. We can free up data for far more uses if we do this right, but ethics around data usage is just not a common thought. We need to train people to think ethically and about potential harm.
- There are multiple issues with anonymized data. Are you taking away the utility? Are you fooling yourself into thinking it can't be de-anonymized? De-anonymization is a pretty common outcome when additional data sets are added - especially a risk if sharing data externally. Don't treat anonymization as your hammer and everything else as a nail.
- We need to teach developers about differential privacy, which is about "bounding the probability of someone learning" a specific thing. Differential privacy "got a bad reputation" but we can add noise and maintain accuracy now. It is the "gold standard for anonymization".
- Healthcare patient data is one of the biggest challenges in privacy because you want to maximize the efficacy of care but also maximize privacy. And then how do companies take the data of the individual to extrapolate further to see broader trends?
- We need to get people upskilled so they can understand when to transform data in privacy preserving ways - and then the self-serve platform needs to make it easy for them to do that. But we don't have great industry-wide understanding on how to do either of those that well yet.
- Self-sovereign identity, while very interesting, is probably a long way away from being widely adopted. There needs to be a lot of industry collaboration and agreement and it's not really a big benefit in many areas based on the legal requirements versus cost. It would be great for company-to-company interoperability with privacy but who will build it? The 3 panelists were very excited about it though :)
- Privacy and data sovereignty are going to be intermingled in interesting ways in data mesh. Querying data where it is instead of piping it all over the world* will help maintain privacy and comply with laws - many countries don't allow data to be exported as is.
* See Zhamak's Corner 13 (episode #173), which covers some of what querying data where it is means. It's not necessarily about source systems, but it does mean not moving data without necessity.
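
To make the differential privacy takeaway above a bit more concrete, here is a minimal sketch of one common way to "add noise" to a query result: the Laplace mechanism applied to a simple count. This was not shown in the episode; the function names (`laplace_noise`, `private_count`), the made-up patient records, and the `epsilon` value are all assumptions for illustration only, and it uses nothing beyond the Python standard library.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.

    Hypothetical helper for illustration; not an API from any library.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person changes a count by at most 1 (sensitivity 1),
    # so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Made-up data: 100 patients with ages 0 through 99.
    patients = [{"age": age} for age in range(100)]
    noisy = private_count(patients, lambda p: p["age"] >= 65, epsilon=0.5)
    print(f"Noisy count of patients aged 65+: {noisy:.1f}")
```

A smaller `epsilon` means more noise and a tighter bound on what anyone can learn about a single individual; a larger one means a more accurate answer with a weaker guarantee. Each query answered this way also uses up some of the overall privacy budget, and naive floating-point noise sampling like this has known weaknesses, so treat this as a teaching sketch and reach for a vetted differential privacy library for anything real.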
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf
Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under "add payment"): AstraDB