Data Mesh Radio Patreon - get access to interviews well before they are released
Episode list and links to all available episode transcripts (most interviews from #32 on) here
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Katharine's LinkedIn: https://www.linkedin.com/in/katharinejarmul/
Practical Data Privacy (Katharine's book in early release): https://www.oreilly.com/library/view/practical-data-privacy/9781098129453/
Katharine's newsletter: https://probablyprivate.com/
'Privacy-first data via data mesh' article by Katharine: https://www.thoughtworks.com/insights/articles/privacy-first-data-via-data-mesh
danah boyd [sic] website: https://www.danah.org/
In this episode, Scott interviewed Katharine Jarmul AKA K-Jams, Principal Data Scientist at Thoughtworks.
Some key takeaways/thoughts from Katharine's point of view:
Katharine shared how she first started looking into privacy. She was doing natural language processing (NLP) work with lots of customer data that was supposedly clean and anonymized - but it wasn't. That raised a red flag for her, as it can mean working with some problematic things. So she went looking for solutions and discovered there was already a lot of existing privacy technology - and there is even more now. We can look at first-order privacy questions, like how do we actually anonymize data, or bigger, more philosophical questions, like how do we prevent our technical work from harming society.
While privacy can be a bit of a nebulous topic, Katharine recommends starting from a gut check: what would you be comfortable with? Would you be okay with your chat history being shared with others? What about your location at all times? If not, look to prevent that from happening in what you are working on. It's also important to apply a multi-cultural lens to what is acceptable - what's okay, or at least tolerated, in the US is not in Europe. The consequences extend beyond GDPR fines to reputation. Think about how data can be misused and look to prevent that. Scott Note: see episode 143 for more about data ethics and what we can do to prevent misuse.
Given that privacy is very contextual to the individual, K-Jams believes that far too often, when we translate privacy decisions into code, we lose that context. Part of the fix is making privacy decisions and options obvious. If you are collecting people's locations, are you giving them an easy way to see why, or to opt out? It's very easy for people to feel tricked.
Katharine gave an example of how digital natives - mostly teenagers on Instagram about 10 years ago - maintained multiple accounts, called 'Finsta' (fake-Instagram) accounts, with different levels of anonymity and privacy to better navigate opaque privacy settings and share data more granularly with exactly who they wanted. It gave them the ability to control how they were sharing their data and with whom. But most people aren't able - or willing - to do something like that.
To drive trust, K-Jams believes we need to show people what they are getting from sharing their data - what the benefit is. Then they can make an informed decision. But privacy settings are often opaque at best. She believes companies that lean into privacy conversations with users will create better relationships with their customers. How are you delivering value back to customers in exchange for them sharing data with you? And increasing privacy doesn't mean you have to give up on the value of your data either.
A few privacy techniques where tooling already exists that Katharine mentioned are pseudonymization, tokenization, masking, and format-preserving encryption - she recommends these as some basics to protect PII (personally identifiable information) or other sensitive information. These are table-stakes information security best practices - maybe even just 'not bad' practices. Then you want to look at potentially layering additional technology on top, like differential privacy. We can even leave data where it is and do federated analytics and federated learning, which has implications for data mesh and machine learning.
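To make the basics concrete, here is a minimal Python sketch of pseudonymization (via a keyed hash) and masking. All names, the sample record, and the key handling are illustrative assumptions, not anything from the episode - in practice the key would live in a secrets manager:

```python
import hmac
import hashlib

# Hypothetical secret; in real systems this comes from a managed vault.
PSEUDONYM_KEY = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Keyed hash: the same input always maps to the same stable pseudonym
    (so joins still work), but the original can't be recovered without the key."""
    return hmac.new(PSEUDONYM_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Masking: keep just enough shape to be useful, hide the rest."""
    local, _, domain = email.partition("@")
    return f"{local[0]}***@{domain}"

record = {"email": "jane.doe@example.com"}
safe = {
    "email": mask_email(record["email"]),
    "user_id": pseudonymize(record["email"]),  # still joinable across tables
}
```

Note the trade-off each knob makes: the pseudonym preserves joinability, the mask preserves rough shape, and neither is full anonymization on its own.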
When looking at the value of privacy, it can be tough to drive buy-in internally - people assume cost with no additional value. But according to K-Jams, you can look at the privacy/utility trade-off: how can we maximize privacy while still not inhibiting the work we need to do? And how, in data mesh, do we actually find those sweet spots? Katharine believes it's through giving data owners the ability to tune privacy - think knobs - to the specifics of the need/use case. That's part of doing federated computational governance through a self-service platform, after all. Try saying that 3 times fast…
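The "knobs" idea has a very literal form in differential privacy, where a single epsilon parameter tunes the privacy/utility trade-off. A minimal sketch using the Laplace mechanism for a counting query - the function names are illustrative, and real deployments would use a vetted library rather than hand-rolled noise:

```python
import random

def laplace_noise(scale: float) -> float:
    """A Laplace sample is the difference of two exponential samples."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(true_count: int, epsilon: float) -> float:
    """Differentially private count via the Laplace mechanism.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    Smaller epsilon = stronger privacy = noisier answer: that's the knob."""
    return true_count + laplace_noise(1.0 / epsilon)
```

A data owner can then set epsilon per use case: tight for a broadly shared dashboard, looser for a vetted internal analysis.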
K-Jams believes it's easiest - at least with current technology - to apply privacy at the data source. But when thinking about something like data mesh, there may be additional challenges, like data from domain A and domain C that should not be combined. So we are still learning how to do data privacy well in a federated environment. Scott Note: Jesse Paquette covered this for healthcare data in episode 10, where certain anonymized information could be joined with other anonymized data to make it personally identifiable. Many people are saving those "tricky" use cases for later, or not trying to automate privacy and instead cordoning off those data products except by request.
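The join risk Scott notes can be shown in a few lines: two datasets with no names in them can still be linked on shared quasi-identifiers like zip code and birth year. A toy sketch with made-up records:

```python
# Two "anonymized" data products: neither carries a name, but both carry
# the same quasi-identifiers, so joining them links a diagnosis to a
# purchase history for one individual. All records here are fabricated.
clinical = [
    {"zip": "10115", "birth_year": 1984, "diagnosis": "asthma"},
    {"zip": "10115", "birth_year": 1991, "diagnosis": "diabetes"},
]
loyalty = [
    {"zip": "10115", "birth_year": 1984, "purchases": ["inhaler refill"]},
]

def join_on_quasi_identifiers(left, right, keys=("zip", "birth_year")):
    index = {tuple(r[k] for k in keys): r for r in right}
    return [
        {**r, **index[tuple(r[k] for k in keys)]}
        for r in left
        if tuple(r[k] for k in keys) in index
    ]

linked = join_on_quasi_identifiers(clinical, loyalty)
# The 1984/10115 record now combines diagnosis and purchases.
```

This is why per-domain anonymization isn't enough - governance has to consider what becomes identifiable once data products are combined.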
When leveraged well, Katharine believes data privacy technology can actually add value. If a data producer is not sure how data consumers will use sensitive data, they are very unlikely to share it. But if they can lock down the data in certain ways while still giving access, that is a win-win: the data consumers get access to information they wouldn't have gotten otherwise, and the data producers can still sleep at night. It can turn a no into a yes. Sarita Bakst mentioned something similar in episode 52. You can also get past legal and regulatory barriers if you do data privacy right - your legal and regulatory people want to say yes, so give them the ability to turn their no into a yes. Offer up potential offsets to privacy concerns - say, only using the data in a sandbox to start - to see where their issues are.
For K-Jams - and Scott - the desire to remove people from the technology aspects of things like privacy ends up being silly. We can't make these decisions via the tech alone. Stop trying to replace conversations with technical solutions; sometimes people just need to collaborate to get where we need to go. Don't hand someone like legal a yes-or-no decision - exchange context and collaborate toward a positive outcome instead of just asking 'can I do this?'
Katharine gave a good overview of how to move up the privacy ladder about 47 minutes into the interview (so likely ~55 minutes into the episode). How do you move from not so great to okay to pretty good to good (but using meh -> eh -> heh -> hah because of Scott…)? Privacy isn't all or nothing and you can improve and iterate.
Second-layer privacy enhancing techniques mentioned: differential privacy, data minimization, federated analytics, federated learning, distributed querying, encrypted computation, and secure multi-party computation.
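Of those second-layer techniques, federated analytics is the easiest to sketch: each domain computes a local aggregate and only the aggregates - never the raw rows - leave the domain. A toy example with fabricated data (in practice the aggregates themselves may also need noise or minimum-count thresholds):

```python
# Federated analytics sketch: raw values stay in their domain; only
# (sum, count) pairs are shared with the party computing the global answer.
domain_a_ages = [34, 29, 41]   # stays inside domain A
domain_b_ages = [52, 38]       # stays inside domain B

def local_aggregate(values):
    """Computed inside the domain; only this tuple is shared outward."""
    return (sum(values), len(values))

def federated_mean(aggregates):
    total = sum(s for s, _ in aggregates)
    count = sum(n for _, n in aggregates)
    return total / count

mean_age = federated_mean([
    local_aggregate(domain_a_ages),
    local_aggregate(domain_b_ages),
])
# (34 + 29 + 41 + 52 + 38) / 5 = 38.8
```

Federated learning follows the same pattern, sharing model updates instead of aggregates - which is why both fit a federated architecture like data mesh.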
You probably won't get your privacy perfect on your first try. That's okay. Look to prevent regulatory/compliance issues but much like all aspects of data mesh: try, learn, iterate.
Think about what you can and cannot show in a data catalog about potentially sensitive data. You can share descriptive statistics and information about use cases without exposing the sensitive data itself, until you know a new use case is allowed/ethical. Look to share as much information as you can - where appropriate - instead of locking down everything related to sensitive information.
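A minimal sketch of what such a catalog entry might hold - the field names and values are illustrative, not from any particular catalog tool. Note that even descriptive statistics over small datasets can leak, so the same governance review applies to them:

```python
import statistics

# Fabricated sensitive column; raw values never leave the data product.
salaries = [54000, 61000, 58000, 72000, 49000]

# Only these summary fields get published to the catalog.
catalog_entry = {
    "column": "salary",
    "sensitivity": "restricted",
    "row_count": len(salaries),
    "mean": statistics.mean(salaries),
    "stdev": round(statistics.stdev(salaries), 2),
    "access": "by request only",
}
```

Consumers can discover that the data exists and whether it fits their use case, then request access through the governed path rather than never learning about it at all.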
Empower the people who know the data best with privacy tooling. Don't make them build it themselves - they will usually know best how to apply it - but obviously provide them a path if they have questions/concerns.
It's very easy for privacy concerns to become overbearing. If you reject use cases 90% of the time, you will create shadow IT, and that is far more dangerous for legal and regulatory reasons. Look to exchange context and work towards a viable solution.
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under "add payment"): AstraDB