#157 Getting Practical with Data Privacy - Interview w/ Katharine Jarmul AKA K-Jams
Episode 157 • 20th November 2022 • Data Mesh Radio • Data as a Product Podcast Network
Duration: 01:13:55


Shownotes

Data Mesh Radio Patreon - get access to interviews well before they are released

Episode list and links to all available episode transcripts (most interviews from #32 on) here

Provided as a free resource by DataStax AstraDB; George Trujillo's contact info: email (george.trujillo@datastax.com) and LinkedIn

Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.

Katharine's LinkedIn: https://www.linkedin.com/in/katharinejarmul/

Practical Data Privacy (Katharine's book in early release): https://www.oreilly.com/library/view/practical-data-privacy/9781098129453/

Katharine's newsletter: https://probablyprivate.com/


'Privacy-first data via data mesh' article by Katharine: https://www.thoughtworks.com/insights/articles/privacy-first-data-via-data-mesh


danah boyd [sic] website: https://www.danah.org/


In this episode, Scott interviewed Katharine Jarmul AKA K-Jams, Principal Data Scientist at Thoughtworks.


Some key takeaways/thoughts from Katharine's point of view:

  1. Increasing privacy around data does NOT mean you have to give up value.
  2. Instead of data privacy being a blocker, it can turn nos to yeses because there is a better ability to restrict illegal/unethical use. Regulatory and legal people want to say yes, so give them the ability to do so.
  3. There are lots of tools available to enhance your data privacy now. This isn't a pipe dream. That said, don't look to replace person-to-person conversations and decisions with tech. You'll learn when to use what on your journey; it's okay to iterate :)
  4. Empower the people who know the data best with privacy tooling. Don't make them build it themselves either. They will know best most of the time - but obviously provide them a path if they have questions/concerns.
  5. Privacy is a sliding scale, not all or nothing. You can start off in a pretty rough spot and still make progress as you keep assessing where you can do better.
  6. Use privacy as a lens for how valuable your sensitive information actually is. Can it be used appropriately? If so, what's the value of leveraging that data and is it worth the privacy cost?
  7. When thinking about what level of privacy to enact in systems, think about your comfort level if it were you. Would you want your location history shared at all times? Using a multi-cultural approach is important too as different cultures have different norms around privacy.
  8. There is some table-stakes privacy tech you should use as part of general information security to protect sensitive information - it may even be required by law. Then look to layer privacy-enhancing technology on top to do things like actual anonymization.
  9. Privacy choices should be broad organization-driven decisions, not something one person decides to implement. But if there isn't buy-in, it can sometimes be tough to show people the business value of data privacy.
  10. Look to the privacy/utility trade-off - how can we maximize privacy but also maximize what we need to do the task at hand?
  11. An emerging practice that has a big potential impact on privacy in data mesh is federated analytics and distributed queries - can we do analytics on data where it is without moving it to a big central place to do the analysis?
  12. A lot of data scientists don't want to work on things they feel are problematic. So work to prevent problematic use cases and the problematic data practices that can give rise to them.
  13. Privacy can be about first order problems like how do you anonymize your data. But it can also be much broader, like understanding the impact of what work we do on people and society as a whole.
  14. We need to make privacy more transparent and obvious, otherwise people feel tricked. People want to know what they get out of sharing their information.
  15. In data mesh, look to offer the easy ability to adjust privacy - privacy 'knobs' - so the domain expert can easily make choices without having to implement the tech. Enable privacy via the platform.
  16. Typically, data privacy is applied at the data source. We are still learning how to do data privacy well in a federated setup with cross-domain data combination restrictions.
  17. If you're constantly rejecting use cases, that will just create shadow IT. As an organization, figure out how you can get to a yes where you can.
  18. Scott Note: K-Jams is writing her book to fill the gap in information about privacy technology between very basic info and academic research.



Katharine shared how she first started looking into privacy. She was doing natural language processing (NLP) and was working with lots of customer data that was supposedly clean and anonymized - but it wasn't. It raised a red flag for her, as that can mean working with some problematic things. So she went looking for solutions and discovered there were a lot of existing privacy technologies - and there are even more now. We can look at first-order privacy questions, like how we actually anonymize data, or bigger, more philosophical questions, like how we prevent harm to society from our technical work.


While privacy can be a bit of a nebulous topic, Katharine recommends starting from a gut check, basically: what would you be comfortable with? Would you be okay with your chat history being shared with others? What about your location at all times? If no, look to prevent that from happening in what you are working on. But it's important to also use a multi-cultural lens on what is acceptable, as what's okay - or at least tolerated - in the US is not in Europe. And this of course can extend beyond fines from GDPR to reputation. Think about how data can be misused and look to prevent that. Scott Note: see episode 143 for more about data ethics and what we can do to prevent misuse.


Given that privacy is very contextual to the individual, K-Jams believes it's far too often that when we translate privacy decisions to code, we lose the context. And part of that is making privacy decisions and options obvious. If you are collecting people's locations, are you giving them an easy ability to see why or opt out? It's very easy for people to feel tricked.


Katharine gave an example of how digital natives - mostly teenagers on Instagram about 10 years ago - had different accounts, called 'Finsta' accounts for fake-Instagram, with different levels of anonymity and privacy to better navigate the opaque privacy settings and share data more granularly with whom they wanted. It gave them an ability to control how they were sharing their data and with whom. But most people aren't able - or willing - to do something like that.


To drive trust, K-Jams believes we need to show people what they are getting from sharing their data, what is the benefit. Then they can make an informed decision. But privacy settings are often opaque at best. She believes that companies leaning into privacy conversations with users will create a better relationship with customers. How are you delivering value back to customers from them sharing data with you? And increasing privacy doesn't mean you have to give up on the value of your data either.


A few privacy techniques where tooling already exists that Katharine mentioned are pseudonymization, tokenization, masking, and format-preserving encryption - she recommends using these as some basics to protect PII (personally identifiable information) or other sensitive information. These are just table-stakes information security best - maybe even just 'not bad' - practices. Then you want to look to potentially layer additional technology on top, like differential privacy. We can even leave data where it is and do federated analytics and federated learning, which has implications for data mesh and machine learning.
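As a rough illustration (not from the episode), here's a minimal Python sketch of two of those table-stakes techniques - masking and salted pseudonymization. The field names, salt, and record are all hypothetical; real systems would pull the salt from a secrets manager and likely use vetted tokenization tooling instead:

```python
import hashlib

SALT = "rotate-me-per-environment"  # hypothetical; keep out of source control

def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable, hard-to-reverse token."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Hide most of the local part but keep the domain for analytics."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

record = {"user_id": "u-1029", "email": "katharine@example.com"}
safe = {
    "user_id": pseudonymize(record["user_id"]),  # still joinable across tables
    "email": mask_email(record["email"]),        # readable but not identifying
}
print(safe["email"])  # k***@example.com
```

Because the pseudonym is deterministic, the same user gets the same token everywhere, so analysts can still join datasets without seeing raw identifiers.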


When looking at the value of privacy, it can be tough to drive buy-in internally - people assume cost with no additional value. But according to K-Jams, you can look at the privacy/utility trade-off: how can we maximize privacy while still not inhibiting the work we need to do? And how, in data mesh, do we actually find those sweet spots? Katharine believes it's through giving the data owners the ability to tune privacy - think knobs - to the specifics of the need/use case. That's part of doing federated computational governance through a self-service platform after all. Try saying that 3 times fast…
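One way to picture those platform-provided privacy 'knobs' - a hypothetical sketch, not an implementation described in the episode - is a per-field privacy level the domain owner sets declaratively, with the platform applying the matching transformation. The level names and fields here are made up:

```python
# Hypothetical privacy "knobs": the domain owner picks a level per field;
# the platform maps each level to a transformation before publishing.
LEVELS = {
    "open": lambda v: v,                              # publish as-is
    "masked": lambda v: v[:2] + "*" * (len(v) - 2),   # keep a hint, hide the rest
    "suppressed": lambda v: None,                     # drop the value entirely
}

def apply_knobs(row: dict, knobs: dict) -> dict:
    """Fields with no explicit knob default to the most private setting."""
    return {k: LEVELS[knobs.get(k, "suppressed")](v) for k, v in row.items()}

row = {"city": "Berlin", "phone": "+49301234567"}
knobs = {"city": "open", "phone": "masked"}
print(apply_knobs(row, knobs))
# {'city': 'Berlin', 'phone': '+4**********'}
```

The point of the sketch is the defaulting: the domain expert only dials knobs up or down, and anything they haven't explicitly opened stays suppressed.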


K-Jams believes it's easiest - at least with current technology - to apply privacy at the data source. But when thinking about something like data mesh, there may be additional challenges, like data from domain A and domain C should not be combined. So we are still learning how to do data privacy well in a federated environment. Scott Note: Jesse Paquette covered this for healthcare data in episode 10, where certain anonymized information could be joined with other anonymized data to make it personally identifiable. Many people are saving those "tricky" use cases for later, or are not trying to automate privacy and are instead cordoning off those data products except by request.


When leveraged well, Katharine believes data privacy technology can actually add more value. If a data producer is not sure how data consumers will use sensitive data, they are very unlikely to share it. But if they can lock down the data in certain ways while still giving access, that is a win-win. The data consumers get access to information they wouldn't have gotten otherwise, and data producers can still sleep at night. It can turn a no to a yes. Sarita Bakst mentioned something similar in episode 52. And you can also get past legal and regulatory barriers if you do data privacy right - your legal and regulatory people want to say yes, so give them the ability to turn their no to a yes. Offer up potential offsets to privacy concerns - say, only using the data in a sandbox to start - to see where their issues are.


For K-Jams - and Scott - the desire to remove the people from the technology aspects of things like privacy ends up being silly. We can't make decisions only via the tech. Stop trying to replace conversations with technical solutions, sometimes people just need to collaborate to get where we need to go. Don't make it a yes or no decision for someone like legal, exchange context and look to collaborate on a positive outcome instead of 'can I do this?'


Katharine gave a good overview of how to move up the privacy ladder about 47 minutes into the interview (likely around 55 minutes into the episode). How do you move from not so great to okay to pretty good to good (but using meh -> eh -> heh -> hah because of Scott…)? Privacy isn't all or nothing and you can improve and iterate.



Quick tidbits:

Second-layer privacy enhancing techniques mentioned: differential privacy, data minimization, federated analytics, federated learning, distributed querying, encrypted computation, and secure multi-party computation.
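To make one of these second-layer techniques concrete, here's a toy sketch of differential privacy's core move: adding calibrated Laplace noise to an aggregate before releasing it. The epsilon values and counts are made up for illustration; production systems would use a vetted DP library rather than hand-rolled noise:

```python
import random

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Noisy answer to a counting query (sensitivity 1).

    Smaller epsilon = more noise = stronger privacy guarantee.
    """
    # Laplace(0, 1/epsilon) sampled as the difference of two exponentials.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Toy usage: each query sees a slightly different, noisy answer, so no
# single individual's presence can be confidently inferred from the result.
print(round(dp_count(1000, epsilon=0.5), 1))
```

Note the trade-off Katharine describes: the aggregate stays useful (the noisy count is close to 1000) while any one person's contribution is hidden in the noise.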


You probably won't get your privacy perfect on your first try. That's okay. Look to prevent regulatory/compliance issues but much like all aspects of data mesh: try, learn, iterate.


Think about what you can and cannot show in a data catalog about potentially sensitive data. You can share descriptive statistics and information about use cases without exposing the sensitive data until you know a new use-case is allowed/ethical. Look to share as much information as you can - where appropriate - instead of locking down anything related to sensitive information.


Empower the people who know the data best with privacy tooling. Don't make them build it themselves - they will know best most of the time - but obviously provide them a path if they have questions/concerns.


It's very easy for privacy concerns to become overbearing. If 90% of the time, you reject use cases, you will create shadow IT and that is far more dangerous for legal and regulatory reasons. Look to exchange context and work towards a viable solution.


Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him at community at datameshlearning.com or on LinkedIn: https://www.linkedin.com/in/scotthirleman/

If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/

If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here

All music used this episode was found on PixaBay and was created by (including slight edits by Scott Hirleman): Lesfm, MondayHopes, SergeQuadrado, ItsWatR, Lexin_Music, and/or nevesf

Data Mesh Radio is brought to you as a community resource by DataStax. Check out their high-scale, multi-region database offering (w/ lots of great APIs) and use code DAAP500 for a free $500 credit (apply under "add payment"): AstraDB
