The Center for Data Innovation spoke with Kevin Yee, co-founder and CTO of betterdata, a Singapore-based startup focusing on synthetic data. Yee discussed some of the privacy-preserving technologies that he expects to emerge in the coming years.
Gillian Diebold: How can synthetic data help businesses innovate?
Kevin Yee: You’ve probably heard the phrase that data is the new oil—an asset with significant value beyond its current use. Many people have this perception because enterprises use data to develop, experiment, and innovate. On the flip side, data leaks, like oil spills, can be devastating to organizations, people, and society.
Businesses now need to innovate with data that contains valuable insights into customers’ behavior, while also managing the risks involved and the ever-changing spectrum of user expectations. These include security liabilities and privacy concerns, especially when data contains personally identifiable information (PII) that is vulnerable to leaks and can put an organization at reputational and regulatory risk.
Because data is intangible, organizations have no structured way to measure the risk-reward ratio of using it. This often leads to a more conservative approach where data is siloed in databases—unused and unleveraged. It is a case of uncertainty rather than measurable risk, and uncertainty is very hard to quantify.
All this may sound daunting at first, but this is exactly where synthetic data takes the spotlight. Synthetic data helps organizations make data freely accessible and portable across teams, businesses, and international borders. Advanced AI techniques such as generative adversarial networks (GANs) can produce synthetic data that keeps the statistical properties and patterns of the original data while ensuring privacy, with a near-zero risk of re-identification compared to current data anonymization methods, where that risk remains high.
Whether it is used to support artificial intelligence and machine learning (AI/ML) development or to share data internally and externally, artificially generated synthetic data can serve as a substitute for real data with full accessibility and compliance. So organizations can now innovate with synthetic data without the risk and compliance hurdles of using real data.
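To make the idea concrete, here is a minimal sketch of a GAN trained on tabular data with PyTorch. It is an illustration rather than betterdata’s actual pipeline: the column count, network sizes, and toy “real” dataset are all assumptions.

```python
# Minimal GAN sketch for synthetic tabular data (illustrative, not a product pipeline).
import torch
import torch.nn as nn

n_features = 4                               # e.g. age, income, tenure, spend (hypothetical columns)
real_data = torch.randn(1000, n_features)    # stand-in for a real customer table

generator = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Discriminator: learn to tell real rows from generated rows.
    noise = torch.randn(128, 16)
    fake = generator(noise).detach()
    real = real_data[torch.randint(0, len(real_data), (128,))]
    d_loss = loss_fn(discriminator(real), torch.ones(128, 1)) + \
             loss_fn(discriminator(fake), torch.zeros(128, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: produce rows the discriminator accepts as real.
    noise = torch.randn(128, 16)
    g_loss = loss_fn(discriminator(generator(noise)), torch.ones(128, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# After training, sample as many synthetic rows as needed.
synthetic_rows = generator(torch.randn(500, 16)).detach()
```

Once trained, the generator can emit arbitrarily many synthetic rows that mimic the distribution of the real table without containing any of the original records.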
Diebold: How does differential privacy protect user data?
Yee: Speaking from an AI perspective, differential privacy is one of the predominant techniques used to prevent deep learning models from exposing users’ private information in the datasets used to train them.
Pioneered by Cynthia Dwork at Microsoft Research, it has been widely adopted by tech giants to “learn” about the extended user community without learning about specific individuals. So, a synthetic dataset produced by a differentially private model protects user data by providing privacy guarantees backed by publicly available mathematical proofs while keeping the same schema and maintaining most of the statistical properties of the original dataset.
The key to the whole technique here lies in balancing privacy and accuracy with a parameter called ε (epsilon)—the smaller the ε value, the greater the privacy is preserved, but the lower the data accuracy. With a carefully chosen ε value, it is possible to create a synthetic dataset with a fairly high utility while ensuring sufficient privacy.
What this means is that differentially private synthetic data mitigates privacy attacks such as membership inference and model inversion, which exploit information leakage from a trained AI model to reconstruct its training data in part or in whole.
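As a rough illustration of how ε governs the privacy-accuracy trade-off, here is a minimal sketch using the classic Laplace mechanism on a counting query. Real differentially private synthetic-data generators (for example, models trained with DP-SGD) are far more involved; the toy dataset and ε values below are assumptions.

```python
# Laplace mechanism on a counting query: smaller epsilon -> more noise (more privacy, less accuracy).
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.normal(60_000, 15_000, size=10_000)  # toy "real" dataset

def dp_count_above(data, threshold, epsilon):
    """Count records above a threshold, with Laplace noise calibrated to epsilon."""
    true_count = np.sum(data > threshold)
    sensitivity = 1  # adding or removing one person changes the count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

for eps in (0.01, 0.1, 1.0, 10.0):
    noisy = dp_count_above(incomes, 80_000, eps)
    print(f"epsilon={eps:>5}: noisy count = {noisy:,.0f}")
```

Running the loop shows the answer wobbling heavily at ε = 0.01 and settling near the true count at ε = 10, which is the trade-off described above.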
Diebold: Can you explain how synthetic data can lead to “fairer” AI models? What does “fairness” mean?
Yee: This topic is very much up for debate, with no right or wrong answers. Fairness is a complex concept that means different things in different contexts to different people. Let us say that for AI practitioners, fairness tends to be viewed from a quantitative perspective where algorithms are subjected to fairness constraints involving sensitive and legally protected attributes. The goal is to ensure the algorithms perform well in real life while also treating people “fairly” and without bias with respect to attributes such as race, religion, job, income, gender; the list goes on.
It is fair to say that there is no single cause of bias and therefore no single solution. However, a good remedy is to address it at its source—the data itself. One way to reduce bias in a dataset is to ensure demographic parity across protected subgroups, where membership in a protected subgroup has no correlation with the predictive outcome of a downstream AI/ML model. Simply put, an AI model should not discriminate on the basis of any protected attribute, and for that, a “fixed” version of the dataset is very much needed.
Let’s say we have a citizen income dataset where demographic parity is not satisfied for the protected “sex” variable. In other words, there is a higher proportion of males than females in the high-income category. Fixing bias at the data level is possible with synthetic data because we have full control over the data generation process. This allows us to generate equal proportions of males and females in both the high- and low-income categories, removing the correlation between “sex” and “income” and mitigating the bias of income with respect to gender.
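To illustrate, here is a minimal sketch of checking demographic parity and then rebalancing the dataset with additional rows. The tiny table is made up, and the resample-based “generator” simply stands in for a trained generative model such as a GAN.

```python
# Demographic parity check and rebalancing on a toy income table (illustrative).
import pandas as pd

df = pd.DataFrame({
    "sex":    ["M"] * 70 + ["F"] * 30 + ["M"] * 30 + ["F"] * 70,
    "income": ["high"] * 100 + ["low"] * 100,
})

# Demographic parity: P(income = high | sex) should not depend on sex.
parity = df.groupby("sex")["income"].apply(lambda s: (s == "high").mean())
print(parity)  # F: 0.30, M: 0.70 -> parity is violated

# Build a balanced dataset with equal representation in every (sex, income) cell.
# Here we bootstrap from existing rows; in practice a generative model would
# produce genuinely new synthetic records for the under-represented cells.
balanced = pd.concat(
    [group.sample(50, replace=True) for _, group in df.groupby(["sex", "income"])],
    ignore_index=True,
)

parity_after = balanced.groupby("sex")["income"].apply(lambda s: (s == "high").mean())
print(parity_after)  # F: 0.50, M: 0.50 -> parity holds
```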
Reducing bias from a quantitative perspective is only one step. With fairness lying at the intersection of law, social science, and technology, the issue of fairer AI models cannot be addressed through any one avenue alone; it requires a diverse set of stakeholders to provide their perspectives and shape decisions and future policies.
Diebold: What are some real-world use cases for synthetic data?
Yee: I personally believe synthetic data is the future for open data innovation and a responsible data economy. There are a ton of use cases out there, but let me share one that sits close to my heart. Say you are a facial recognition company that uses face images to train an AI model and classify people. Let us assume most of the images belong to a specific skin tone, leading to high classification accuracy for that skin tone and not the others. By using synthetic data, faces with all sorts of skin tones can be generated, and the AI model can be improved to better detect people who were previously misclassified due to a lack of data.
Speaking on a broader level now, Amazon is using synthetic data to train Amazon Go vision recognition and Alexa’s language systems. Roche, one of the industry-leading pharmaceutical companies, is using synthetic medical data for faster and cheaper clinical research and trials. Google Waymo is using synthetic data to train its autonomous vehicles. Ford is combining gaming engines with synthetic data for AI training—how cool is that? Deloitte is building more accurate AI models by artificially generating 80 percent of the training data, and American Express is using synthetic financial data to improve fraud detection algorithms.
Diebold: Beyond synthetic data, what other privacy-preserving technologies will be important in the coming years?
Yee: As more than 120 countries have already passed data protection regulations, privacy-preserving technologies (PPTs) or privacy-enhancing technologies (PETs) will only become more important in the years ahead. These technologies will complement each other to solve different problems and will be a centerpiece to overcoming regulatory, ethical, and social sensitivities around data. Besides synthetic data and differential privacy, I am excited about the following new technologies:
First, blockchain for tracking data provenance, ensuring transparency, and enabling non-custodial ownership of people’s personal data. I think blockchain (web3) has the right tools for security and privacy to democratize data.
Likewise, I’m interested in federated learning, which trains a shared model while keeping all training data local on users’ devices by exchanging AI model parameters instead of the raw data itself. It is well suited to use cases where data is distributed across a large number of stakeholders, such as smartphones, and user privacy is paramount. It is less suited to use cases that involve sharing or analyzing large amounts of sensitive, centralized data.
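As a rough sketch of the core idea (federated averaging), here is a toy example in which each client fits a shared linear model on its own data and only the model weights travel to the server. The model, synthetic client data, and hyperparameters are illustrative assumptions.

```python
# Federated averaging sketch: clients train locally, the server averages parameters.
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])

def make_client_data(n=200):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

clients = [make_client_data() for _ in range(5)]  # five devices, data never pooled
global_w = np.zeros(2)

def local_update(w, X, y, lr=0.1, epochs=20):
    """Gradient-descent steps on the client's own data; raw data never leaves the device."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

for _ in range(10):
    # Each client trains locally, then only the updated weights are sent back.
    client_weights = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(client_weights, axis=0)  # server averages the parameters

print("federated estimate of the weights:", global_w)  # converges near [2, -1]
```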
Secure multiparty computation allows multiple parties to securely share their data and perform computations on it without revealing their individual inputs. Although this technique offers stronger security guarantees than federated learning, it requires expensive cryptographic operations, which results in very high computation costs. Therefore, it is more suitable for a smaller number of participants and basic machine learning models.
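To give a flavor of the technique, here is a minimal sketch of additive secret sharing, a basic building block of many secure multiparty computation protocols: three parties learn the sum of their private values without any party seeing another’s input. The field size and private values are assumptions, and production protocols are considerably more involved.

```python
# Additive secret sharing: compute a joint sum without revealing individual inputs.
import random

PRIME = 2**61 - 1                 # arithmetic is done modulo a large prime
private_inputs = [42, 17, 99]     # each party's secret value (illustrative)
n_parties = len(private_inputs)

def share(secret, n, prime=PRIME):
    """Split a secret into n additive shares that sum to the secret mod prime."""
    shares = [random.randrange(prime) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % prime)
    return shares

# Each party shares its secret; party j holds the j-th share from everyone.
all_shares = [share(x, n_parties) for x in private_inputs]
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]  # computed locally by each party

# Revealing only the partial sums yields the total, not the individual inputs.
total = sum(partial_sums) % PRIME
print("sum of private inputs:", total)  # 158
```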
Lastly, trusted execution environments are truly a game-changer, in my opinion. They are a step beyond software security and are based on secure hardware enclaves. This means encrypted data in and encrypted data out—all the while establishing data confidentiality, integrity, and attestation of the code or function being run in the enclave itself.