Is Synthetic Data Really an AI Game Changer?
Real Data Still Matters, at Least for Now
Data is a core element of scientific and business processes. When organizations collect large sets of information, referred to as big data, it can be used for machine learning, predictive modeling, and other applications requiring advanced analytics. Businesses can tap into big data to improve operations, create customized marketing strategies, and enhance customer service, giving them a competitive edge over companies not using big data. Access to big data and advanced analytics is also crucial in scientific fields. When researchers can instantaneously call up almost unlimited data sets, tasks like solving complex health problems become far more feasible.
However, one drawback of having access to this massive amount of information is the restrictions imposed by ever-increasing data privacy regulations. This is where synthetic data can fill in the gaps.
Synthetic data has the same mathematical and statistical properties as authentic data without putting user privacy at stake. In simple terms, an artificial intelligence (AI) algorithm is trained on a real data set, learns its patterns and trends, and then generates a new set of data not tied to any real individuals. This synthetic data can then be used to train machine learning models and draw statistics-based conclusions, theoretically protecting personally identifiable information.
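The idea can be illustrated with a minimal sketch, here using NumPy and invented numbers for the "real" data: learn the statistical properties of a real data set, then sample a brand-new artificial one from those properties.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented "real" data: 1,000 records with two correlated numeric
# attributes (say, age and annual spending). Purely illustrative.
real = rng.multivariate_normal(
    mean=[45.0, 12_000.0],
    cov=[[100.0, 5_000.0],
         [5_000.0, 4_000_000.0]],
    size=1_000,
)

# Step 1: learn the statistical properties of the real data.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Step 2: sample a brand-new data set from those properties.
# No synthetic row corresponds to any real individual, yet means,
# variances and correlations are approximately preserved.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)
```

Real-world generators are far more sophisticated (and must handle categorical data, rare values and privacy guarantees), but the principle is the same: the synthetic set mimics the statistics of the original without copying its rows.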
Advantages of Synthetic Data
Synthetic data adds a layer of protection that prevents sensitive data from being exposed to companies and individuals. This is particularly useful in industries where high levels of privacy are crucial, such as finance and medicine.
Imagine a hospital wants to improve a diagnostic system. To create a highly accurate algorithm, the scientific team would need access to very private medical information. Synthetic data provides a bridge over this barrier and aids legal compliance in data handling.
Here are a couple of real-life examples of business applications:
Companies like American Express and J.P. Morgan apply synthetic financial data to improve fraud detection. Amazon uses synthetic data to improve Alexa’s speech capabilities.
Organizations may not have access to enough data for thorough model training for a variety of reasons. Large synthetic data sets can be created to fill in the gaps. Purchasing synthetic data sets can also be much more affordable than buying real ones, stretching budgets and lowering the barrier to entry for AI projects.
Finally, there is a security aspect. Sensitive data does not have to be moved around, which removes risks such as information being breached in transit.
Sound too good to be true?
If other articles make it sound like synthetic data will solve all of our highly complex computing problems, here are a few things to consider.
Computers are not human. Our brains can compensate for anomalies in expected outcomes, but even with neuromorphic computing emulating the human neural structure, AI cannot predict the unpredictable. Statistics suggest large pandemics are more likely than we thought. Whether it is a global health crisis or some other unforeseen worldwide tragedy, prediction is a major gap in AI, and faulty synthetic data makes it worse.
Synthetic data has already proven unable to reflect some real-world conditions, such as the COVID-19-related supply chain disruptions. This goes back to the basic premise of computing: garbage in, garbage out. Inaccurate (in this case, out-of-date) information fed in produces inaccurate synthetic data, which can lead to faulty decision making.
The COVID-19 crisis revealed another gap in AI. The technology relies on data-fed advanced analytic processes, including machine learning (ML). A core assumption of ML is that past behaviors and patterns will likely repeat in the future. With COVID-19, those patterns were shattered. People left central offices to work remotely, and many have stayed remote. There are still lasting effects on how we travel and even on which products and services are popular. As research firm McKinsey points out, we must fix the analytics models that COVID broke.
Speed of adaptability. Google is filled with articles claiming that AI, supported by synthetic data, is going to fix our supply chain problems. I'd take these pitches with a huge grain of salt. The radical shift in consumer trends was identified, tracked and documented by McKinsey two years ago. Why is AI still slow to remedy our supply chain problems today?
It should also be pointed out that before the general public was even fully aware of the global health crisis, the best research facilities in the world were deploying artificial intelligence and machine learning to rapidly produce predictions to shape diagnosis, prognosis and forecasts of the spread of COVID-19. We all lived through that failure, which was blamed on bad data sets, automated discrimination, human failures and a complex global context.
AI workforce gaps. We also do not have a U.S. or global workforce capable of keeping up with advances in AI. For example, AI is thought to be of great importance to fight climate change, but according to a 2022 survey among leaders already engaged in the space, 78% cite insufficient access to AI expertise, 77% have limited access to AI solutions, and 67% face a lack of confidence in AI data and analysis.
Lack of real data. While fields like health care and other scientific research may be willing to share data to find solutions, some industries, like manufacturing, are more siloed. Because of industrial competition, they do not want to share how they have created efficiencies and profitability in their manufacturing processes. Often there is simply not enough real data to emulate when creating synthetic data.
Bias. One of the biggest problems with using data of any sort is inherent bias. Computers and technology systems don't create bias; our human interaction with data does. Synthetic data doesn't eliminate bias; it inherits it from the real data its generating algorithm was trained on. Working with data takes specialized human knowledge and skill, especially in minimizing bias, so that skew is not perpetuated when real data is converted to synthetic data.
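A toy sketch, with entirely hypothetical numbers, shows how a naive generator faithfully reproduces an imbalance present in its training data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical biased "real" data: 80% of past loan approvals
# went to group A, only 20% to group B.
real_approvals = rng.choice(["A", "B"], size=10_000, p=[0.8, 0.2])

# A naive synthetic generator learns the observed frequency...
p_a = float(np.mean(real_approvals == "A"))
synthetic_approvals = rng.choice(
    ["A", "B"], size=10_000, p=[p_a, 1.0 - p_a]
)

# ...and reproduces the same imbalance in the synthetic data.
synthetic_share_a = float(np.mean(synthetic_approvals == "A"))
```

Nothing in the generation step questions whether the 80/20 split is fair or representative; removing that skew requires deliberate human intervention before or during synthesis.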
Can data really be anonymized? Here are two examples where sensitive information, like personal identity, is still at risk.
Synthetic data is modeled after real people. In a medical application, this could include vital statistics, demographic profiles, medical conditions, treatment results and other information. In large data sets, say a Minneapolis city zip code, synthetic data can closely match the real data while remaining anonymous because of the sheer volume of people in the set. However, data sets created from small sample sizes, such as in rural areas, are not only inaccurate, they also lose anonymity. If a person in a sparsely populated area has a unique condition that no one else in their zip code has, their identity would be fairly easy to determine. Currently, these small samplings are often eliminated from data sets, which excludes a segment of the population and leaves the data unrepresentative of all demographics in an area.
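The small-sample risk can be sketched in a few lines of Python, using invented records: any combination of attributes that appears only once in a data set is a potential re-identification target, even with no names present.

```python
from collections import Counter

# Invented anonymized records: (zip_code, condition) pairs.
records = [
    ("55401", "diabetes"), ("55401", "asthma"), ("55401", "diabetes"),
    ("55401", "asthma"), ("55401", "diabetes"),
    ("56401", "rare_disorder"),  # the only such record in its zip code
]

# Count how many records share each combination of attributes.
counts = Counter(records)

# Combinations appearing only once can single out a real person,
# even though no name appears anywhere in the data.
at_risk = [combo for combo, n in counts.items() if n == 1]
```

This is the intuition behind k-anonymity: every combination of identifying attributes should be shared by at least k people before a data set is considered safe to release.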
Computer algorithms can also be written to de-anonymize data sets. Entertainment company Netflix released an anonymized data set of customers' movie ratings. Researchers at the University of Texas at Austin were able to de-anonymize it by cross-referencing publicly available ratings on the Internet Movie Database (IMDb) website.
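The technique is essentially record linkage: match "anonymized" records against a public source on attributes the two data sets share. A minimal sketch with invented data (not the actual Netflix data set):

```python
# "Anonymized" ratings, keyed by an opaque user ID.
anonymized = {
    "user_17": {("Heat", 5), ("Fargo", 4), ("Alien", 3)},
    "user_42": {("Up", 5), ("Jaws", 2)},
}

# Public IMDb-style reviews posted under real names.
public_reviews = {
    "alice": {("Heat", 5), ("Fargo", 4), ("Alien", 3), ("Big", 4)},
}

# Link any anonymous user whose ratings overlap heavily with a
# named public profile.
matches = [
    (anon_id, name)
    for anon_id, ratings in anonymized.items()
    for name, public in public_reviews.items()
    if len(ratings & public) >= 3  # overlap threshold (arbitrary here)
]
```

With enough shared attributes, a handful of overlapping data points is often sufficient to pin an "anonymous" record to a named person.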
Organizations aware of these problems are working to assess just how reliable and accurate synthetic data sets are for their applications. It seems that AI and ML have some growing and refinement to do before adding in the complexity of synthetic data.