
When it comes to training AI models, bigger datasets may not always be better: U of T study

"We need to pay attention to the information richness, rather than just gathering as much data as we can"

A new study by researchers at U of T Engineering suggests that models trained on relatively small datasets can perform well if the data is of high enough quality (photo by Jasmin Merdan/Getty Images)

A new study by researchers at the University of Toronto suggests that one of the fundamental assumptions of deep learning artificial intelligence models – that they require enormous amounts of training data to make accurate predictions – may not be as solid as once thought.

Jason Hattrick-Simpers, a professor in the Faculty of Applied Science & Engineering, and his team are focused on the design of next-generation materials – from catalysts that convert captured carbon into fuels to non-stick surfaces that keep airplane wings ice-free.

Their findings stemmed from efforts to navigate a key challenge in the field: the enormous potential search space. For example, the Open Catalyst Project dataset contains more than 200 million data points for potential catalyst materials – yet that still covers only a tiny portion of the vast chemical space that could, for instance, yield the right catalyst to help us address climate change.

"AI models can help us efficiently search this space and narrow our choices down to those families of materials that will be most promising," says Hattrick-Simpers.

"Traditionally, a significant amount of data is considered necessary to train accurate AI models. But a dataset like the one from the Open Catalyst Project is so large that you need very powerful supercomputers to be able to tackle it. So, there's a question of equity – we need to find a way to identify smaller datasets that folks without access to huge amounts of computing power can train their models on."

This leads to a second challenge: many of the smaller materials datasets currently available have been developed for a specific domain – for example, improving the performance of battery electrodes. In other words, the data tend to cluster around a few chemical compositions similar to those already in use while missing more promising possibilities that may be less obvious.

"Imagine if you wanted to build a model to predict students' final grades based on previous test scores," says Kangming Li, a postdoctoral researcher in Hattrick-Simpers' lab.

"If you trained it only on students from Canada, it might do perfectly well in that context, but it might fail to accurately predict grades for students from France or Japan. That's the situation we are up against in the world of materials."

One possible solution is to identify subsets of data from within very large datasets that are easier to process, but which nevertheless retain the full range of information and diversity present in the original.  
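One common way to do this is diversity-aware subsampling: keep points that are spread out across feature space rather than clustered together. Below is a minimal sketch using greedy farthest-point sampling; the feature matrix, subset size, and the method itself are illustrative assumptions, not the specific approach used in the study.

```python
# Minimal sketch: greedy farthest-point sampling to pick a small but
# diverse subset of a dataset. Illustrative only -- the feature matrix
# and subset size are assumptions, not the study's actual method.
import numpy as np

def farthest_point_subset(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedily choose k rows of X that spread out across feature space."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]               # random starting point
    dists = np.linalg.norm(X - X[chosen[0]], axis=1)   # distance to the subset
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                    # farthest remaining point
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)

X = np.random.rand(10_000, 32)          # stand-in material feature vectors
idx = farthest_point_subset(X, k=500)   # keep ~5 per cent of the rows
X_small = X[idx]
```

Because each new point is the one farthest from everything already selected, the subset covers the edges of the distribution instead of oversampling its dense clusters.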

To better understand how the qualities of datasets affect the models they are used to train, Li designed methods to identify high-quality subsets of data from previously published materials datasets, such as JARVIS, The Materials Project, and the Open Quantum Materials Database (OQMD). Together, these databases contain information on more than a million different materials.  

Li built a computer model that predicted material properties and trained it in two ways: one using the original dataset and the other using a subset of that same data that was approximately 95 per cent smaller.

"What we found was that when trying to predict the properties of a material that was contained within the domain of the dataset, the model that had been trained on only 5 per cent of the data performed about the same as the one that had been trained on all the data," Li says.

"Conversely, when trying to predict the properties of a material that was outside the domain of the dataset, both of them did similarly poorly."
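A minimal sketch of the kind of comparison Li describes, using scikit-learn and synthetic data: train the same model once on a full dataset and once on a roughly 5-per-cent subset, then test each on in-domain and out-of-domain examples. The model, features, and domain shift here are all illustrative assumptions, not the study's actual setup.

```python
# Minimal sketch of the full-data vs. 5-per-cent comparison described
# above, on synthetic data. Model, features and the domain shift are
# illustrative assumptions, not the study's actual setup.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.random((20_000, 16))
y = X.sum(axis=1) + rng.normal(0, 0.1, len(X))      # stand-in material property

X_in = rng.random((1_000, 16))                      # test data like the training data
X_out = rng.random((1_000, 16)) + 2.0               # test data outside that domain

subset = rng.choice(len(X), size=len(X) // 20, replace=False)   # ~5 per cent

for name, Xt, yt in [("full data", X, y), ("5% subset", X[subset], y[subset])]:
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xt, yt)
    mae_in = mean_absolute_error(X_in.sum(axis=1), model.predict(X_in))
    mae_out = mean_absolute_error(X_out.sum(axis=1), model.predict(X_out))
    print(f"{name}: in-domain MAE {mae_in:.3f}, out-of-domain MAE {mae_out:.3f}")
```

On data like this, the two models score nearly identically in-domain, while both fail badly on the shifted test set – the same pattern the study reports.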

Li says that the findings suggest a way of measuring the amount of redundancy in a given dataset: if more data does not improve model performance, it could be an indicator that those additional data are redundant and do not provide new information for the models to learn.   
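One way to make that check concrete is a learning curve: train on progressively larger slices of the data and watch whether the held-out error keeps improving. A minimal sketch, with the dataset, model, and plateau threshold all assumed for illustration:

```python
# Minimal sketch of a learning-curve redundancy check: if held-out error
# stops improving as the training set grows, the extra data are adding
# little new information. Dataset, model and threshold are assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.random((50_000, 16))
y = X @ rng.random(16) + rng.normal(0, 0.05, len(X))   # stand-in property

X_test, y_test = X[:5_000], y[:5_000]                  # held-out evaluation set
X_pool, y_pool = X[5_000:], y[5_000:]                  # training pool

prev = None
for n in (500, 1_000, 2_000, 5_000, 10_000, 20_000, 45_000):
    mae = mean_absolute_error(
        y_test, Ridge().fit(X_pool[:n], y_pool[:n]).predict(X_test))
    flag = "  <- plateau: extra data look redundant" if prev and mae > 0.99 * prev else ""
    print(f"n={n:>6}  MAE={mae:.4f}{flag}")
    prev = mae
```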

"Our results also reveal a concerning degree of redundancy hidden within these highly sought-after large datasets," Li adds.

The study underscores what AI experts from many fields are now discovering: even models trained on relatively small datasets can perform well if the data is of high enough quality.

"All this grew out of the fact that in terms of using AI to speed up materials discovery, we're just getting started," says Hattrick-Simpers.

"What it suggests is that as we go forward, we need to be really thoughtful about how we build our datasets. That's true whether it's done from the top down, as in selecting a subset of data from a much larger dataset, or from the bottom up, as in sampling new materials to include.

"We need to pay attention to the information richness, rather than just gathering as much data as we can."
