^{*}Correspondence to: Prof. Dezhen Xue, State Key Laboratory for Mechanical Behavior of Materials, Xi'an Jiaotong University, No. 28, West Xianning Road, Xi'an 710049, Shaanxi, China. E-mail:

In addition to being determined by its chemical composition and processing conditions, the performance of a material is also affected by the variables of its service space, including temperature, pressure, and frequency. A rapid means to estimate the performance of a material in its service space is urgently required to accelerate the screening of materials with targeted performance. In the present study, a materials informatics approach is proposed to rapidly predict performance within a service space based on existing data. We utilize an active learning loop, which employs an ensemble machine learning method to predict the performance, followed by a Bayesian experimental design to minimize the number of experiments for refinement and validation. This approach is demonstrated by predicting the damping properties of a ZE62 magnesium alloy in a service space defined by frequency, strain amplitude, and temperature based on the available data for other magnesium alloys. Several utility functions that recommend a particular experiment to refine the estimates of the service space are used and compared. In particular, maximizing the standard deviation is found to reduce the prediction error most efficiently. After augmenting the database with nine new experimental measurements, the uncertainties associated with the predicted damping capacities are largely reduced. Our method allows us to forecast the properties of a given material across its service space by rapid refinement of the predictions via experimental measurements.

High-throughput calculations and combinatorial experiments, together with data-driven approaches, are now widely employed to search for new materials with targeted properties in an accelerated manner^{[1-5]}. Such data-driven methodologies, including statistical inference, machine learning, and deep learning, usually serve as means to explore the vast, high-dimensional "material space" with unknown properties^{[6-9]}. These algorithms infer material properties from material descriptors or features, which essentially are functions of chemical compositions and processing conditions^{[10-13]}.

In addition to the intrinsic properties of materials, a variety of environmental factors during service affect the performance of a material^{[14]}. The variables within the working environment form the so-called "service space". For example, the service space for ship steel may include temperature and flow velocity, which in turn influence the corrosion rate, whereas, for a superalloy, the variables can include temperature, engine speed, and pressure^{[15,16]}. A rational selection of a material can be made only after its performance across the whole service space is known, yet the material space is too vast to explore exhaustively and the service space can be complex. The emphasis of current materials informatics approaches has largely been on down-selecting or exploring the material space, with few studies having explored the service space systematically and efficiently.

Machine learning offers an approach to address the complexity of both the material and service spaces^{[17-19]}. Predictive machine learning models map the material descriptors to performance, and experimental design selects optimal candidates for experiments to minimize the overall effort^{[20]}. In experimental studies in materials science, the size of the available data is typically small, which often degrades predictions because the associated uncertainties are large^{[21]}. Adaptive sampling provides an efficient means to explore a vast search space and has been utilized to overcome the limitations of small training data sets and large model uncertainties^{[20,22-25]}. Incorporating efficient sampling methods to guide new measurements thus iteratively refines the predictions over the service space with the fewest possible measurements.

Here, we propose an active learning approach that employs an ensemble machine learning method to predict the performance across the whole service space and then uses Bayesian experimental design to recommend candidates for experiments. Our experimental design suggests an experiment as a function of one variable in the service space. This is in contrast to fixing all the variables to given values, which is the usual approach employed with utility functions such as Efficient Global Optimization^{[26,27]}.
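The loop just described, predicting with an ensemble, picking the next experiment with a utility function, measuring, and retraining, can be sketched in a few lines. The sketch below is a minimal illustration, assuming scikit-learn and synthetic stand-in data; the model, feature count, and simulated "measurement" are placeholders, not the study's actual implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy stand-in data: rows are (material features..., T, f, strain), y is damping.
X = rng.uniform(size=(60, 5))
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=60)
X_pool = rng.uniform(size=(200, 5))  # unexplored service-space points

def bootstrap_predict(X_train, y_train, X_query, n_boot=20):
    """Mean and standard deviation over an ensemble of bootstrap models."""
    preds = []
    for b in range(n_boot):
        Xb, yb = resample(X_train, y_train, random_state=b)
        model = GradientBoostingRegressor(n_estimators=50).fit(Xb, yb)
        preds.append(model.predict(X_query))
    preds = np.asarray(preds)
    return preds.mean(axis=0), preds.std(axis=0)

# Active-learning iterations: predict, pick the most uncertain point,
# "measure" it (here with the toy ground truth), and augment the data.
for it in range(3):
    mu, sigma = bootstrap_predict(X, y, X_pool)
    best = int(np.argmax(sigma))  # max-standard-deviation utility function
    y_new = X_pool[best, 0] + 0.5 * X_pool[best, 3]  # simulated measurement
    X = np.vstack([X, X_pool[best]])
    y = np.append(y, y_new)
    X_pool = np.delete(X_pool, best, axis=0)
```

Each pass through the loop retrains the bootstrap ensemble on the augmented data, so the uncertainty estimates shrink around the points where measurements have been added.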

We demonstrate our approach by predicting the damping capacity of magnesium alloys in their service space. Magnesium alloys are known to exhibit good damping properties due to the easy motion of dislocations and the weak pinning effects on them^{[28]}. They are widely applied in structures ranging from aircraft to electrical devices, which usually require noise/vibration reduction and shock absorption^{[29]}. The alloying elements, including Zr, Zn, Cu, Ca, and rare earth elements, form secondary phases, introduce point defects, and modify the grain size, all of which affect the damping properties of the alloy^{[30-32]}. These possible variations in chemistry lead to a vast material space for magnesium alloys. More importantly, the mobility of crystalline defects, such as dislocations and twin boundaries, depends on environmental variables such as frequency, strain amplitude, and temperature^{[33-37]}. These variables form the service space for the damping capacity.

We use an ensemble learning model to estimate the damping capacity in the three-dimensional service space of frequency, strain amplitude, and temperature^{[38]}. Six different utility functions are used to recommend experiments to reduce the uncertainties of the predictions. We find that maximizing the standard deviation of the predicted outcome of an experiment is the most efficient strategy. Guided by the utility functions, we iteratively augment the data with nine new experimental measurements and find that the uncertainties associated with the predicted damping capacities are largely reduced. Our approach provides a framework to predict the performance of a given material across its service space and allows the predictive algorithm to choose the experiments from which it learns most efficiently with less training data.

Our ensemble learning model is applied to the recently developed as-cast Mg-6Zn-2RE (wt.%) alloy (ZE62). Specifically, ZE62 was prepared from pure Mg (99.99%), pure Zn (99.99%), and rare earth elements (Gd, Nd, Ce, and Y) in a resistance furnace. The 20.00

We trained six supervised machine learning models on samples in the training set to map the material features and the variables in the service space to the property: a support vector machine with a radial basis function kernel, random forest, polynomial regression, a neural network, a gradient boosting decision tree, and extreme gradient boosting. The latter two are tree-based boosting ensembles, whereas the others are standard supervised models. These models were implemented in the R language using the packages listed below.

Details of machine learning methods used in the present study

Model | R package | Parameter settings
svr.rbf | e1071 | Radial basis function kernel; gamma = 1, cost = 30
rf | randomForest | Number of trees: 500
poly | stats | Powers of 2, 3, and 3 for dif.Esurf, dif.Emelt, and dif.Ymod; powers of 2, 3, and 1 for strain, temperature, and frequency
nnet | nnet | Size of the neural network: 50
gbm | gbm | Number of trees: 300; interaction depth: 5
mxgb | xgboost | Maximum tree depth: 6; evaluation metric: root mean squared error
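For illustration, the models in the table map loosely onto scikit-learn estimators as follows. This is a hedged analog of the study's R implementation, not a reproduction: the hyperparameters are translated approximately (e.g., interaction depth to max_depth), the poly model's per-feature powers are simplified to a full degree-2 expansion, and the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(size=(120, 6))  # stand-in: material features + service variables
y = X @ rng.uniform(size=6) + 0.05 * rng.normal(size=120)

# 80/20 train/test split, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "svr.rbf": SVR(kernel="rbf", gamma=1.0, C=30.0),
    "rf": RandomForestRegressor(n_estimators=500, random_state=0),
    "poly": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "nnet": MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0),
    "gbm": GradientBoostingRegressor(n_estimators=300, max_depth=5, random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
```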

Flowchart of our design strategy including prediction and optimization.

A database with known material features (

We built a training dataset containing 769 data points with known damping capacity. The data are from 14 as-cast magnesium alloys. The distribution of different alloying elements in the training data is shown in the radar chart in

Visualization of training dataset for damping capacity of magnesium alloys used. (A). Distribution of different elements in training data. (B). Distribution of samples in training data in the plane of two principal components (PC1 and PC2). The color indicates the damping values.

Both the chemical compositions and environmental variables strongly affect the damping capacity of magnesium alloys. Two sets of independent variables are thus needed to serve as the inputs to the surrogate model, namely, the material features and the variables of the service space.

The compositions of different elements can potentially be used as material features, but this leads to a high-dimensional feature space, as well as poor interpretability of the surrogate model. More importantly, a model based on chemical composition alone usually generalizes poorly, especially when new elements appear in the unexplored search space. We thus establish a pool of material features based on 12 physical properties of elements, as listed in the table below.

Physical properties of elements and material features

Physical property | Material features
Melting point (K) | ave.Tm, dif.Tm
Electronegativity (Martynov and Batsanov) | ave.elgMB, dif.elgMB
Cohesive energy (J/mol) | ave.Ecoh, dif.Ecoh
1st ionization energy (kJ/mol) | ave.1Eion, dif.1Eion
2nd ionization energy (kJ/mol) | ave.2Eion, dif.2Eion
Enthalpy of melting (kJ/mol) | ave.Emelt, dif.Emelt
Enthalpy of surface (Miedema) (kJ/mol) | ave.Esurf, dif.Esurf
Metallic radii (Å) | ave.Rmet, dif.Rmet
Valence electron number | ave.venum, dif.venum
Work function (eV) | ave.wf, dif.wf
Young's modulus (GPa) | ave.Ymod, dif.Ymod
Atomic mass | ave.atmass, dif.atmass
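One plausible way to turn such elemental properties into ave.X and dif.X alloy features is a composition-weighted average and a composition-weighted deviation from the Mg matrix. The exact weighting scheme is not spelled out here, so the sketch below is an assumption for illustration; the elemental melting points are rounded literature values.

```python
import numpy as np

# Elemental melting points in K (rounded literature values, for illustration).
Tm = {"Mg": 923.0, "Zn": 692.7, "Y": 1799.0}

def alloy_features(comp, prop, base="Mg"):
    """Composition-weighted average (ave.X) and weighted deviation from the
    matrix element (dif.X) of an elemental property -- one plausible reading
    of the feature construction described in the text, not the paper's
    confirmed formula."""
    fracs = np.array(list(comp.values()))
    vals = np.array([prop[el] for el in comp])
    ave = float(np.dot(fracs, vals) / fracs.sum())
    dif = float(np.dot(fracs, np.abs(vals - prop[base])) / fracs.sum())
    return ave, dif

# Example: a hypothetical Mg-6Zn-2Y (at. fractions illustrative) composition.
ave_Tm, dif_Tm = alloy_features({"Mg": 0.92, "Zn": 0.06, "Y": 0.02}, Tm)
```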

As shown in

Pearson correlation and relative influence of features. (A). Graphical representation of Pearson correlation matrix for the 24 material features. Blue and red indicate positive and negative correlations, respectively. The darker the tone and the larger the circle, the more significant the corresponding correlation. (B). Relative influence of features according to gradient boosting, which indicates the impact of features on the property. These features are preselected by Pearson filtering.

The service-space variables also strongly affect the damping capacity^{[28]}. However, there is no general tendency for the damping capacity of magnesium alloys with temperature or frequency^{[31,33]}. Here, we discretize these factors and set up a three-dimensional service space with discretized variable values. The temperature ranges from 273.15 to 373.15 K in steps of 5 K. The frequency varies from 1 to 20 Hz in steps of 1 Hz. The strain amplitude changes from 10
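Enumerating this discretized service space as a grid of candidate measurement points can be sketched as follows; the temperature and frequency grids follow the text, while the strain-amplitude grid below is an illustrative assumption.

```python
import numpy as np
from itertools import product

# Discretized service space. Temperature: 273.15-373.15 K in steps of 5 K;
# frequency: 1-20 Hz in steps of 1 Hz; strain amplitude: illustrative
# logarithmic grid (the exact range is an assumption here).
temperatures = np.arange(273.15, 373.15 + 1e-9, 5.0)  # 21 values
frequencies = np.arange(1, 21)                        # 20 values, Hz
strains = np.logspace(-5, -3, 9)                      # hypothetical grid

# Every (T, f, strain) combination is a candidate point in the service space.
grid = np.array(list(product(temperatures, frequencies, strains)))
```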

We employ five different machine learning models to estimate the damping capacity, including a support vector machine with a radial basis function kernel (svr.rbf), a random forest regression tree model (rf), a polynomial regression model (poly), a neural network (nnet) and a gradient boosting model (gbm). The original dataset is split into two parts, i.e., 80% for the training set and the remaining 20% for the testing set. The model performance can be visualized by plotting the predicted damping capacity as a function of the measured values.

Performance of machine learning models. The predicted damping capacity is plotted as a function of the measured values. The blue dots represent the training set and the purple dots are for the testing set. (A). Support vector regression with radial basis function kernel (svr.rbf). (B). Random forest regression tree model (rf). (C). Polynomial regression model (poly). (D). Neural network (nnet). (E). Gradient boosting model (gbm). (F). Ensemble learning model of extreme gradient boosting (mxgb). The insets show the mean and standard deviation of the predicted value obtained by the bootstrap resampling method.

Here, we also use the boosting method of ensemble learning, which uses decision trees as base learners and then integrates the outcomes from these learners for a more accurate predictive model. The extreme gradient boosting algorithm (mxgb) is employed and its performance is shown in

We further evaluate the performance of the models by estimating the training and test errors. All data in the dataset are used to train the regression model and obtain the prediction for each sample. The training error is calculated by comparing the predictions with the measured values.

Performance of different models in terms of training and test errors. (A). Training error of RMSE.train. (B). Test error of RMSE.boots. (C). Test error of RMSE.cv. The ensemble learning model of the extreme gradient boosting (mxgb) outperforms the other models.

The cross-validation and the bootstrap method with replacement are employed to estimate the test error for these models. Bootstrap resampling is commonly used to evaluate the robustness of models. It is implemented by sampling the data with replacement. In the present study, we sample
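The bootstrap estimate of the test error can be sketched as follows: fit the model on a resample drawn with replacement and score it on the out-of-bag points the resample missed, then average over resamples. The gradient boosting model and synthetic data below are stand-ins for illustration, not the study's configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample

rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 4))
y = X.sum(axis=1) + 0.1 * rng.normal(size=100)

def rmse_boots(X, y, n_boot=30):
    """Bootstrap test-error estimate: train on each resample, evaluate on the
    out-of-bag samples it did not draw, and aggregate the RMSE."""
    errs = []
    n = len(y)
    for b in range(n_boot):
        idx = resample(np.arange(n), random_state=b)   # draw with replacement
        oob = np.setdiff1d(np.arange(n), idx)          # held-out points
        model = GradientBoostingRegressor(n_estimators=50).fit(X[idx], y[idx])
        errs.append(mean_squared_error(y[oob], model.predict(X[oob])) ** 0.5)
    return float(np.mean(errs)), float(np.std(errs))

mean_rmse, std_rmse = rmse_boots(X, y)
```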

We used the mxgb model to estimate the damping capacity of unexplored magnesium alloys in the service space of frequency, strain amplitude, and temperature.

Estimated damping capacity of ZE62 alloy in its service space.

The central question is which points in the service space in

The first two utility functions consider the uncertainty associated with the damping capacity. We use the bootstrap resampling method to estimate the standard deviation (

where

The next two utility functions consider the influence of how the points in the service space change the model, i.e., the change compared to the current model after augmenting the data of selected candidates. We would like to choose the experiment that can change the model most^{[39]}. The model change (

where

Distance is another consideration for the utility function, which evaluates how "far" the new data point (

For an experiment with several points (

Query by committee is a common approach in active learning and defines a promising candidate as one with the highest deviation amongst the predictions of different learners^{[40]}. The difference between different models is defined as the sum of deviation between the prediction and the mean value from the models:

where

Therefore, in total, we propose six utility functions, namely,
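Three of these criteria, the ensemble standard deviation, query by committee, and the distance to the training data, can be written compactly from the ensemble's predictions. The expressions below are plausible readings of the descriptions above, not the paper's exact formulas.

```python
import numpy as np

def utilities(preds, X_pool, X_train):
    """Illustrative utility functions computed from an ensemble's predictions
    `preds` with shape (n_models, n_pool). These are sketches of the criteria
    described in the text, under assumed functional forms."""
    mean = preds.mean(axis=0)
    # Uncertainty: standard deviation over the ensemble (bootstrap) models.
    std = preds.std(axis=0)
    # Query by committee: summed deviation of each learner from the mean.
    qbc = np.abs(preds - mean).sum(axis=0)
    # Distance: how far each candidate lies from the existing training data.
    dist = np.min(
        np.linalg.norm(X_pool[:, None, :] - X_train[None, :, :], axis=2), axis=1
    )
    return std, qbc, dist

# Tiny worked example with two "models" and three candidate points.
preds = np.array([[1.0, 2.0, 3.0], [1.2, 2.4, 2.6]])
X_pool = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
X_train = np.array([[0.0, 0.1], [2.0, 2.0]])
std, qbc, dist = utilities(preds, X_pool, X_train)
```

Whichever utility is chosen, the recommended candidate is simply the pool point that maximizes it.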

For comparison, a "random" selection strategy, which picks the next experiment at random, is also included as a baseline.

We perform seven experiments based on the seven selection criteria and augment the data, as shown in

Error changes for different selectors with increasing iterations. (A). Mean absolute error of selected untested experiments. (B). Relative mean absolute error of selected untested experiments. (C). Predicted value before refinement as a function of measured values. (D). Predicted values after refinement by the utility function of

In summary, we propose a materials informatics approach to rapidly estimate the performance of an alloy within its service space. It employs an ensemble machine learning method to initially predict the performance in the service space and then, for refinement and validation, utilizes Bayesian experimental design to minimize the number of experiments, all within an active learning framework. We use the approach to predict the damping properties of a ZE62 magnesium alloy in the service space of frequency, strain amplitude, and temperature. Several utility functions are employed to recommend a particular experimental curve, and their efficiency in reducing the uncertainties in estimation is compared. The

Methodology, software, investigation, writing - original draft: Shi B

Conceptualization, resources, writing - review & editing: Zhou Y

Resources, writing - review & editing: Fang D

Code checking, validation: Tian Y

Resources, supervision: Ding X

Resources, supervision: Sun J

Conceptualization, visualization, writing - review & editing: Lookman T

Conceptualization, visualization, writing - review & editing: Xue D

The data used in the current study are available from the corresponding author upon reasonable request.

The authors gratefully acknowledge the support of the National Key Research and Development Program of China (2021YFB3802102), the National Natural Science Foundation of China (Grant Nos. 52173228 and 51931004), and the 111 Project 2.0 (BP2018008).

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Not applicable.

Not applicable.

© The Author(s) 2022.

Aggarwal R, Demkowicz MJ, Marzouk YM. Information-driven experimental design in materials science. In: Lookman T, Alexander FJ, Rajan K, editors. Information science for materials discovery and design. Cham: Springer International Publishing; 2016. pp. 13-44.

10.1007/978-3-319-23871-5_2

Callister WD, Rethwisch DG. Materials science and engineering: an introduction. 10th ed. John Wiley & Sons, Inc.; 2018.

10.1002/maco.19940451110

Carpentier A, Lazaric A, Ghavamzadeh M, Munos R, Auer P. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In: Kivinen J, Szepesvári C, Ukkonen E, Zeugmann T, editors. Algorithmic learning theory. Berlin: Springer Berlin Heidelberg; 2011. pp. 189-203.

10.1007/978-3-642-24412-4_17

Landkof B. Magnesium applications in aerospace and electronic industries. In: Kainer KU, editor. Magnesium alloys and their applications. Weinheim: Wiley-VCH Verlag GmbH & Co. KGaA; 2000. pp. 168-72.

10.1002/3527607552.ch28

Cai WB, Zhang Y, Zhou J. Maximizing expected model change for active learning in regression: Proceedings of the IEEE 13th International Conference on Data Mining; 2013 Dec 7-10; Texas, USA. IEEE; 2013. pp. 51-60.

10.1109/ICDM.2013.104

Burbidge R, Rowland JJ, King RD. Active Learning for Regression Based on Query by Committee. In: Yin H, Tino P, Corchado E, Byrne W, Yao X, editors. Intelligent data engineering and automated learning - IDEAL 2007. Berlin: Springer Berlin Heidelberg; 2007. pp. 209-18.

10.1007/978-3-540-77226-2_22