A general formula with high generalization and accurate prediction power is highly desirable for science, technology and engineering. In addition to human beings, artificial intelligence algorithms show great promise for the discovery of formulas. In this study, we propose a domain knowledge-guided interpretive machine learning strategy and demonstrate it by studying the oxidation behavior of ferritic-martensitic steels in supercritical water. The oxidation Cr equivalent is, for the first time, proposed in the present work to represent all contributions of alloying elements to oxidation, derived by our domain knowledge and interpretive machine learning algorithms. An open-source tree classifier for linear regression algorithm is also, for the first time, developed to materialize the formula with collected data. This algorithm effectively captures the linear correlation between compositions, testing environments and oxidation behaviors from the data. The sure independence screening and sparsifying operator algorithm finally assembles the information derived from the tree classifier for linear regression algorithm, resulting in a general formula. The general formula with the determined parameters has the power to predict, quantitatively and accurately, the oxidation behavior of ferritic-martensitic steels with multiple alloying elements exposed to various supercritical water environments, thereby providing guidance for the design of anti-oxidation steels and hence promoting the development of power plants with improved safety. The present work demonstrates the power of domain knowledge-guided interpretive machine learning with respect to the data-driven discovery of physics-informed formulas and the acceleration of materials informatics development.

The rapid development of materials informatics ^{[1-5]}, artificial intelligence (AI) and machine learning (ML) techniques has led to a new paradigm of data-driven discovery of novel materials, state-of-the-art experimental and computational methods and scientific laws and formulas. The number of publications on materials informatics has increased exponentially in the past decade and materials informatics has achieved great success in many areas ^{[6-10]}. For example, Xue ^{[11]} proposed an adaptive design iteration strategy by tightly coupling ML with experiments, which sequentially identifies the next experiments by using efficient global optimization to balance the trade-off between exploitation and exploration. This adaptive design strategy, also known as active learning, starts from an initial dataset of 22 alloys, runs nine feedback loops in the search space of ^{[12]} used a convolutional neural network-based ML model to automatically obtain the key features for the accurate prediction of catalytic properties, such as adsorption energies. The ML model yields a DOSnet, which has the capacity to provide physically meaningful predictions and insights by predicting responses to external perturbations to the electronic structure without additional calculations.

Attia ^{[13]} developed an ML methodology to efficiently optimize a parameter space specifying the current and voltage profiles of six-step, 10-min fast-charging protocols for maximizing battery cycle life. They trained an elastic net ML model to predict battery charging/discharging life using data only from the first few cycles and employed a Bayesian optimization algorithm to reduce the number of experiments by balancing exploration and exploitation to efficiently probe the parameter space of charging protocols. With such an approach, they identified and validated high-cycle-life charging protocols among 224 candidates in 16 days. Saito ^{[14]} conducted an image process by using U-Net based on a convolutional encoder-decoder network to segment and identify the thickness of atomic layer flakes from optical microscopy images, achieving a success rate of 70–80% in distinguishing monolayer and bilayer MoS_{2} and graphene.

ML is achieving remarkable success in materials science and engineering ^{[15, 16]} and will achieve even greater success if it can become more transparent and interpretive. Theoretically, AI and ML are based on statistics without utilizing any other scientific laws, principles and (physical) equations and most AI and ML algorithms perform as "black-box" systems ^{[17-23]}. Considerable efforts, such as physics-informed neural networks ^{[24]}, symbolic regression and Shapley additive explanations (SHAP) ^{[25]}, are being carried out to enhance the interpretability of ML models. Obviously, significant further endeavors are required to make ML models interpretive. The strategy proposed in the present work, i.e., domain knowledge-guided interpretive ML, might pave the way for the discovery of mathematical formulas.

In the present work, we propose a domain knowledge-guided interpretive ML strategy to make ML models interpretable and have more physical sense and apply this strategy to the data-driven discovery of formulas regarding the oxidation of ferritic-martensitic (FM) steels in supercritical water (SCW). Although the use of SCW in power plants can achieve enhanced thermal efficiency with simplified plant design and improved safety, it requires high anti-oxidation materials because SCW is a strong oxidant ^{[26]} beyond the supercritical point (at 374 ℃ and 22.1 MPa). FM steels are some of the most promising structural materials for use in SCW-cooled power plants, owing to their high elevated temperature strength, high creep resistance, high thermal conductivity, low swelling behavior under irradiation, low thermal expansion coefficient, and low susceptibility to stress oxidation cracking up to 600 ℃ ^{[27, 28]}.

The oxidation behavior of FM steels in SCW environments has been investigated extensively through experimental approaches ^{[29-34]}. The current understanding of the corrosion occurring in high-temperature water environments is associated with the chemistry and physics of the water (density and dielectric constant of the medium). In high-temperature water with a low density/dielectric constant (^{[30]}. The oxidation behavior of FM alloys in SCW depends on the alloy composition and oxidation environment. Ampornrat and Was ^{[29]} experimentally investigated the corrosion behavior of three FM alloys (T91, HCM12 A and HT-9) in SCW at temperatures ranging from 400 to 600 ℃ with dissolved oxygen concentrations of ^{[35]}. Dong ^{[36]} reported that with increasing Mn content, the oxide scale becomes discontinuous and thicker, and thus Mn might be harmful to the oxidation of FM steels in SCW.

Significant progress has been achieved in the investigation and understanding of FM steel oxidation in various SCW environments ^{[29, 30, 37-39]}, as evidenced by the Arrhenius equation of ^{[29]}, where

Oxidation is clearly a thermally activated process, and the associated thermodynamics and kinetics are greatly dependent on the material compositions and environmental variables. In experimental investigations, individual researchers adjust only one or a few experimental conditions and the obtained result and formula are valid only for the FM steels and SCW environments and periods tested. To the best of our knowledge, no generalized formula has been established for the description and/or prediction of the oxidation of FM steels with any given alloying elements exposed to various SCW environmental conditions. The present work adopts domain knowledge-guided interpretive ML to discover a generalized formula for the oxidation of FM steels in SCW, which will promote the development of green and safe power plants. In addition to exposure time, the investigated FM steels cover 11 alloying elements, and the studied SCW environments include temperature, dissolved oxygen concentration (DOC) and pressure.

Our domain knowledge of oxidation suggests a dimensionless Arrhenius equation of ^{[25]}, extreme gradient boosting (Xgboost) ^{[40]} and the sure independence screening and sparsifying operator (SISSO) ^{[41]}, and more significantly, a newly developed classifier model, the tree classifier for linear regression (TCLR) ^{[42]}, to discover a generalized formula from data for the oxidation of FM steels in SCW. Recently, the SHAP algorithm has been widely used to calculate quantitatively the contribution of each feature to a particular task ^{[5, 21]}. Xiong ^{[5]} found that critical SHAP values exist in some features when studying the hardness and ultimate tensile strength of complex concentrated alloys (CCAs) by ML. The critical feature value separates the SHAP values into positive and negative regions. This means that the feature values in the positive/negative SHAP value region improve/impair the mechanical properties of CCAs, thereby providing a straightforward assessment of the design of CCAs with high hardness and strength. Obviously, the application of SHAP, including pure and interaction SHAP values, will further promote the development of materials informatics.

The generalized formula established in this study accurately predicts the oxidation behavior of experimental FM steels with different alloying elements in various SCW testing conditions. ^{[41]} assembles the information from TCLR into a generalized formula [

Domain knowledge-guided interpretive ML strategy. A: Feature selection with

A total of 184 oxidation data of FM steels in SCW are collected from the literature and given at the online Supplementary Information. Every datum in the FM steel oxidation (FMO) dataset includes oxidation caused weight gain in units of mg/dm

Fifteen features of FM steel oxidation data

Alloying elements | Cr | Chromium (wt.%) |

Si | Silicon (wt.%) | |

Mn | Manganese (wt.%) | |

C | Carbon (wt.%) | |

Ni | Nickel (wt.%) | |

Mo | Molybdenum (wt.%) | |

Nb | Niobium (wt.%) | |

W | Tungsten (wt.%) | |

V | Vanadium (wt.%) | |

P | Phosphorus (wt.%) | |

Cu | Copper (wt.%) | |

Testing conditions | T | Absolute temperature (K) |

Pressure | SCW pressure (MPa) | |

t | Exposure time (h) | |

DOC | Dissolved oxygen concentration (ppb) |

Xgboost ^{[40]} is a powerful tree-based boosting ensemble algorithm. The present work employs the Xgboost algorithm to regress the oxidation data of FM steels in SCW, and the values of hyperparameters involved are optimized by cross-validation and/or a grid search in the hyperparameter space with the open python library scikit-learn ^{[43]}. Table S1 at Section 2.1 of the Supplementary Information lists all optimized values of the hyperparameters.

The SHAP algorithm is developed based on game theory ^{[25]}. A SHAP value, whether positive or negative, reflects the contribution of a feature to a predicted response in one datum, and the predicted response is given by an ML model. In the present work, SHAP values are calculated with Xgboost models. If there are

where

where the SHAP value

The game theory-based SHAP value is an additive feature attribution method, where the output is a sum of contributions of each input feature ^{[44]}. If the contributions of variables to a function are not additive in the original variable space, but additive in a mapped space, the SHAP value will be calculated in the mapped space. For the oxidation of FM steels in SCW, there are 15 features and each feature contributes to the oxidation weight gain

It might be inaccurate to calculate the reasonable SHAP values in the original space. Based on the domain knowledge of oxidation, we take a dimensionless Arrhenius equation of

An Xgboost model is first developed with all features via ten-fold cross-validation (10-CV) and is used to calculate all SHAP values

where

The data screening is carried out by evaluating the errors

A total of 184 data on FM steel oxidation in SCW are collected from the literature and provided in the Supplementary Information. Fifteen features are employed here and categorized into two groups, namely, alloying elements and testing conditions. The feature analysis is carried out within each of the groups. The SHAP values

SHAP analysis of features. A-D: SHAP values of testing conditions, i.e., temperature, time, DOC and pressure. E-I: SHAP values of alloy compositions, i.e., V, Si, Cr, Ni and Mn. J: Feature importance ranking by

As expected,

The SHAP value of

It is somewhat surprising that vanadium plays the most important role among the studied 11 alloying elements in the oxidation of FM steels in SCW, as shown in ^{[45]} and thus alloying Ni with a content higher than 0.13 wt.% into FM steels enhances the oxidation resistance of FM steels in SCW, as shown in

The feature importance defined in the SHAP method (see Methods),

The feature selection is then conducted by the sequential backward selector wrapped with Xgboost and 10-CV, which yields the three testing features of

The SHAP value of each feature is decomposed into its pure SHAP value and the interaction SHAP values, as stated in Eq. (1b).

Pure SHAP value analysis of important features. A-C: Pure SHAP values of testing conditions, i.e., temperature, time and DOC. D-H: Pure SHAP values of alloy compositions, i.e., V, Ni, Cr, Si and Mn.

There are a few reasons causing multiple pure SHAP values at a given feature. The first reason might be experimental errors, which measure the degree of the experimental scattering of repeated tests. The second reason might be attributed to the Xgboost model, which approximately estimates the response from the input feature data rather than a perfect function. The third reason might be the method used to calculate the SHAP values from a tree-based algorithm (Tree-Explainer model).

From the pure SHAP value of an individual feature, we defined the joint SHAP value of two features as

which measures the joint contribution of two features. In general,

where

Joint SHAP value of two features and the derived oxidation Cr equivalent concentration. A-D: Joint SHAP value of two features, i.e., Cr with Si, Mn, Ni and V, respectively. E: Predicted values of Xgboost model versus experimental values with the transferred four features. F: Pure SHAP analysis of oxidation Cr equivalent concentration feature.

Hereafter, we use one feature of the oxidation Cr equivalent concentration to replace the five element features. Thus, the total number of features is reduced to four, one chemical composition feature and three testing condition features. With the four features, the Xgboost model is retrained with 10-CV. The predictions on the 178 data are plotted in

The oxidation mechanism of FM steels in SCW is embodied in the activation energy and time exponent ^{[29, 46]}, which are the coefficients of

The TCLR tree must be pruned, otherwise, many leaves contain only two data per leaf, which destroys the model generalization considerably. A threshold of data number might also be introduced to prune the tree, i.e., the amount of data in a leaf should not be smaller than a pre-set threshold (default minsize

The entire feature space is estimated from the regions of the four features, i.e.,

Data located on the leaf of TCLR. (A), (B) One passed leaf and one failed leaf on TCLR of activation energy. (C), (D) One passed leaf and one failed leaf on TCLR of time exponents.

The TCLR yields the values of activation energy Q and time exponent

The SISSO algorithm ^{[41]}, with minimization of mean absolute percentage error (MAPE, see Supplementary Information) is carried out to find analytic expressions of activation energy Q and time exponent

Note that

As mentioned above, the pure SHAP values of the DOC feature suggest that the oxidation kinetic in logarithmic space yield in the form of

To have an analytic expression of

Putting all the analytic expressions together gives

The analytic formula of Eq. (5d) has strong predictive power, as shown in

Predicted values of Eq. 5(d) versus experimentally measured values. Each dot represents an FM sample, and the dots on the dotted line indicate the equation predicted values are consistent with experimental observations. The light blue region covers

In this study, we develop a domain knowledge-guided interpretive ML strategy and demonstrate it by the discovery of the generalized formula for FM steel oxidation in SCW. The domain knowledge suggests the generalized Arrhenius oxidation formula of

The oxidation chromium equivalent concentration

The developed TCLR algorithm is scientifically significant to materials informatics because it captures linear relationships between tasks and features. It is expected that when an original feature space is mapped to a high-dimensional space, the TCLR algorithm is able to capture linear relationships in the high-dimensional space. More affords are needed to further develop the TCLR algorithm.

The generalized Arrhenius oxidation formula has very high prediction accuracy with a Pearson correlation coefficient

Performed the research, analysed data, wrote the programmers, and drafted the manuscript: Cao B

Collected the data. Tong-Yi Zhang and Ziqiang Dong supervised the project: Yang S, Sun A

Studied and revised the manuscript more on the oxidation mechanism: Dong Z

Designed the study, performed the research, analysed data, revised, and finalized the manuscript: Zhang TY

Discussed the results: Cao B, Yang S, Sun A, Dong Z, Zhang TY

All experimental data collected in the study are contained in Supplementary Information (FM steel oxidation dataset) and are also available at: https://github.com/Bin-Cao/TCLRmodel. The ML methodology described in the present work was implemented in Python. Source codes of the programmers and algorithms are available at: https://github.com/Bin-Cao/TCLRmodel.

This work was sponsored by the National Key Research and Development Program of China (No. 2018YFB0704400), Key Program of Science and Technology of Yunnan Province (No. 202002AB080001-2), Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02), and Shanghai Pujiang Program (Grant No. 20PJ1403700).

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Not applicable.

Not applicable.

© The Author(s) 2022.

Supplementary Materials

Lundberg SM, Lee SI. A unified approach to interpreting model predictions, Proceedings of the 31st international conference on neural information processing systems, 2017, pp. 4768-4777. Available from: https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html [Last accessed on 20 Apr 2022]

Chen T, Guestrin C. Xgboost: A scalable tree boosting system; proceedings of the Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, F, 2016.

10.1145/2939672.2939785

Zhang TY, Cao B, Zhang SY, Sun S. Tree-classifier for linear regression software [No. 2021SR1951267], 2021. Available from: https://register.ccopyright.com.cn/ [Last accessed on 20 Apr 2022]

Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in python, the journal of machine learning research 12 (2012) 2825-2830. Available from: https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf?ref=https://githubhelp.com [Last accessed on 20 Apr 2022]

Lundberg SM, Erion GG, Lee SI. Consistent individualized feature attribution for tree ensembles.

10.48550/arXiv.1802.03888]