Domain knowledge-guided interpretive machine learning: formula discovery for the oxidation behavior of ferritic-martensitic steels in supercritical water

A general formula with high generalization and accurate prediction power is highly desirable for science, technology and engineering. In addition to human beings, artificial intelligence algorithms show great promise for the discovery of formulas. In this study, we propose a domain knowledge-guided interpretive machine learning strategy and demonstrate it by studying the oxidation behavior of ferritic-martensitic steels in supercritical water. The oxidation Cr equivalent is, for the first time, proposed in the present work to represent all contributions of alloying elements to oxidation, derived by our domain knowledge and interpretive machine learning algorithms. An open-source


INTRODUCTION
The rapid development of materials informatics [1][2][3][4][5] , artificial intelligence (AI) and machine learning (ML) techniques has led to a new paradigm of data-driven discovery of novel materials, state-of-the-art experimental and computational methods and scientific laws and formulas.The number of publications on materials informatics has increased exponentially in the past decade and materials informatics has achieved great success in many areas [6][7][8][9][10] .For example, Xue et al. [11] proposed an adaptive design iteration strategy by tightly coupling ML with experiments, which sequentially identifies the next experiments by using efficient global optimization to balance the trade-off between exploitation and exploration.This adaptive design strategy, also known as active learning, starts from an initial dataset of 22 alloys, runs nine feedback loops in the search space of ∼800,000 alloys and successfully finds very low thermal hysteresis NiTi-based shape memory alloys.With a dataset of electronic density of states (DOS) generated by high-throughput density functional theorybased computations, Fung et al. [12] used a convolutional neural network-based ML model to automatically obtain the key features for the accurate prediction of catalytic properties, such as adsorption energies.The ML model yields a DOSnet, which has the capacity to provide physically meaningful predictions and insights by predicting responses to external perturbations to the electronic structure without additional calculations.
Attia et al. [13] developed an ML methodology to efficiently optimize a parameter space specifying the current and voltage profiles of six-step, 10-min fast-charging protocols for maximizing battery cycle life.They trained an elastic net ML model to predict battery charging/discharging life using data only from the first few cycles and employed a Bayesian optimization algorithm to reduce the number of experiments by balancing exploration and exploitation to efficiently probe the parameter space of charging protocols.With such an approach, they identified and validated high-cycle-life charging protocols among 224 candidates in 16 days.Saito et al. [14] conducted an image process by using U-Net based on a convolutional encoder-decoder network to segment and identify the thickness of atomic layer flakes from optical microscopy images, achieving a success rate of 70-80% in distinguishing monolayer and bilayer MoS 2 and graphene.
ML is achieving remarkable success in materials science and engineering [15,16] and will achieve even greater success if it can become more transparent and interpretive.Theoretically, AI and ML are based on statistics without utilizing any other scientific laws, principles and (physical) equations and most AI and ML algorithms perform as "black-box" systems [17][18][19][20][21][22][23] .Considerable efforts, such as physics-informed neural networks [24] , symbolic regression and Shapley additive explanations (SHAP) [25] , are being carried out to enhance the interpretability of ML models.Obviously, significant further endeavors are required to make ML models interpretive.The strategy proposed in the present work, i.e., domain knowledge-guided interpretive ML, might pave the way for the discovery of mathematical formulas.
In the present work, we propose a domain knowledge-guided interpretive ML strategy to make ML models interpretable and have more physical sense and apply this strategy to the data-driven discovery of formulas regarding the oxidation of ferritic-martensitic (FM) steels in supercritical water (SCW).Although the use of SCW in power plants can achieve enhanced thermal efficiency with simplified plant design and improved safety, it requires high anti-oxidation materials because SCW is a strong oxidant [26] beyond the supercritical point (at 374 • C and 22.1 MPa).FM steels are some of the most promising structural materials for use in SCWcooled power plants, owing to their high elevated temperature strength, high creep resistance, high thermal conductivity, low swelling behavior under irradiation, low thermal expansion coefficient, and low susceptibility to stress oxidation cracking up to 600 • C [27,28] .
The oxidation behavior of FM steels in SCW environments has been investigated extensively through experimental approaches [29][30][31][32][33][34] .The current understanding of the corrosion occurring in high-temperature water environments is associated with the chemistry and physics of the water (density and dielectric constant of the medium).In high-temperature water with a low density/dielectric constant (< 0.1 g/cm 3 / < 10), the direct oxidation reaction occurs between materials and oxidants (oxygen and/or water), while in high-temperature water with a high density/dielectric constant (> 0.1 g/cm 3 / > 10), electrochemical oxidation proceeds via the partial anodic and cathodic reactions [30] .The oxidation behavior of FM alloys in SCW depends on the alloy composition and oxidation environment.Ampornrat and Was [29] experimentally investigated the corrosion behavior of three FM alloys (T91, HCM12 A and HT-9) in SCW at temperatures ranging from 400 to 600 • C with dissolved oxygen concentrations of < 10 ppb (deaerated conditions), 100 ppb and 300 ppb.Cr is conducive to the formation of a continuous inner Cr-Fe oxide layer to block the transportation of Fe (outward) and oxygen (inward); therefore, a higher concentration of Cr generally corresponds to a lower oxidation rate of FM steels in SCW.Furthermore, Si facilitates the formability of Cr-rich areas in FM steels [35] .Dong et al. [36] reported that with increasing Mn content, the oxide scale becomes discontinuous and thicker, and thus Mn might be harmful to the oxidation of FM steels in SCW.
Significant progress has been achieved in the investigation and understanding of FM steel oxidation in various SCW environments [29,30,[37][38][39] , as evidenced by the Arrhenius equation of Δ =  eff exp −  nRT  1  [29] , where Δ per unit area (mg/dm 2 ) denotes the oxidation weight gained,  eff (mg/dm 2 /h) is the effective oxidation rate constant,  is the activation energy of oxidation (J/mol),  is the gas constant (8.314J/mol•K),  is the absolute temperature (K),  is time (h) and 1/ is the exponent over time.
Oxidation is clearly a thermally activated process, and the associated thermodynamics and kinetics are greatly dependent on the material compositions and environmental variables.In experimental investigations, individual researchers adjust only one or a few experimental conditions and the obtained result and formula are valid only for the FM steels and SCW environments and periods tested.To the best of our knowledge, no generalized formula has been established for the description and/or prediction of the oxidation of FM steels with any given alloying elements exposed to various SCW environmental conditions.The present work adopts domain knowledge-guided interpretive ML to discover a generalized formula for the oxidation of FM steels in SCW, which will promote the development of green and safe power plants.In addition to exposure time, the investigated FM steels cover 11 alloying elements, and the studied SCW environments include temperature, dissolved oxygen concentration (DOC) and pressure.

Our domain knowledge of oxidation suggests a dimensionless Arrhenius equation of
(DOC) 0 , where Δ 00 = 1 mg/dm 2 and is the reference weight gain, Δ 0 is a prefactor,  DOC+ (DOC) 0 is a function of the DOC, (DOC) 0 = 1 ppb and is the reference DOC,  is a parameter introduced to account for the oxidation in deaerated SCW and  is the exponent of time.With this Arrhenius equation, the present work integrates interpretive ML algorithms, including SHAP [25] , extreme gradient boosting (Xgboost) [40] and the sure independence screening and sparsifying operator (SISSO) [41] , and more significantly, a newly developed classifier model, the tree classifier for linear regression (TCLR) [42] , to discover a generalized formula from data for the oxidation of FM steels in SCW.Recently, the SHAP algorithm has been widely used to calculate quantitatively the contribution of each feature to a particular task [5,21] .Xiong et al. [5] found that critical SHAP values exist in some features when studying the hardness and ultimate tensile strength of complex concentrated alloys (CCAs) by ML.The critical feature value separates the SHAP values into positive and negative regions.This means that the feature values in the positive/negative SHAP value region improve/impair the mechanical properties of CCAs, thereby providing a straightforward assessment of the design of CCAs with high hardness and strength.Obviously, the application of SHAP, including pure and interaction SHAP values, will further promote the development of materials informatics.The generalized formula established in this study accurately predicts the oxidation behavior of experimental FM steels with different alloying elements in various SCW testing conditions.Figure 1 outlines the domain knowledge-guided interpretive ML strategy, where the hub is the domain knowledge suggested Arrhenius equation.The feature importance of SHAP ℜ   = |  | is applied in feature selection effectively.Outliers are screened out based on a calculated 99.7% confidence interval [Figure 1 A] of the ML model.The oxidation Cr equivalent concentration is introduced based on the interactive decomposition of SHAP values to interpret the joint contributions of alloying elements to the oxidation behavior [Figure 1B].The TCLR algorithm is an efficient classifier to locate samples to leaves with each sharing the same activation energy and/or time exponent [Figure 1C].The SISSO [41] assembles the information from TCLR into a generalized formula [Figure 1D].The Arrhenius equation is the starting point of the domain knowledge-guided interpretive ML strategy, which is a prior suggested based on our knowledge.The Arrhenius equation is also the posterior after the mining and evaluation of the ML algorithms with the experimental data, and thus transfers to a generalized formula.

Dataset
A total of 184 oxidation data of FM steels in SCW are collected from the literature and given at the online Supplementary Information.Every datum in the FM steel oxidation (FMO) dataset includes oxidation caused weight gain in units of mg/dm 2 , testing conditions, exposure time and steel chemical compositions, with a total of four testing features and 11 alloying element features (the balance element Fe is not accounted for), as shown in Table 1.

Xgboost and SHAP values
Xgboost [40] is a powerful tree-based boosting ensemble algorithm.The present work employs the Xgboost algorithm to regress the oxidation data of FM steels in SCW, and the values of hyperparameters involved are optimized by cross-validation and/or a grid search in the hyperparameter space with the open python library scikit-learn [43] .Table S1 at Section 2.1 of the Supplementary Information lists all optimized values of the hyperparameters.
The SHAP algorithm is developed based on game theory [25] .A SHAP value, whether positive or negative, reflects the contribution of a feature to a predicted response in one datum, and the predicted response is given by an ML model.In the present work, SHAP values are calculated with Xgboost models.If there are  features of ( 1 , • • • ,   ) and one predicted response ŷ =  ( 1 , • • • ,   ), the predicted response can be expressed by the SHAP values in an additive manner as where  0 denotes the SHAP value of the tree root in the tree-based Xgboost algorithm and   is the SHAP value of feature The SHAP value   can be decomposed into where the SHAP value  ,  denotes the pure contribution of feature   itself and  , is known as the interaction between features   and   , and accordingly,  ,  and  , are deemed the pure and interaction SHAP values, respectively.Equation (1b) indicates that the contribution from feature   to the predicted response is the sum of its pure contribution plus its joint contributions with other features.Therefore, a comprehensive study of SHAP values will provide insights into the mechanisms of the oxidation of FM steels in SCW.

Integration of SHAP values with domain knowledge
The game theory-based SHAP value is an additive feature attribution method, where the output is a sum of contributions of each input feature [44] .If the contributions of variables to a function are not additive in the original variable space, but additive in a mapped space, the SHAP value will be calculated in the mapped space.
For the oxidation of FM steels in SCW, there are 15 features and each feature contributes to the oxidation weight gain It might be inaccurate to calculate the reasonable SHAP values in the original space.Based on the domain knowledge of oxidation, we take a dimensionless Arrhenius equation of Δ Obviously, the features  and  are explicitly shown in the Arrhenius equation.The contributions from the other features to the oxidation weight gain are via the three parameters of Δ 0 ,  and .Although the contributions of features to the oxidation weight gain are not additive in the original Arrhenius equation, the logarithmic form of ln where  = 184 and  = 15 are the numbers of data and features, respectively.The feature importance ℜ   indicates that the higher the mean of the absolute SHAP values of a feature, the greater the contribution of the feature to the response will be.
The data screening is carried out by evaluating the errors   between the response   and the predicted response ŷ , viz., The errors should follow a normal distribution  ∼  (0, ).If the errors exhibit a nonzero mean, the prediction from an ML model deviates systematically from the real values.In this case, the ML model should be improved.If the errors show a normal distribution with a zero mean and few errors of data located far away from the mean, the 3 criterion is usually taken as the threshold to gauge whether a datum is an outlier.Caution must be used in handling outliers.Repeated tests are strongly suggested to verify the reliability of the outliers and reliable outliers imply a new mechanism.

Feature selection and data screening
A total of 184 data on FM steel oxidation in SCW are collected from the literature and provided in the Supplementary Information.Fifteen features are employed here and categorized into two groups, namely, alloying elements and testing conditions.The feature analysis is carried out within each of the groups.The SHAP values As expected, Figure 2A shows that the lower the value of 1  , the higher the SHAP value of  1  , meaning that in general, the oxidation of FM steels is severer at higher temperatures.In contrast, Figure 2A also shows that at a given temperature, the SHAP value of  also exhibits the anticipated trend, as illustratedin Figure 2B, where the larger the value of ln   0 , the higher the SHAP value of  ln   0 .This result is logical because a longer exposure time increases the oxidation weight gain.Furthermore, the variation in  ln   0 at a given value of ln   0 is induced by other features and will be studied later.
The SHAP value of  DOC behaves interestingly, as shown in Figure 2C, where at the deaerated condition (DOC < 10 ppb), the SHAP value of  DOC changes greatly from −0.84 to 0.17.Once DOC > 10 ppb, the higher the DOC, the larger the SHAP value of  DOC .The shape of the SHAP values of  DOC suggests a logarithm function of  (DOC+𝑟) (DCO) 0 .The result is rational in thermodynamics and kinetics because a higher DOC results in a stronger driving force for oxidation.Figure 2D  and  DOC , ranging from −0.013 to 0.009 except for one point at −0.03 with a pressure of 30 MPa.This indicates that between the pressure range of 20.7-31 MPa, the pressure of SCW has a negligible contribution to the oxidation of FM steels in SCW.
It is somewhat surprising that vanadium plays the most important role among the studied 11 alloying elements in the oxidation of FM steels in SCW, as shown in Figure 2E, where the SHAP value of  V statistically decreases from 0.22 to −0.15 as the V content increases from 0 to 0.3 wt.%.When the V content is higher than 0.19 wt.%, the SHAP values of  V are all negative regardless of the great variation, thereby indicating that the V content should be higher than the critical value in anti-oxidation FM steels in SCW.The performance of silicon in the oxidation of FM steels in SCW is similar to that of vanadium.Figure 2F shows that the critical Si concentration is 0.38 wt.%, beyond which the SHAP values of  Si are all negative.However, for 0.5 wt.% Si, a few SHAP values of  Si appear as outliers, which will be analyzed later.Although chromium is the basic element of FM steel against oxidation, its anti-oxidation performance is unsatisfactory with a SHAP value of  Cr > 0 if its content is less than 9.5 wt.%, as shown in Figure 2G.Nickel has high corrosion resistance to acids and alkalis at higher temperatures [45] and thus alloying Ni with a content higher than 0.13 wt.% into FM steels enhances the oxidation resistance of FM steels in SCW, as shown in Figure 2H.Manganese might be detrimental to the oxidation of FM steels in SCW when its content is high.Figure 2I indicates that the SHAP value of  Mn > 0 if the Mn concentration is higher than 0.9 wt.%.
The feature importance defined in the SHAP method (see Methods), ℜ   , is calculated for all 15 features and plotted in Figure 2J.The three features of the testing conditions, 1  , ln   0 and DOC, rank as the top three in the feature importance, indicating that the oxidation environment and time play the most important roles in the oxidation of FM steels in SCW.In contrast, the influence of SCW pressure in the range of 20.7-31 MPa on the oxidation of FM steels is trivial.In general, the feature importance of alloying elements is lower than the three features of the testing conditions, as shown in Figure 2J.
The feature selection is then conducted by the sequential backward selector wrapped with Xgboost and 10-CV, which yields the three testing features of 1  , ln   0 and DOC, and five alloying elements of V, Si, Ni, Cr and Mn, i.e., the top eight in the feature importance order.It seems that the elements of V, Si and Ni play more important roles than Cr based on the feature importance ranking.As described later in the pure SHAP value analysis, Cr is the major element when its concentration is higher than 11 wt.% against FM steel oxidation in SCW.Furthermore, the errors   between the response   and the predicted response ŷ , i.e.,   =   − ŷ ( = 1, • • • , ), are calculated for the original 184 data and follow a normal distribution with a zero mean and standard deviation of  = 0.35, as shown in Figure S2 in Section 2.3 of the Supplementary Information.However, there are six errors located outside of ±3 of   , meaning that the associated six data are outside the 99.7% confidence interval.These six data are removed from the following analysis and highlighted in the FMO dataset, and hereafter, only the surviving 178 data are used to explore analytic formulas for the oxidation of FM steels in SCW (the details regarding the outliers are given in Section 2.4 of the Supplementary Information).With the selected eight features, the Xgboost model predicts the values of ln Δ Δ 00 for the 178 data.The predicted values are plotted against the real values in Figure 2(K), showing an excellent prediction with a Pearson correlation value of  = 0.988, representing a slight enhancement from the original  = 0.986 with all features.

Pure SHAP values and oxidation Cr equivalent
The SHAP value of each feature is decomposed into its pure SHAP value and the interaction SHAP values, as stated in Eq. (1b). Figure 3A-H   E-I indicates that for a certain feature value, the pure SHAP value scattering is much smaller than the SHAP values.This is an expected result because a pure SHAP value eliminates all interaction SHAP values from its parent SHAP value.If pure SHAP values are calculated from a perfect model of a single function of variables, the pure SHAP value of a feature will correspond to the feature value one-to-one, i.e., for one feature value, there is only one pure SHAP value. are almost the same such that the data are condensed to one point, as shown in Figure 3A.The near perfect result implies that the oxidation of FM steels in SCW is indeed a thermally active process and the pure SHAP value of feature 1  catches the oxidation mechanism.Figure 3B illustrates the pure SHAP values of feature ln   0 , where there are 36 values of ln   0 and the maximum of the pure SHAP values is 0.41 at There are a few reasons causing multiple pure SHAP values at a given feature.The first reason might be experimental errors, which measure the degree of the experimental scattering of repeated tests.The second reason might be attributed to the Xgboost model, which approximately estimates the response from the input feature data rather than a perfect function.The third reason might be the method used to calculate the SHAP values from a tree-based algorithm (Tree-Explainer model).Figure 3C shows that the pure SHAP values of the DOC feature can be expressed by a logarithm function of 0.0682 ln (DOC+1) (DCO) 0 − 0.267, which is plotted by the dashed curve.The logarithm function of pure SHAP requires its argument to be larger than zero but beyond the restriction of a specific  value; hence  = 1 is arbitrarily assigned in the figure, and the real value of  for the oxidation in the deaerated SCW is determined in the following analysis.A similar trend is exhibited in the pure SHAP values of the element features.The outlier suspected points at 0.5 wt.% Si disappear in the pure SHAP values of the Si feature, thereby indicating they are caused by the interaction SHAP values.As expected, the critical values of 0.19 wt.% V, 0.38 wt.% Si, 9.5 wt.% Cr, 0.13 wt.% Ni and 0.9 wt.% Mn remain unchanged in the pure SHAP values.Figure 3D shows that the minimum pure SHAP value of V is −0.122 at its highest concentration (0.3 wt.%) in the dataset, while Figure 3F indicates that the minimum pure SHAP value of Cr is −0.186 at its highest concentration of 12 wt.%.The pure SHAP values based on the available data illustrate that Cr is still the basic alloying element for anti-oxidation FM steels in SCW.
From the pure SHAP value of an individual feature, we defined the joint SHAP value of two features as where [•] represents the real concentration in wt.% of an element.Figure 4A-D shows the joint SHAP values of two features,  CrSi ,  CrMn ,  CrNi and  CrV , respectively, and each of the joint SHAP values of two features contains 178 data.Using the linear approximation and the approach mentioned above, we can then obtain the oxidation Cr equivalent concentration Hereafter, we use one feature of the oxidation Cr equivalent concentration to replace the five element features.Thus, the total number of features is reduced to four, one chemical composition feature and three testing condition features.With the four features, the Xgboost model is retrained with 10-CV.The predictions on the 178 data are plotted in Figure 4E It is important to consider that the oxidation behavior of FM steels in SCW depends on both the composition and microstructure of the steels.The oxidation chromium equivalent concentration is based only on the chemical composition and therefore [ Cr] is a purely compositional representative, which will provide guidance in the design of anti-oxidation FM steels.

Activation energy and time exponents
The oxidation mechanism of FM steels in SCW is embodied in the activation energy and time exponent [29,46] , which are the coefficients of ln Δ Δ 00 versus features 1  and ln   0 , respectively.Figure S3 in Section 2.5 of the Supplementary Information shows the plots of ln Δ Δ 00 versus 1  and ln Δ Δ 00 versus ln   0 , respectively, indicating multiple values of ln Δ Δ 00 for a given value of 1  or ln   0 due to the contributions of other features.To investigate the effects of other features on the activation energy and time exponent, a novel "divideand-conquer" approach is developed based on a tree model and hence named TCLR.Maximizing the information gain in conventional tree classifiers determines the splitting feature and the associated splitting

value. TCLR is developed based on that the response 𝑦
are called the linear-features.The linear-feature set is a subset of the feature set ( 1 , • • • ,   ).A TCLR tree is a binary classification tree, where only the linear relationship between the response and one linear-feature is classified by using all features ( 1 , • • • ,   ) in splitting.In the present work, there are two linear-features of  ĵ=1 = 1  and  ĵ=2 = ln   0 and one response with four features.Thus, two TCLR trees are developed and in each of TCLR trees, the absolute change in Pearson correlation coefficient (Other metrics are embedded for nonlinear functions, as demonstrated in section 1 of SI), called the Linearity Goodness (LG), between function on nodes is taken as a generalized loss function.The split feature and split value are determined, from n data with m features by maximizing the gain of average LG in the consequent nodes.For example, if the correlation coefficients in a parent node and two child nodes are denoted by   ,  1 and  2 , respectively, the gain of average LG is defined by Δ|| = 1 2 | 1 +  2 | − |  | and the split feature and split value   are determined from The TCLR tree must be pruned, otherwise, many leaves contain only two data per leaf, which destroys the model generalization considerably.A threshold of data number might also be introduced to prune the tree, i.e., the amount of data in a leaf should not be smaller than a pre-set threshold (default minsize = 3).Another threshold for the LG is introduced to pre-prune the tree.When the LG in a node is equal to or higher than a pre-set threshold, the node will become a leaf.In general, as the value of the LG threshold increases, the model complexity of our procedure tends to improve and the tree tends to overfit.The opposite behavior occurs as the LG threshold decreases.In the bias-variance trade-off, a lower value of the LG threshold corresponds to a larger value of the number threshold of data.Thus, TCLR has the function of data screening by discarding data, which cannot be described linearly (functionally) for some of the resulting leaves in such a framework.
In the present work, the LG threshold is pre-set to be 0.90 for both ln Δ Δ 00 versus 1  and ln Δ Δ 00 versus ln   0 .The details of the TCLR algorithm are given in Supplementary Information.
The entire feature space is estimated from the regions of the four features, i.e., [ Cr] ∈ [10.38, 30.319] wt.%,  ∈ [673.15,973.15]K,  ∈ [30, 2000] h and DOC ∈ [0, 8000] ppb.The TCLR separates the entire feature space into many subdomains and each subdomain corresponds to one leaf, as shown in Figure S4, Table S2 and Table S3 in Supplementary Information.For ln Δ Δ 00 versus 1/, the results indicate that 17 leaves pass the LG threshold and two fail, while for ln Δ Δ 00 versus ln   0 , the leaves that pass and fail are 28 and 7, respectively.The amount of data in each passed leaf ranges from 3 to 30 for ln Δ Δ 00 versus 1/ and from 3 to 10 for ln Δ Δ 00 vs.
ln   0 , indicating that reliable values of activation energy Q and time exponent  will be statistically obtained by TCLR.The ideal case for the measurement of activation energy Q (or time exponent ) is that the tests are carried out at various temperatures (or times) for a certain FM steel under other fixed testing conditions.The TCLR realizes such ideal cases from the complex data by partitioning the whole feature space of FM steels into many small rectangle domains, with one domain for a passed leaf.The samples on each passed leaf may have the same feature or varied feature values, but the variation must be too small to influence the measured activation energy Q (or time exponent m).On each of the failed leaves, the feature values of samples vary greatly so that they will not follow the same mechanism of oxidation, or it is too narrow to cover enough data subject to the capacity of the FMO dataset.One passed leaf and one failed leaf are taken as examples to illustrate the typical cases, as shown in Figure 5A-D.5B, where the datum with  = 2000 h makes the LG not qualified.Although there are two data in Figure 5B that share the same feature values except for temperature, the two data do not allow to stay in a leaf because the threshold of data number is three.Figure 5D shows a failed leaf, where except for time, two data share the same feature values and another shares the same feature values.For the same reason that two data do not allow to stay in one leaf, the four data are restricted on a leaf with a low LG value.
The TCLR yields the values of activation energy Q and time exponent  as functions of the four features.The activation energy Q varies from 43.38   The SISSO algorithm [41] , with minimization of mean absolute percentage error (MAPE, see Supplementary Information) is carried out to find analytic expressions of activation energy Q and time exponent  as functions of the four features of (6d) The analytic formula of Eq. (5d) has strong predictive power, as shown in Figure 6, with a fitting performance of  = 0.91 and MAPE = 6.17%.There are three data that cannot be accurately predicted by the formula, which fall outside of the 3 error range, and the fitting performance will be improved to  = 0.94 and MAPE = 5.65% if these three data are excluded.The three steels share the same [ Cr] = 24.255wt.%, DOC = 10 ppb, testing to 600, 1000 and 1500 h and coincidentally arrive at the upbound of testing temperature (923.15K) on domain   .This may lead to the slightly underfitting of Eq. (5d) on them.In addition, the graphical user interfaces of the Xgboost prediction model were developed and are available at the end of this manuscript.

CONCLUDING REMARKS
In this study, we develop a domain knowledge-guided interpretive ML strategy and demonstrate it by the discovery of the generalized formula for FM steel oxidation in SCW.The domain knowledge suggests the generalized Arrhenius oxidation formula of Δ Δ 00 = Δ 0 Δ 00 exp −     0   (DOC+) (DCO) 0 , based on the fact that the oxidation is a thermally active process and the collected data are conducted at various temperatures.The generalized Arrhenius formula also includes a power law of exposure time to generalize all potential time dependences in the complicated oxidation process including diffusion and phase transformation.The developed TCLR algorithm is scientifically significant to materials informatics because it captures linear relationships between tasks and features.It is expected that when an original feature space is mapped to a high-dimensional space, the TCLR algorithm is able to capture linear relationships in the high-dimensional space.More affords are needed to further develop the TCLR algorithm.
The generalized Arrhenius oxidation formula has very high prediction accuracy with a Pearson correlation coefficient  of 0.91 and a MAPE of 6.17% validated by the 176 experimental data.It is expected that more experiments will be conducted and verify the generalized Arrhenius oxidation formula for oxidation of FM steels in SCW.

Figure 1 .
Figure 1.Domain knowledge-guided interpretive ML strategy.A: Feature selection with ℜ  and data screening through 3 criterion.B: Novel oxidation Cr equivalent of FM steels derived from joint contributions of elements.C: A prototype "divide-and-conquer" algorithm, TCLR, proposed for capturing the influences of features on time exponents and activation energy.D: Formulas of crucial terms derived by the SISSO under the constraint of prior domain knowledge.
all features and errors   =   − ŷ ( = 1, • • • , 184) are calculated with the Xgboost model evaluated by 10-CV.Figure 2(A)-(D) shows the SHAP values of the four testing condition features, while Figure 2(E)-(I) illustrates the SHAP values of the five alloying elements of V, Si, Cr, Ni and Mn.The SHAP values of the other six alloying elements are shown in Figure S1 in Section 2.2 of the Supplementary Information, where the blue points are calculated SHAP values and the red dashed lines indicate the SHAP value of  0 = 5.45, which serves as a reference baseline through the origin of the vertical SHAP value axis.

Figure 2 .
Figure 2. SHAP analysis of features.A-D: SHAP values of testing conditions, i.e., temperature, time, DOC and pressure.E-I: SHAP values of alloy compositions, i.e., V, Si, Cr, Ni and Mn.J: Feature importance ranking by ℜ  .K: Fitting effect of Xgboost model with selected eight features for 10-CV.
shows the pure SHAP values of the selected eight features and there are 178 pure SHAP values in each figure.A comparison of Figure 3A-H to the corresponding Figure 2A-C and Figure

Figure 3 .
Figure 3. Pure SHAP value analysis of important features.A-C: Pure SHAP values of testing conditions, i.e., temperature, time and DOC.D-H: Pure SHAP values of alloy compositions, i.e., V, Ni, Cr, Si and Mn.

Figure
Figure 3A shows almost ideal pure SHAP values of feature 1  .There are [0.90, 0.93], [0.83, 0.93], [1.06, 1.11], [0.12, 0.18], [−0.12, −0.09], [−0.40, −0.38] and [−1.84, −1.76] pure SHAP values at 1  = 0.123598, 0.130292, 0.137753, 0.14612, 0.15557, 0.166327 and 0.178681, respectively, and the pure SHAP values at a given value of 1 are almost the same such that the data are condensed to one point, as shown in Figure3A.The near perfect result implies that the oxidation of FM steels in SCW is indeed a thermally active process and the pure SHAP value of feature1  catches the oxidation mechanism.Figure3Billustrates the pure SHAP values of feature ln   0 , where there are 36 values of ln   0 and the maximum of the pure SHAP values is 0.41 at , showing a perfect fitting with  = 0.985.With the Xgboost model, we calculate the pure SHAP value of the oxidation Cr equivalent concentration,  [ C],[ C] and plot it in Figure 4F.The pure SHAP value  [ C],[ C] is almost linearly proportional to the oxidation Cr equivalent concentration [ Cr] with a correlation of  = −0.97. Figure 4F shows that when [ Cr] > 19.29 wt.%, the pure SHAP values  [ C],[ C] are all negative.

Figure 4 .
Figure 4. Joint SHAP value of two features and the derived oxidation Cr equivalent concentration.A-D: Joint SHAP value of two features, i.e., Cr with Si, Mn, Ni and V, respectively.E: Predicted values of Xgboost model versus experimental values with the transferred four features.F: Pure SHAP analysis of oxidation Cr equivalent concentration feature.

Figure
Figure 5A shows one passed leaf for ln Δ Δ 00 versus 1/ and the feature values, where four data all have the same DOC = 8 ppb, three data have [ Cr] = 17.27 wt.% and one datum with [ Cr] = 18.27 wt.%, two data with  = 40 h and two data with  = 50 h.The linear fitting yields the statistically reasonable activation energy  = 47.38 kJ/mol.Figure 5C shows one completely ideal passed leaf for the linear fitting of ln Δ Δ 00 versus ln   0 , yielding time exponent  = 0.412, where all five data have the same values of [ Cr] = 12.95 wt.%, DOC = 200 ppb and  = 823.15K.There are four data on the failed leaf shown in Figure5B, where the datum with  = 2000 h makes the LG not qualified.Although there are two data in Figure5Bthat share the same feature values except for temperature, the two data do not allow to stay in a leaf because the threshold of data number is three.Figure5Dshows a failed leaf, where except for time, two data share the same feature values and another shares the same feature values.For the same reason that two data do not allow to stay in one leaf, the four data are restricted on a leaf with a low LG value.

Figure 5 .
Figure 5. Data located on the leaf of TCLR.(A), (B) One passed leaf and one failed leaf on TCLR of activation energy.(C), (D) One passed leaf and one failed leaf on TCLR of time exponents.
The oxidation chromium equivalent concentration[ C] = [Cr] + 40.3[V] + 2.3[Si] + 10.7[Ni] − 1.5[Mn]is, for the first time, developed to quantitatively represent the contributions of alloying elements of Cr, V, Si, Ni and Mn to the oxidation of FM steels in SCW.

Figure 6 .
Figure 6.Predicted values of Eq. 5(d) versus experimentally measured values.Each dot represents an FM sample, and the dots on the dotted line indicate the equation predicted values are consistent with experimental observations.The light blue region covers 3 error range, where Eq. (5d) fitted extremely well on.

Table 1 . Fifteen features of FM steel oxidation data
with SHAP values in the present study to explore the oxidation mechanism of FM steels in SCW.For simplicity, we let  = ln Δ Δ 00 ,  0 = ln Δ 0 Δ 00 ,  1 = 1  ,  2 = ln   0 ,  3 = DOC and other features   (  = 4, • • • , 15).An Xgboost model is first developed with all features via ten-fold cross-validation (10-CV) and is used to calculate all SHAP values    , where subscripts "" and " " denote sample ( = 1, • • • , ) and feature   (  = 1, • • • , ), respectively.The mean |  | of the absolute SHAP values for each feature Δ Δ 00 = ln Δ 0 Δ 00 + ln  DOC+ (DOC) 0 −   +  ln   0 exhibits the additive behavior and is coherently integrated indicates that the magnitude of the SHAP values of  pressure is one or two orders smaller in comparison with  1  , ) which measures the joint contribution of two features.In general,    is a function of features   and   .As described above, the present industrial practice still employs Cr as the basic alloying element in manufacturing anti-oxidation FM steels in SCW.Therefore, Cr is taken as the basic feature   and V, Si, Ni and Mn are treated as the partner features   in the calculation of the joint SHAP value.A linear function is proposed to approximately estimate the interaction function, i.e.,    (  ,   ) ≈   0 +  0    +  0    .The two coefficients  0  and  0  represent the contribution weights of features   and   , respectively, to their interaction function.Then letting  0  ≡ 1 introduces an equivalent coefficient   =  0  / 0  for feature   and named it   equivalent coefficient of   .With the approach, we calculate the Cr equivalent coefficient for each of V, Si, Ni and Mn and then sum up the four Cr equivalent coefficients to yield the oxidation Cr equivalent concentration, [ Cr], to 139.18 kJ/mol in the region  Q = {[ Cr] ∈ [10.38, 30.319] wt.%, DOC ∈ [0, 8000] ppb,  ∈ [673.15,973.15]K and  ∈ [30, 2000] h}, where region  Q is the union set , , DOC and [ Cr], which yields Cr],  and DOC are all dimensionless as pre-divided by the reference values of 1 wt.%, 1 K and 1 ppb, respectively, and the MAPE = 12.97% for  SISSO and MAPE = 39.13% for  SISSO .It is interesting that feature  does not show up in both  SISSO and  SISSO , meaning that time does not statistically play an important role in activation energy Q and time exponent .As expected, the higher the oxidation Cr equivalent concentration [ Cr], the larger the activation energy and the smaller the time exponent.The effects of DOC and  on Q and  are not completely clear in Eqs.(5a) and (5b) and deserve more systematic investigations.As mentioned above, the pure SHAP values of the DOC feature suggest that the oxidation kinetic in logarithmic space yield in the form of ln Δ Δ 00 = ln Δ 0 Δ 00 + ln DOC+ (DCO) 0 −   +  ln   0 .The intersection of region  Q and region   ,   =  Q   , covers   = [ Cr] ∈ [10.38, 30.319] wt.%,  ∈ [673.15,923.15]K,∈ [30, 2000] h and DOC ∈ [0, 8000] ppb , which contained 176 data in the FMO dataset.Two data in [ Cr] = 17.0882 wt.% are excluded in the following analysis due to their high testing temperature  = 973.15K,which are out of the region   .With Eqs.(5a) and (5b), the values of −   and  ln   0 are estimated on the whole region   and consequently, the values of ln Δ 0 Δ 00 + ln DOC+ (DCO) 0 are obtained on the entire region   by using the experimental data ln Δ Δ 00 .Letting  = Δ 0 Δ 00 ( DOC+ (DCO) 0 ) and plotting  versus DOC under the condition of DOC → 0 yield slope Δ 0 Δ 00 and intersection Δ 0 Δ 00  (DCO) 0 , and hence the estimated value of  (DCO) 0 equal to 2.17.The calculation details are given in Section 2.8 of the Supplementary Information.To have an analytic expression of ln Δ 0 Δ 00 , the SISSO algorithm is conducted again, which yields − √  + DOC + 0.98([ Cr] − DOC  ) ln([ Cr] + DOC) + 8.543.