Machine learning-accelerated first-principles predictions of the stability and mechanical properties of L1 2 -strengthened cobalt-based superalloys

As promising next-generation candidates for applications in aero-engines, L1 2 -strengthened cobalt (Co)-based superalloys have attracted extensive attention. However, the L1 2 strengthening phase in first-generation Co-Al-W-based superalloys is metastable, and both its solvus temperature and mechanical properties still need improvement. Therefore, it is necessary to discover new L1 2 -strengthened Co-based superalloy systems with a stable L1 2 phase by exploring the effect of alloying elements on their stability. Traditional first-principles calculations are capable of providing the crystal structure and mechanical properties of the L1 2 phase doped by transition metals but suffer from low efficiency and relatively high computational costs. The present study


INTRODUCTION
Ni-based superalloys have been widely used in the aviation, aerospace and petrochemical industries due to their superior combination of highly desirable properties, such as microstructural stability, mechanical properties and oxidation and thermal corrosion resistance at elevated temperatures [1,2] . The signature coherent γ/γ' two-phase precipitate microstructure can maintain the strength of the superalloys under hightemperature conditions [3] . However, due to the limitation of the melting temperature of elemental Ni (1455 °C), the working temperature of current Ni-based superalloys is approaching the upper limit. Thus, it is highly desirable to search for a new class of superalloys that can meet the requirements of next-generation aero-engines with ever-increasing working temperatures. The melting temperature of elemental cobalt (Co) is ~40 °C higher than that of Ni. Therefore, Co-based superalloys are regarded as promising material candidates for next-generation aero-engine applications. Nevertheless, the high-temperature mechanical properties of conventional Co-based superalloys need to be significantly improved because the major strengthening phase is the carbide precipitate instead of the ordered γ' phase (also referred to as the L1 2 phase) that it is really exsist in the Ni-based superalloys. [4,5] . In 2006, Sato et al. first discovered the coherent γ/γ' microstructure in the Co-Al-W-based alloy, thereby opening up new avenues for alloy development [6] . However, the L1 2 phase is metastable and only exists within a narrow composition region [5,7,8] , necessitating the further development of this material. The alloying of transition metals (TMs) has been found to be effective in promoting the precipitation of the stable L1 2 phase [9][10][11] and increasing the solvus temperature and mechanical properties of the L1 2 phase at different levels [12,13] . Using this approach, Co-Al-Mo-X [14][15][16][17] , Co-Ga-W-X [18] , Co-Ge-W-X [19] , Co-Ti-Cr-X [20,21] and several other alloy systems have already been designed, and all of which have a stable γ/γ' two-phase microstructure.
Nevertheless, to explore the high-dimensional composition and temperature space through the alloying strategy, the traditional experimental methods based on trial and error are labor intensive and timeconsuming. In order to guide the design and discovery of new L1 2 -strengthened Co-based superalloys with enhanced mechanical properties, the basic information, such as the crystal structures and atomic occupancies, of the L1 2 phase are highly desirable, which is defined as the site occupied by a doped TM. Through structural optimization and static calculations based on first-principles calculations, the groundstate static energy of the L1 2 phase at 0 K can be accurately calculated and the stable formation enthalpy and reaction energy of the L1 2 phase can then be derived [22,23] . First-principles calculations can also be combined with Hook's law to predict the elastic constant of the supercell of Co-based superalloys, which allows for the prediction of the mechanical properties, such as the bulk, shear and elastic moduli [24,25] . However, the procedures of traditional first-principles calculations are tedious and require significant computational resources. In the case of a system with more than four elements, the number of nonequivalent sites for each element in the supercell will dramatically increase due to the increase in the types of elements, resulting in a significant increase in computational cost and a reduction in computational efficiency. Therefore, improving the computational efficiency to speed up alloy discovery requires an alternative approach [26] .
To date, there has been a push towards big data and artificial intelligence in materials research [27,28] . Machine learning (ML) is a type of algorithm that can acquire new knowledge "automatically" like human beings, mine the existing data, extract key information, establish a predictive model that describes the relationship between influencing factors and a target property and use the model to predict new materials of new unknown systems [26] . ML-based methods have been widely used for assisting the design and discovery of a wide class of materials, including alloys, ceramics and composites, polymers, two-dimensional materials, organic-inorganic hybrids, and so on [29,30] . Using ML algorithms, new materials with excellent performance have been developed successfully and efficiently. However, most of the data used to train the models are collected from experimental studies [31][32][33][34][35][36] . Only a few studies have relied on data from first-principles calculations to train ML algorithms. For example, Guo et al. made efforts to establish and train ML models using the formation energies and lattice constants obtained from first-principles calculations of the Co 3 (Al, X) (X = 3d, 4d or 5d TM, excluding Co and W) precipitate phase [37] . The structures of a new class of Co 3 (Al, WX 3 ) precipitate phases were predicted using the trained ML models, showing the potential of ML in the development of L1 2 -strengthed Co-based superalloys.
To overcome the limitations posed by the inherent low efficiency in predicting the crystal structure and mechanical properties of the L1 2 phase using conventional first-principles calculations, a ML-accelerated first-principles approach is proposed in the present work. First, ML algorithms are established and trained using the data provided by conventional density functional theory (DFT) calculations. A small number of predictions made by these ML models are then validated by the first-principles calculations and the resulting dataset is used for improving the ML models if necessary. Finally, the models are employed to predict the crystal structure and mechanical properties of the L1 2 phase. These predictions may provide a theoretical basis for the design and discovery of new L1 2 -strengthed Co-based superalloys. In particular, it is found that the efficiency of this ML-assisted method is twice as fast as that based on conventional firstprinciples calculations alone.

Iterative three-stage computations
In order to obtain the crystal structure and mechanical properties of the new L1 2 -strengthened Co-based superalloys more efficiently, ML algorithms are combined with first-principles calculations to predict the properties of the superalloys mentioned above in three steps.
Before attempting to use ML algorithms, it is necessary to conduct a detailed analysis of the first-principles calculations to determine the concept of establishing the ML models, as shown in Figure 1. First, the types of TM dopants contained in the supercells are assumed and the relaxed structures of the L1 2 phase and its competing D0 19 phase are calculated through relaxation optimization. Second, the occupation tendency of the TM dopants in the L1 2 and D0 19 phases is evaluated to determine the occupancy that is defined as the site in a supercell occupied by a TM dopant in these two phases. Third, the stabilities of the L1 2 and D0 19 phases are compared in terms of the stable formation enthalpy, followed by the calculation of the mechanical properties for the L1 2 phase if it is more stable than the D0 19 phase.
In this study, we propose a new type of approach for predicting the L1 2 phase crystal structure and mechanical properties based on ML algorithms in new Co-based superalloys in three steps, namely, occupied sites, stability prediction and mechanical property prediction, similar to the procedures of firstprinciples calculations mentioned above. Since the reaction energy and enthalpy of formation between different superalloy systems are incomparable numerically, the classification algorithm in ML should be selected to make a qualitative judgment rather than a quantitative prediction when predicting the occupancy of the doped TM atoms and the stability of the doped L1 2 and D0 19 phases.

Details of first-principles calculations
First-principles calculations are employed to generate data for training the ML model and verifying the ML model predictions, so as to improve the ML model iteratively. The details of the first-principles calculations are briefly summarized below. Generally, first-principles calculations can only deal with a completely ordered phase. If a completely ordered structure can be found and the correlation function of the structure is close to that of a disordered alloy, it is considered that the structure can reflect the configuration of the disordered alloy and the structure is used as the cell model of the disordered alloy in the calculation. The essence of the special quasi-random structure (SQS) method is to find a completely ordered structure to represent the disordered structure by matching the correlation function [38,39] . Therefore, we use the SQS method to construct 2 × 2 × 2 supercells of the Co-based superalloys and consider two types of structures for the Co-Al-W-, Co-V-Ti-, Co-V-Ir-, Co-V-Ta-and Co-Al-V-based systems, namely, the AuCu 3 and Ni 3 Sn prototype structures corresponding to the L1 2 and D0 19 phases, respectively [39,40] (see Figure 2 for the L1 2 and D0 19 structures). In addition, the Alloy Theoretic Automated Toolkit (ATAT) is used to identify the nonequivalent positions in the supercells [41] .
The Vienna Ab initio Simulation Package (VASP) is used to perform all the first-principles calculations with the projector augmented wave (PAW) method [42][43][44][45][46] and Perdew-Burke-Ernzerhoff (PBE) exchangecorrelation functional using the generalized gradient approximation (GGA) [23] . During the structural relaxation, the criteria for the convergence of energy and maximum force are set to be 10 -5 eV/atom and 10 -3 eV/Å, respectively. The kinetic energy cutoff is set to 450 eV. Spin polarization is considered during the calculations because of the presence of the ferromagnetic Co. The Brillouin zones are sampled using 7 × 7 × 7 and 5 × 5 × 9 Γ-centered grids for the L1 2 and D0 19 structures, respectively, which balance the computational accuracy, efficiency and cost.

Reaction and stable formation energies of L1 2 and D0 19 structures
Determining the occupancy of the TM dopants in the L1 2 phase is a vital prerequisite for obtaining an accurate atomic configuration. The occupancy of an alloying element can be evaluated using the binding [23] and formation energies of the impurity [47] . Each system calculated contains three main elements, each of which is designated according to the name of the alloy system. For instance, Co, Al and W are the main elements #1, #2 and #3 in the Co-Al-W system, respectively. In order to discover the role played by each TM element, the reaction energy of the 3d, 4d or 5d TM element occupying sites #1, #2 and #3 in the supercells of each system is calculated as follows [48,49] : where represents the energy of Co 3 (X, Y), denotes the energy of TM-doped Co 3 (X, Y) and µ i and µ TM represent the chemical potential of the i th main and TM elements, respectively. The doping elements are energetically favorable to occupy the position(s) with the lowest reaction energy. Under Co-rich conditions, µ Co denotes the energy of Co in the ground state [50] . Since we choose Co, CoAl, Co 3 W, Co 3 Ti, CoV 3 and Co 3 Ta as reference compounds,µ AL , µ w , µ Ti , µ Ta andµ V are calculated from the following relationships, respectively: The stability of the L1 2 phase is then evaluated by comparing the stable formation enthalpy ΔH S of the TMdoped L1 2 and D0 19 phases, which can be calculated as follows [49,51] : where µ j is the chemical potential of element j.

Elastic properties from first-principles calculations
Elastic properties, such as the bulk (B), shear (G) and elastic moduli (E), can be calculated from the elastic constants, which can be computed according to the stress-strain energy curve method [52][53][54][55][56] . The calculation methods are presented in the Supporting Information (SI).

Dataset
The data for the L1 2 phase in the new Co-based superalloys with TM alloying elements are first generated by first-principles calculations. A total of 61 data from the Co-Al-W-, Co-V-Ti-and Co-V-Ir-based systems are collected for constructing a training set, which are all included in Supplementary Table 1 [49,57] . The characteristics of the data are described briefly as follows: (1) The microscopic characteristics of the elements are used to replace the names of the main and doping elements, including the melting point, boiling point, density, atomic weight, atomic radius, covalent radius, electronegativity and first ionization energy; (2) For the occupancy prediction model, the microscopic characteristics of the main and doping elements are set as X and the occupancy of the doping elements are set as Y in the occupied site prediction models; (3) For the L1 2 phase stability prediction model, the microscopic characteristics of the main and doping elements and the occupancy of the doping elements are set as X and the L1 2 phase stability is set as Y in the stability prediction models; (4) For the mechanical properties of the L1 2 phase prediction model, the microscopic characteristics of the main and doping elements, the occupancy of the doping elements and the L1 2 phase stability are set as X and the mechanical properties are set as Y in the mechanical property prediction models of the L1 2 phase.
There are two research routes of choice: Route I: Predict C 11 , C 12 and C 44 and then calculate the elastic properties, including B, G and E, according to Eqs. (1)-(10) in the SI; Route II: Predict elastic properties, including B, G and E, directly.

ML model selection and performance evaluation
According to the "no free lunch" theory [58] , no algorithm can be applied to all situations, i.e., one algorithm (algorithm A) outperforms another (algorithm B) on a specific data set and therefore algorithm A will be inferior to algorithm B on another specific data set. As a result, a variety of ML algorithms are first employed to predict the crystal structure and mechanical properties of the L1 2 phase, followed by a model performance evaluation and comparison. The algorithm with the best performance is selected for making predictions.
Random forest classification, gradient boosting classification (GBC), AdaBoost classification, a support vector machine, an artificial neural network (ANN), K-nearest neighbor classification and Gaussian process classification are selected to establish the classification models. In contrast, regression models are established using random forest regression, gradient boosting regression, AdaBoost regression, support vector regression, an ANN, K-nearest neighbor regression and Gaussian process regression.
All the ML algorithms are run through Python 3.0 and the sklearn package is used to carry out the calculations. All calculations are performed using a PC (Microsoft Windows 10, Intel Core (TM) i7-10875H, CPU 2.30 GHz, 16 GB of RAM).
The performance of the various ML algorithms mentioned above is compared using the K-fold crossvalidation method. Since the test results of the K-fold cross-validation do not depend on the training set, the occurrence of overfitting can be avoided. The original data set is randomly divided into K equal subsets. One of the subsets is used as the test set, while the remaining ones consist of a new training set. Each subset should be used as a verification data set in turn, i.e., the above process is repeated K times. In this study, K is set to be ten [59,60] .
The performance of a classification model is quantified by the so-called "accuracy", which is the ratio of the total number of samples divided by the number of correct predictions, defined as: where n t and n a represent the total number of samples and the number of correct predictions, respectively. The criteria of accuracy need to be higher than 85%.
In this study, a principal component analysis (PCA) algorithm is also employed to reduce the dimensionality of the data. PCA is a statistical process that uses orthogonal transformation method to convert a series of observations of possible related variables into a set of linear independent variables referred to as principal components. A new feature vector is defined by the following linear transformation: where W T is a matrix with orthonormal columns and has fewer rows than . The first three principal components are used to represent most of the information contained in more than 25 features [60,61] .
Several accuracy metrics, such as the coefficient of determination R, R 2 , mean absolute error (MAE) and root mean squared error (RMSE), were evaluated for the ML algorithms [26,60] : where Y and denote the true and predicted values of the targeted properties, respectively, n is the number size of the data, R value falls between (-1,1) and thus R 2 value falls within (0,1). The closer the value of R is to 1, the better the performance of the model prediction. MAE reflects the true error, while RMSE is more sensitive to outliers. A larger MAE value or a smaller RMSE value indicates that the model is underfitting. The criteria of the R value need to be higher than 0.90, while the MAE and RMSE values are lower than 7.50 and 10.00, respectively.
We evaluated the importance of the features with the relative importance (I r ) to measure the impact of these features on the occupancy of each doping element and the stability and mechanical properties of the L1 2 structure and it is given by: where I T is the importance of the feature calculated by the model and I max is the highest importance calculated by the model among all the features. The values of Ir lie between 0 and 1.

Iterative ML model improvement
The performance of the selected ML algorithms is then iteratively improved through the interaction with the first-principles calculations. First, the selected algorithm is used to predict the target properties for a small amount of randomly chosen input data. Second, the predictions are verified using first-principles calculations. Third, if the accuracy of the models does not meet the requirements, the new data will be used as an additional dataset for re-training the ML model. The procedures above are repeated until the predefined precision is met. The improved models are then employed to predict all the remaining data (the workflow is schematically shown in Supplementary Figure 1).

Predicting dopant occupancy and stability of L1 2 structures
The occupancy of a TM dopant may significantly influence both the stability and mechanical properties of the L1 2 phase in Co-based superalloys [62] . In new Co-based superalloys, the D0 19 phase usually competes against the L1 2 phase [49] . The performance of various ML algorithms for predicting the dopant occupancy and stability of the L1 2 structures are evaluated using 10-fold cross-validation and the results are shown in Figure 3. The gradient boosting algorithm is found to have the highest accuracy (reaching 88.52% and 93.44% for occupancy and stability predictions, respectively) and is thus selected for predicting these two properties. The PCA classification results regarding the effect of TM dopant occupancy and the stability of L1 2 are shown in Figure 3 and their interpretation degrees are 92.05% and 93.44%, respectively. All the parameters of the ML algorithm are shown in Supplementary Table 2. Selection between two routes: Figure 5 compares the performance of the two routes. It can be found that the precision of C 12 is relatively low and its highest R value only reaches 0.6628 in route I. The error of C 11 is relatively large and its lowest MAE and RMSE values are 7.5720 and 10.110, respectively, which are much larger than those of other mechanical property prediction models. The value errors of B, G and E were calculated based on the predicted C 11 , C 12 and C 44 values using Eqs. (7)-(12) will be further enlarged. Therefore, the prediction results of C 11 , C 12 and C 44 are not discussed below.

Feature importance
The relative importance of different features on the dopant occupancy, stability of the L1 2 structures and the mechanical properties of the L1 2 phase are extracted from the gradient boosting classification and AdaBoost regression models, as shown in Figure 6. The names of the features are too long to be directly reflected in the figure and we therefore use codes to represent the full feature names, which are provided in Supplementary Table 3.
The first ionization energy and electronegativity quantify the attraction between atoms and affect the distortion of the supercell, and are thus capable of evaluating the occupancy of a dopant in the supercell [63] . The covalent radius of a dopant affects the stability of the supercell [62] . The melting and boiling points of a dopant and the mechanical properties (such as bulk, shear and elastic moduli [64] ) are correlated. It can be seen from Figure 6A that the values of relative importance for the electronegativity and the first ionization energy of the dopant are the highest, indicating that these two features predominantly determine the occupancy of the doped atom. Similarly, Figure 6B shows that the covalent radius and the first ionization  energy of the dopant determine the stability of the L1 2 phase. Figure 6C indicates that the melting and boiling points of the dopant have the greatest influence on the mechanical properties, including the bulk, shear and elastic moduli.

Application of ML models for Co-based superalloys
The L1 2 phase exists at high temperatures in the Co-Al-W-, Co-V-Ti-and Co-V-Ir-based systems [1,6,65] . Building a new alloy system based on the properties of the major alloying elements is highly desirable. Ta can increase the L1 2 solvus temperature, while V can improve the strength of the alloy [66][67][68] . Herein, the trained ML models are employed to predict the crystal structure and mechanical properties of the L1 2 phase in new alloy systems containing V and Ta elements, such as the Co-V-Ta-and Co-Al-V-based systems. The prediction precision of the ML models without information for the Co-V-Ta-and Co-Al-V-based systems is usually low, so it is necessary to modify the models. The ML model modification precision is shown in Table 1.

Co-V-Ta-and Co-Al-V-based systems
A rule is established where each round of random calculation verifies three data points for evaluating the model performance. In order to verify the prediction capability of the model for an unknown system, the calculated results of the Co-V-Ta-based system are added to the previous trained models as a new training set and the optimized models are used to predict the new Co-Al-V-based system. Through one round of iteration, the accuracy of the ML model for predicting dopant occupancy in the Co-V-Ta-based system is improved from 66.67% to 100%. The accuracy of the prediction in the Co-Al-V-based system reaches 100%, i.e., the model does not need to be modified. In addition, in order to verify the generalization ability of the ML model, we use first-principles calculations to compute the rest of the data that have not yet been verified. The results are compared with those predicted using the improved ML model. The results show that the prediction accuracy is improved from 80.00% to 95.00% for the Co-V-Ta-based system after only one-time model optimization. The accuracy of the Co-Al-V-based system is 95.24%. The PCA classification effect of the model is shown in Figure 7. The interpretation degrees of the Co-V-Ta-and Co-Al-V-based systems are 88.37% and 88.51%, respectively.
The accuracy of the ML model for predicting the L1 2 phase stability in the Co-V-Ta-based system is improved from 66.67% to 100% through a one-round iteration. The accuracy of the prediction in the Co-Al-V-based system reaches 100%, i.e., the model does not need to be modified. As before, we use firstprinciples calculations to compute the rest of the data that have not yet been verified. The verified results  show that the accuracy of model prediction in the Co-V-Ta-based system after one round of iteration is improved from 70.00% to 95.00%. The results show that the model predictions in the Co-Al-V-based system are all correct. The display effect of the PCA classification effect of the models is shown in Figure 8. The interpretation degrees of the Co-V-Ta-and Co-Al-V-based systems are 88.37% and 89.12%, respectively. It can be found that the modified gradient boosting algorithm is capable of making accurate predictions for both the occupancy of TM dopants and the stability of the L1 2 phase for both the Co-V-Ta-and Co-Al-Vbased systems.
The iterative processes for improving the accuracy of the ML for predicting the mechanical property L1 2 phase are shown in Supplementary Figure 4. It can be found that the accuracy of model prediction is significantly improved.
The optimization processes of the ML models for predicting the mechanical properties of the L1 2 phase in the Co-V-Ta-and Co-Al-V-based systems are shown in Supplementary Figures 5 and 6  prediction model is relatively large before the models are modified. In order to verify the generalization ability of the ML model, we calculate all the remaining data and verify the ML prediction results after two rounds of modification. Figure 9 shows the overall prediction results of the modified mechanical performance models of the Co-V-Ta-and Co-Al-V-based systems and their model performances are shown in Figure 10. For the Co-V- Ta Compared with the Co-V-Ta-based system, the modified Adaboost regression models have better prediction performance for the Co-Al-V-based system, which further proves that the ML model is capable of predicting the crystal structure and mechanical properties of the L1 2 phase in new Co-based superalloys.

Comparison of time cost and mechanical properties
It takes about two days for traditional first-principles calculations to compute a data point, while establishing a ML model requires five days. However, it takes less than a minute for the trained ML models to predict the calculation results. By comparing the calculation amount and time between the modified ML models and the traditional first-principles calculations, we find the prediction method based on ML algorithms can improve the calculation efficiency by more than double using the modified ML model, as shown in Table 2.
Comparison of the predicted B, G and E values for the Co-V-Ta-X and Co-Al-V-X systems with those for previous Co-Al-W-X and Co-V-Ti-X systems are shown in Figure 11. It can be seen that the mechanical properties of the Co-V-Ta-X and Co-Al-V-X system are generally higher than those of Co-Al-W-X and Co-V-Ti-X systems, except for the cases with Y, Zr and Re dopants. Using ML algorithms combined with firstprinciples calculations, two new systems (Co-V-Ta-X and Co-Al-V-X) with better mechanical properties than the previous systems are successfully and efficiently proposed.

SUMMARY
This work aims to address the challenges encountered by the traditional experimental approaches and firstprinciples calculation methods for the discovery of new Co-based superalloys (strengthened by L1 2 ordered precipitates), both of which are inefficient, time-consuming and labor-intensive when used alone.  A new approach is proposed that combines machine learning (ML) and first-principles calculations to speed up the prediction of crystal structure, phase stability and mechanical properties for systems, such as Co-V-Ta-and Co-Al-V-based alloys. This information is critical for developing new Co-based superalloys with superior properties at elevated temperatures. ML models are established and trained for predicting the site occupancy, phase stability and mechanical properties. Through iterative interactions between model predictions and validations using first-principles calculations, the ML models are further improved. Finally, the refined models are used to make accurate predictions for the crystal structure and mechanical properties for Co-V-Ta-and Co-Al-V-based systems.
The combination of ML and first-principles calculations may shed light on the rapid prediction of crystal structure and mechanical properties of other advanced materials beyond Co-based alloys.  . Mechanical property comparison of predicted Co-V-Ta-X and Co-Al-V-X systems with data in Ref. [57] for Co-Al-W-X and data in Ref. [49] for Co-V-Ti-X system: (A) bulk modulus; (B) shear modulus; (C) elastic modulus.

Author's contributions
Project conception: Liu X, Wang C Calculation task: Xi S, Yu J Analysis: Xi S, Yu J, Bao L Investigation: Xi S, Yu J, Bao L, Chen L, Li Z, Shi R Draft Preparation: Xi S, Yu J, Shi R Supervision: Liu X