Structure-property modeling scheme based on optimized microstructural information by two-point statistics and principal component analysis

Construction of the structure-property (SP) relationship is a central task in materials development. Optimizing microstructural information is a necessary and challenging step in understanding and improving this linkage. To solve the problem that small experimental microstructures usually fail to represent the structure of the entire sample, a data-driven scheme integrating two-point statistics, principal component analysis, and machine learning was developed to construct a representative volume element (RVE) set from the small microstructures and extract optimized structural information. Using the proposed quantitative metrics and method, such an RVE set was successfully constructed on an experimental microstructure dataset of ferrite heat-resistant steels. Moreover, to remove the redundant information included in two-point statistics, the critical threshold of the tolerance factor related to the coherence length in microstructures was determined to be 0.005. An accurate SP linkage was finally established (mean absolute error < 6.28 MPa for yield strength). The scheme was further validated on two other datasets, one simulated and one experimental, which indicated that it offers a sounder scientific basis, higher reliability, and greater generality than traditional strategies. The scheme has promising applications in microstructure classification, property prediction, and alloy design.


INTRODUCTION
A great acceleration in target prediction and alloy design can be realized through materials informatics including multitudinous advanced data-driven technologies and theories related to materials science, which has received extensive attention in recent years [1][2][3] . Establishing process-structure-property (PSP) linkages represents a recognized core task for achieving this ambitious goal. Indeed, significant efforts have been made to pursue such an accurate and universal linkage [3][4][5] .
Traditionally, materials development has largely relied on a mix of Edisonian approaches and serendipity: empirical process-property (PP) relationships are extracted from existing experimental data, and microstructural features are then investigated to understand and explain the dominant mechanisms behind the expected or serendipitous results [6]. Using an informatics strategy that differs from traditional experimental methods guided only by physical metallurgy knowledge, we previously demonstrated that microstructural information improves the prediction of the hardness of austenite steels by comparing PP and PSP linkages [7]. Similarly, Molkeri et al. proposed a novel microstructure-aware framework for materials design and rigorously confirmed the importance of microstructure information in alloy design [8]. The focus of attention on microstructure has thus gradually shifted from providing scientific explanations to practically promoting the forward and inverse processes of PSP. Therefore, it is necessary to extract microstructure information quantitatively [9][10][11][12][13]. Generally, physical parameters based on statistical averages (phase fraction, grain size, etc.) are used to characterize the microstructural features of materials, which nevertheless ignores the correlation and heterogeneity of these features in terms of spatial distribution. In addition, if the entire discrete and highly nonlinear micrograph data are used as part of the input of a PSP linkage, one risks the curse of dimensionality, because some common dimension reduction algorithms are inapplicable [14]. Therefore, a challenge to be addressed is how to extract sufficient and effective microstructure information, including the correlation and heterogeneity of spatial features, in a quantitative and low-dimensional manner.
Some admirable efforts have been made to overcome this challenge. Sangid et al. used crystal plasticity simulations to identify the stress concentration around pores of various sizes and to quantify the smallest pore size that results in a debit in the fatigue performance of IN718 alloy [15]. Zinovieva et al. proposed a multi-physics methodology incorporating physically based cellular automata to simulate the grain structure evolution [16]. They successfully uncovered the effects of the scanning pattern on the microstructure and elastic properties of 316L austenitic stainless steel prepared by powder bed-based additive manufacturing. Although these two works used advanced simulation methods to investigate the PSP relationship, the high computational cost and the high-dimensional data analysis process were not completely avoided. Popova et al. developed a data-driven workflow and applied it to a set of synthetic AM microstructures obtained using the Potts-kinetic Monte Carlo (kMC) approach [17]. They finally correlated the process parameters in the kMC approach with the predicted microstructures. The chord length distribution method used in their workflow quantifies the grain size and shape distributions and their anisotropy in a microstructure. However, other important microstructural features, such as the volume fraction of the phase of interest, cannot be extracted by this method [18][19][20][21]. Fortunately, a rigorous quantitative framework based on the $n$-point statistics method has been developed to capture the statistical information of a microstructure [22][23][24][25]. As the basis of the $n$-point statistics, the one-point statistics reflect the probability density (i.e., volume fraction) of finding a specific discrete local state of interest at any randomly selected single point (or voxel) in a microstructure.
Two-point statistics, a higher-order measurement, capture the probability of finding an ordered pair of specific local states at the head and tail of a vector that is randomly thrown into a microstructure. These statistics have been proven to contain unbiased and complete structural information [26,27]. However, an experimentally obtained microstructure with a small size always includes limited structural information and cannot be used as a representative volume element (RVE); it thus can hardly be associated with the macroscopic mechanical properties. One compromise is to combine chemical compositions and experimental conditions to fill the gaps [7,9]. Unfortunately, this indirect strategy cannot fundamentally eliminate the statistical error caused by the small size of microstructures. Another surrogate solution is to approximate the statistics of an RVE by averaging those of multiple subdomains of the entire sample, based on the assumption of statistical homogeneity. It is conceivable that more structural details will be captured by these subdomains [a single domain is referred to as a statistical volume element (SVE) in this study] at a higher resolution. Niezgoda et al. proposed a novel concept called the RVE set, consisting of a certain number of SVEs with the minimum size [28]. By assessing the convergence of a quantitative metric (the root mean square error between the individual statistics of an SVE and either the two-point statistics of an a priori RVE or the average statistics of all SVEs), they successfully constructed an optimal RVE set in which the distribution and dispersion of structural features match those of the entire material sample. This scheme saves time and computing resources when predicting mechanical properties by finite element analysis. Nevertheless, it is not applicable to establishing PSP linkages in a real experimental situation without an a priori RVE.
Existing reports have thus relied on intuition when averaging the statistics of several SVEs to extract the maximum amount of structural information [29][30][31][32]. Therefore, it is necessary to explore new and universal methods for optimizing microstructure information to construct an RVE set and thus build a more reliable PSP linkage.
The effective information of two-point statistics is compressed in the central area after a centering transformation, so that the statistics of an RVE contain a great deal of redundancy in the area where the vector $r$ has a large length [7,33,34]. Determining the boundary between these two areas helps in analyzing and understanding structural features, especially those related to length scales such as the average grain size. Through an example of Al-alloy matrix composites, Tewari et al. found that numerous length parameters that characterize the spatial heterogeneity and clustering of SiC particles can be extracted from two-point statistics [35]. Niezgoda et al. further defined a concept named the coherence length, $r_c$, which is mathematically expressed as

$$\left| \left\langle f_r^{hh',(k)} \right\rangle - \left\langle f^{h} \right\rangle \left\langle f^{h'} \right\rangle \right| \le \varepsilon, \qquad |r| > r_c \tag{1}$$

where $f_r^{hh',(k)}$ represents the two-point cross-correlation statistics for the two local states $h$ and $h'$ of the $k$th member in an RVE set [22], $f^{h}$ and $f^{h'}$ are their one-point statistics, namely volume fractions, and $\langle\cdot\rangle$ denotes the ensemble average operation. The statistics in the area where the length of $r$ is longer than $r_c$ are considered redundant information. The value of $r_c$ obviously changes if $\varepsilon$ is of a different magnitude, leading to an inaccurate measurement of the length scale associated with the structural features of interest. High-dimensional redundant data may introduce unnecessary noise and impede the modeling of the PSP linkage. However, there is no accurate reference value for the tolerance factor $\varepsilon$, and existing studies truncate the redundancy based on intuition [14,30,34,36,37]. Thus, determining the threshold of $\varepsilon$ is also one of the important issues in optimizing microstructural information.
Principal component analysis (PCA) [38], a popular dimensionality reduction algorithm, can effectively address the above challenge of high-dimensional data and has been widely applied in many fields, such as grain coarsening [39], microstructure evolution during creep [31], and nonmetallic inclusions in steels [37]. Interestingly, one can project the statistical features of microstructures into a PCA space and compare them using common distance metrics such as the Euclidean distance [40][41][42][43], which provides a potential solution for optimizing microstructural information by constructing an RVE set and removing redundancy. More importantly, the low-dimensional features of microstructures obtained by PCA can be input into machine learning (ML) models to establish high-fidelity PSP linkages [13,44,45,46,47,48].
In the present study, we developed a new scheme to build a more reliable structure-property (SP) linkage by optimizing microstructural information, which can be extended to a higher-order PSP linkage in the future.
Taking ferrite heat-resistant steels as an example, we performed a series of experiments and built a small dataset. This kind of steel has become one of the main materials for the heavy and thick components of advanced ultrasupercritical (A-USC) power plants due to its high thermal diffusivity and low cost [49]. Significant efforts have been made to understand the SP linkage and improve the mechanical properties of these steels at a high temperature (650 °C) [50][51][52][53]. Using PCA and two-point statistics, we propose a new method and metric to construct an RVE set from the small SVEs of the steels. We also explored the effect of different redundancy-truncation levels of two-point statistics on the established ML model and determined the acceptable threshold of the tolerance factor $\varepsilon$. The reliability and generalization ability of this scheme were also verified on two other datasets: experimental data of Ni-Fe-based superalloys collected by Zhong et al. [54] and data simulated by the phase field method (PFM).

Materials preparation
Five alloys were prepared by smelting raw metals with purity higher than 99.99%, followed by casting into ingots of ≈ 40 g. The chemical compositions of the alloys are listed in Table 1. The samples were homogenized for 16 h at 1100 °C with subsequent air cooling. Hot rolling was then performed at 1100 °C five times, each time holding for 10 min (60% final deformation). Heat treatment was achieved by austenitizing at 1100 °C for 0.5 h with subsequent air cooling. The samples were then aged at 750 °C for 12 h, followed by air cooling. The microstructures of these alloys were characterized by optical microscopy (OM, Olympus P4000). High-temperature tensile tests at 650 °C were performed on a TSMT EM6.504 universal testing machine with a strain rate of $10^{-3}$ s$^{-1}$. Because the heat treatment was performed at 750 °C and no phase transition occurred at 650 °C, the microstructures could remain stable at 650 °C for a long time. Thus, the microstructures at room temperature were used to establish the linkage with the yield strength at 650 °C.

Data preprocessing
All microstructures were binarized by an image processing technique named Otsu's threshold processing [55]. This technique is a nonparametric and unsupervised method of automatic threshold selection for picture segmentation. It selects a threshold automatically from the gray-level histogram: the threshold is the pixel value that maximizes the between-class variance of the foreground and background pixels [56]. Moreover, the discrete microstructures were transformed to a uniform scale using the transform.rescale(·) function in the scikit-image library to ensure that one pixel corresponds to an actual size of 0.6504 μm, as shown in Figure 1. It can be seen that the austenite phase and ferrite phase in the original microstructure are well separated into black and white pixels, and the noise points (gray texture) are completely eliminated.
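The thresholding step described above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `otsu_threshold` and `binarize` are hypothetical, 8-bit grayscale input is assumed, and in practice the library routine `skimage.filters.threshold_otsu` would normally be used instead.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the 8-bit gray level that maximizes the between-class variance.

    Assumption: `gray` holds integer gray levels in the range 0..255.
    """
    hist = np.bincount(np.asarray(gray, dtype=np.uint8).ravel(), minlength=256)
    p = hist / hist.sum()                 # gray-level probabilities
    levels = np.arange(256)
    w0 = np.cumsum(p)                     # weight of the class "<= t"
    w1 = 1.0 - w0                         # weight of the class "> t"
    mu_cum = np.cumsum(p * levels)        # cumulative first moment
    mu_total = mu_cum[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (mu_total * w0 - mu_cum) ** 2 / (w0 * w1)
    # Thresholds that leave one class empty give 0/0; treat them as zero variance.
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(between))

def binarize(gray):
    """Boolean image: True where a pixel is brighter than the Otsu threshold."""
    return np.asarray(gray) > otsu_threshold(gray)
```

For a micrograph loaded as a 2D `uint8` array, `binarize(image)` returns the two-phase mask that the later statistics operate on; which phase maps to `True` depends on the image contrast.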
To eliminate the impact of differences in data magnitude on the performance of the model, all of the microstructural features (low-dimensional representations of the microstructures) were normalized before establishing the SP linkage with an ML model, using $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ denote the mean and variance, respectively.

Extracting microstructural information
The circular sample, as displayed in Figure 2A, may rotate during OM characterization, leading to inconsistent statistics of the same microstructure in different reference frames. To filter out the dependence of the statistics on the observer reference frame, we employed rotationally invariant two-point statistics (RI2SS) to capture the important structural details [23]. For a certain local state $h$ (ferrite phase, $h = 0$; austenite phase, $h = 1$), the microstructure function $m_s^h$, representing the volume fraction of local state $h$ at location $s$, should be calculated first. The two-point statistics are mathematically expressed as

$$f_r^{hh'} = \frac{1}{|S_r|} \sum_{s} m_s^{h}\, m_{s+r}^{h'} \tag{2}$$

where $r$ is the discretized vector placed in the microstructure and $|S_r|$ denotes the total number of valid trials associated with the discrete vector $r$. In this work, we only calculated the two-point autocorrelation statistics of the black phase in the microstructure, as shown in Figure 2B, which are obtained when $h = h'$. The RI2SS of the microstructure are then calculated, as shown in Figure 2C and D. For convenience, all of the statistics are referred to as autocorrelations. The peak value in Figure 2C represents the volume fraction of the target phase, and the main spatial features of the microstructure (average size and shape distribution of the phase, etc.) are contained in the central area that is enlarged in Figure 2D. In addition, the invariant value in the blue area is approximately equal to the square of the peak and represents redundant information.
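As an illustration of the autocorrelation described above, the following NumPy sketch computes non-periodic two-point autocorrelations of a binary phase mask with FFTs and then averages them over direction. It is a sketch under assumptions: the function names are hypothetical, and the simple radial averaging used here is only one way to remove the dependence on orientation; it is not necessarily the RI2SS algorithm of [23].

```python
import numpy as np

def autocorrelation(mask):
    """Non-periodic two-point autocorrelation of a 2D binary phase mask.

    Returns an array of shape (2*H, 2*W) with the zero vector at index (H, W).
    Each entry is (number of vector placements with both ends in the phase)
    divided by (number of valid placements of that vector).
    """
    m = np.asarray(mask, dtype=float)
    padded = tuple(2 * s for s in m.shape)        # zero padding avoids wrap-around
    fm = np.fft.rfftn(m, s=padded)
    fo = np.fft.rfftn(np.ones_like(m), s=padded)
    pairs = np.fft.irfftn(fm * np.conj(fm), s=padded)
    trials = np.rint(np.fft.irfftn(fo * np.conj(fo), s=padded))
    stats = np.where(trials > 0, pairs / np.maximum(trials, 1.0), 0.0)
    return np.fft.fftshift(stats)

def radial_average(stats, r_max):
    """Average a centered 2D autocorrelation map over all vectors of equal
    (rounded) length, for lengths 0..r_max pixels."""
    cy, cx = stats.shape[0] // 2, stats.shape[1] // 2
    yy, xx = np.indices(stats.shape)
    r = np.rint(np.hypot(yy - cy, xx - cx)).astype(int)
    keep = r <= r_max
    sums = np.bincount(r[keep], weights=stats[keep], minlength=r_max + 1)
    counts = np.bincount(r[keep], minlength=r_max + 1)
    return sums / counts
```

The value at the zero vector equals the volume fraction of the phase, and for a structure without long-range order the values at large vector lengths approach the square of that fraction, in line with the description of Figure 2C and D.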
PCA was used to reduce the dimensionality of the autocorrelations. One can obtain principal component scores (PCs), i.e., the low-dimensional features of a microstructure, by projecting its autocorrelations into a new space spanned by several orthogonal basis vectors. The vectors are ordered and selected according to their explained variance, which reflects the main variation of the samples. Mathematically, the original autocorrelations can be reconstructed by

$$f^{11,(j)} = \sum_{i=1}^{R} \alpha_i^{(j)} \varphi_i^{11} + \bar{f}^{11} \tag{3}$$

where $f^{11,(j)}$ represents the autocorrelations of the $j$th microstructure and $\bar{f}^{11} = \frac{1}{J}\sum_{j=1}^{J} f^{11,(j)}$, in which $\bar{f}^{11}$ and $J$ denote the ensemble average and the number of all of the autocorrelations. $\alpha_i^{(j)}$ and $\varphi_i^{11}$ represent the $i$th PC of the $j$th member and the $i$th basis vector, respectively. $R$ is the dimensionality of the autocorrelations. As the main parameter, $\alpha$ participates in the subsequent analysis and modeling.
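The projection and reconstruction described above can be sketched with a singular value decomposition in NumPy. This is an illustrative sketch, not the authors' implementation; `pca_fit` is a hypothetical helper, and each row of the input matrix is assumed to be one flattened autocorrelation.

```python
import numpy as np

def pca_fit(F, n_components):
    """PCA of a (J, R) matrix F with one flattened autocorrelation per row.

    Returns the ensemble average, the orthonormal basis vectors (rows),
    the PC scores of every member, and the explained-variance ratios.
    """
    mean = F.mean(axis=0)
    _, S, Vt = np.linalg.svd(F - mean, full_matrices=False)
    basis = Vt[:n_components]                 # ordered by explained variance
    scores = (F - mean) @ basis.T             # low-dimensional features
    explained = S ** 2 / np.sum(S ** 2)
    return mean, basis, scores, explained[:n_components]
```

With all non-trivial components retained (at most J − 1 for J samples), `scores @ basis + mean` reproduces the original rows; keeping only the first few components gives the compressed representation used for modeling.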

Modeling and evaluation
We used a classical ML regression model, Ridge regression [57], to build the SP linkage. By imposing the penalty $\lambda\|w\|_2^2$, Ridge can solve some problems of ordinary least squares. Mathematically, the objective of Ridge is to minimize a penalized residual sum of squares:

$$\min_{w} \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$

where $X$ and $y$ are the input features and the output yield strength in this study. $\lambda$ is the complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage, and thus the more robust the coefficients become to collinearity. $w$ represents the coefficients of $X$. The Ridge models in this work were trained with the scikit-learn library in Python 3.7 [58]. All of the models keep the default hyperparameters.
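The performance of these models was quantified by the root mean square error ($RMSE$) and the coefficient of determination ($R^2$), which are given as

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}$$

where $y_i$ and $\hat{y}_i$ are the experimental and predicted yield strength, respectively, and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ denotes the average of the $n$ samples. The smaller the $RMSE$ is, and the closer $R^2$ gets to 1, the higher the prediction accuracy. In addition, the leave-one-out cross-validation (LOOCV) approach was employed to evaluate $RMSE$ and $R^2$.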
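The modeling and evaluation procedure can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the helper names are hypothetical, the closed-form ridge solution with an unpenalized intercept is used in place of the scikit-learn estimator, and the default penalty of 1.0 mirrors the default `alpha` of scikit-learn's `Ridge`.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def loocv_predict(X, y, lam=1.0):
    """Leave-one-out predictions: each sample is predicted by a model
    trained on all the other samples."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        w, b = ridge_fit(X[train], y[train], lam)
        preds[i] = X[i] @ w + b
    return preds

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def r2(y, y_hat):
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))
```

In this setting `X` would hold the PC scores of the alloys (one row per alloy) and `y` the measured yield strengths; `rmse(y, loocv_predict(X, y))` and `r2(y, loocv_predict(X, y))` then give the cross-validated metrics.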

SCHEME
We propose a data-driven scheme for building the SP linkage that comprises five modules: dataset preparation, data preprocessing, microstructural information extraction, microstructural information optimization, and SP linkage construction [Figure 3]. All parameters and corresponding explanations are listed in Table 2. The details of applying this scheme to the ferrite steels are as follows:
1) Creating the experimental dataset.
2) Preprocessing the microstructure and property data by Otsu's threshold processing and the normalization operation mentioned above.
3) Extracting quantitative information of all microstructures by RI2SS.
4) Optimizing microstructural information to represent the structural features in the whole sample for each alloy. This procedure includes two sub-paths labeled by the colored arrows in the orange box in Figure 3:
a) Constructing the RVE set (confirming the size and number of the included SVEs). We randomly selected different numbers $n$ of SVEs with a certain size $d$ to form a subset, calculated their average autocorrelations (the simple arithmetic average of the autocorrelations of all SVEs in the subset), and then projected all of these averages into a PCA space to obtain the corresponding low-dimensional features $\alpha_d^n$. The two distances expressed by Equations (5) and (6) were next calculated, and their convergence across SVE sizes (interclass) and SVE numbers (intraclass) was assessed to confirm the optimal size $d^*$ of SVEs for constructing an RVE set. The locations of the SVEs with $d$ larger than $d^*$ are expected to cluster in the PCA space.
where the maximum and minimum are taken over the individual distances of all members. The smaller the distance to the ensemble average is, the closer the position of the autocorrelations of the SVEs in PCA space is to that of their ensemble average. A similar relationship applies to the distance to the target, except that the object of comparison becomes the target autocorrelations, which are obtained by averaging the autocorrelations of the large domains with a size of 5×. These large domains were selected because they contain sufficient structural features that are independent of their location in the sample, as shown in Figure 1.
We also propose a novel method called "recursive addition" to confirm the optimal number $n^*$ of SVEs for constructing an RVE set. As new SVEs are gradually introduced, the average autocorrelations include more and more structural information. If the diversity of the structural features in these SVEs is saturated enough to match the entire material sample, the locations of the average autocorrelations of $n$ SVEs of the optimal size $d^*$ will gather in a small area in the PCA space. In other words, the distance between two adjacent points in the space will stabilize around a sufficiently small value. This distance can be mathematically expressed as

$$\left\| \alpha_{d^*}^{n} - \alpha_{d^*}^{n-1} \right\| \tag{7}$$

where $\alpha_{d^*}^{n}$ denotes the low-dimensional features obtained after the $n$th SVE of size $d^*$ has been added.
b) Removing redundant information of the autocorrelations. We truncated the autocorrelations in the constructed RVE set by controlling the maximum length of the vector $r$ and then observed the variation of their variance in PCA space to explore the threshold of the tolerance factor $\varepsilon$ defined by Equation (1). Finally, the structural information that contains the least redundancy is retained.
5) Establishing the SP linkage. By inputting the low-dimensional features of the RVEs for the five alloys, we trained a Ridge regression model and assessed its accuracy in predicting yield strength. This process was also used to validate the reliability of the methods proposed in Procedure (4).
This scheme shown in Figure 3 was also performed on the dendrite solidification data from PFM for Al-Cu alloys and experimental data of Ni-Fe-based superalloys. The reliability and generalization ability of the scheme were also considered in this study.
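The recursive addition idea in Procedure (4) can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the PCA mean and basis are taken as given, and the convergence tolerance is a free parameter chosen by the user rather than a value prescribed here.

```python
import numpy as np

def recursive_addition_distances(sve_stats, mean, basis):
    """Distances between consecutive running averages in PCA space.

    sve_stats : (n, R) autocorrelations of the SVEs, in the order they are added
    mean      : (R,) ensemble average used to center the PCA space
    basis     : (k, R) PCA basis vectors (rows)

    Entry i is the Euclidean distance between the PC scores of the average
    of the first i + 2 SVEs and the average of the first i + 1 SVEs.
    """
    counts = np.arange(1, len(sve_stats) + 1)[:, None]
    running_avg = np.cumsum(sve_stats, axis=0) / counts
    scores = (running_avg - mean) @ basis.T
    return np.linalg.norm(np.diff(scores, axis=0), axis=1)

def smallest_converged_n(distances, tol):
    """Smallest number of SVEs from which every later step stays below tol,
    or None if the sequence never settles."""
    below = np.asarray(distances) < tol
    for k in range(len(below)):
        if below[k:].all():
            return k + 2
    return None
```

Because the order in which SVEs are added affects the curve, repeating the calculation over several random orders, as the random repetition in the text suggests, gives a more robust estimate of the required number of members.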

Construction of RVE set
Following the workflow shown in Figure 3, we traversed all possible combinations of the number $n$ and size $d$ of SVEs in the SVE pool, calculated their average autocorrelations, and then employed PCA to extract their low-dimensional features $\alpha_d^n$. All of the features were grouped into five clusters according to the different sizes of SVEs, and their distributions in the PCA space are shown in Figure 4. For a small $d$ (50× or 100×), the larger $n$ is, the more concentrated the distribution of sample points is, and the further away $\alpha_d^n$ is from $\alpha_{5\times}^{3}$, indicating that more structural features are included, but still not enough to match those of a larger microstructure. For a large $d$ (5×, 10×, or 20×), all of the points appear to be concentrated in a small region, which demonstrates that the structural diversity included in the extracted information has reached saturation.
To quantify the interclass and intraclass convergence observed above, we used Equations (5) and (6) to calculate the two normalized average distances. Figure 5A explains the calculation principle of a single distance; the red star point named target is associated with $\alpha_{5\times}^{3}$. The variation of these two distances with the size and number of SVEs is given in Figure 5B. The distance to the target quickly declines and gradually converges to less than 0.1 as the size of SVEs increases. Thus, the minimum size of SVE that can be used to construct an RVE set is determined as $d^*$ = 20×. When $d$ > 20×, the distance to the ensemble average first decreases quickly, then slowly, and finally approaches 0, indicating that the structural information contained in the SVEs reaches saturation. However, it should be emphasized that the calculation of this distance presupposes an SVE pool, which is inconsistent with the requirement of low cost and with the fact that only several SVEs are available in experiments. Therefore, the volume of the RVE set needs to be determined separately.
As for the volume of the RVE set, we propose a novel method named recursive addition based on the Euclidean distance in the PCA space. Figure 6B explains the rationale of the method by using a defined distance, $\|\alpha_{d^*}^{n} - \alpha_{d^*}^{n-1}\|$, which is mathematically expressed by Equation (7). Here, $d^*$ represents the optimal size of the SVEs in the RVE set, while $\alpha_{d^*}^{n}$ and $\alpha_{d^*}^{n-1}$ are the PC features of the $n$th and $(n-1)$th SVEs added gradually. This distance gradually decreases and eventually stabilizes in an acceptable range when new members are added continuously, as shown in the green shaded area in Figure 6B. We started from the five SVEs with the size of 20× for the alloys, then ran PCA on the autocorrelations of these microstructures to [...] quickly declines followed by a slow reduction. The distances for the alloys finally converge to less than 0.05.
As indicated by the vertical arrows in the figures, we chose six members to construct an RVE set, and the structural features contained in the set can consistently represent those of the whole sample.

Construction of structure-property linkage
We then employed Ridge regression to extract the SP linkage. The inputs of the model are the low-dimensional features ($\alpha_{20\times}^{6}$) of the constructed RVE set for the five experimental alloys, and the output is the yield strength. LOOCV was used to assess the prediction accuracy. For comparison, we also built three other Ridge models using different sets of inputs ($\alpha_{20\times}^{2}$, $\alpha_{20\times}^{10}$, and $\alpha_{5\times}^{3}$) obtained from the average autocorrelations of 2 and 10 SVEs with a size of 20× and 3 SVEs with a size of 5×, respectively. To avoid randomness, the selection procedure of these SVEs was repeated 100 times. Figure 7A and B show the distribution of $\alpha_{20\times}^{2}$, $\alpha_{20\times}^{6}$, $\alpha_{20\times}^{10}$ (hollow points), and $\alpha_{5\times}^{3}$ (solid points) in the PCA space. For each alloy, there is always one hollow point located further away from the solid point. After verification, we found that this isolated point is associated with $\alpha_{20\times}^{2}$, which is consistent with the results shown in Figure 6. In Figure 7C, the accuracies of the models with the inputs $\alpha_{20\times}^{6}$, $\alpha_{20\times}^{10}$, and $\alpha_{5\times}^{3}$ are extremely close to each other ($R^2$ values are 0.8430, 0.8680, and 0.8652, respectively), an improvement of at least 28.57% compared with the model using $\alpha_{20\times}^{2}$ as input. Figure 7D compares the experimentally measured yield strength with the values predicted by the model using $\alpha_{20\times}^{6}$ from the RVE set. The distribution of the points along the diagonal also indicates the high accuracy of the model. The mean absolute error (MAE) is less than 10 MPa (inset in Figure 7D), which demonstrates that the structural information contained in our constructed RVE set can be mapped to the macroscopic mechanical property of the whole sample.
To verify the generalization ability of the proposed method in constructing the RVE set, we applied this method to a dataset of dendrite solidification of Al-Cu alloys simulated by PFM [59][60][61][62][63]. The parameters of PFM are listed in Table S1. The dataset includes 48 microstructures that are produced by controlling solidification parameters, namely the number of primary grains, the anisotropy coefficient of the solid-liquid interfacial energy, and the nucleation supercooling. From the results shown in Figure S1, the differences among these microstructures come from the grain morphology and the volume fraction of the solid phase. RI2SS and PCA were then employed to extract their average autocorrelations and low-dimensional features. The distribution of the features is shown in Figure S2. Combining the results of Figures S1 and S2, we found that the microstructures distinguished by the nucleation supercooling and by the anisotropy coefficient are placed along the PC1 and PC2 vectors, respectively, indicating that the first two PCs reflect the variation of volume fraction and grain morphology. By applying the recursive addition method, as shown in Figure S3, an RVE set consisting of six SVEs was constructed. The established PS linkage shown in Figure S4 also reveals the reliability of this RVE set in extracting sufficient structural features. More importantly, the successful application of the proposed method to the simulation dataset supports its credibility and universality.

Identification of redundant statistics
Two-point statistical autocorrelations contain valuable information concentrated in the central area and vast redundant information in the peripheral area. Following the procedures shown in Figure 3, we truncated the autocorrelations of the microstructures in the RVE set, as displayed in Figure 8A. The maximum modulus of the vector $r$ is labeled as $|r|_{max}$. Using the truncated autocorrelations with a certain $|r|_{max}$ (20-100 pixels) for the five alloys, we created a PCA space, projected these statistics into the space, and then examined the variation of the PC variance, as shown in Figure 8B and C. As $|r|_{max}$ decreases, the cumulative variance of the first three PCs does not change significantly and that of the first two PCs increases slightly. When $|r|_{max}$ is reduced from 100 to 50 pixels, the individual variance of PC1 declines slowly and that of PC2 rises rapidly. When $|r|_{max}$ is reduced further, these tendencies reverse. Unlike the first two PCs, the individual variance of PC3 shows no consistent trend with $|r|_{max}$. Combining with Equation (3), we further investigated the variation of the first two PC basis vectors ($\varphi_1^{11}$ and $\varphi_2^{11}$) with the truncation of $|r|_{max}$. Figure S5 demonstrates that, when $|r|_{max}$ ≥ 50 pixels, if PC1 increases, the peak value of the autocorrelations, which is strongly associated with the volume fraction of the austenite phase, also increases. In other words, PC1 reflects the volume fraction. PC2 mainly relates to the peak value and the size of the central area, indicating that PC2 determines the volume fraction, average size, and distribution of the austenite phase. When $|r|_{max}$ < 50 pixels, PC1 is correlated not only with the phase volume fraction but also with the average size of the phase, and PC2 no longer contains the information about phase distribution.
Therefore, we hypothesized that effective information in the autocorrelations is removed when $|r|_{max}$ < 50 pixels, leading to a change in the physical meaning of the low-dimensional features and an abrupt change in the PC variance.
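The truncation experiment above can be sketched in NumPy as follows. This is an illustration under assumptions: the helper names are hypothetical, and the autocorrelations are assumed to be stored as centered 2D maps with the zero vector at index (H // 2, W // 2); if the statistics are stored as 1D radial profiles instead, truncation reduces to slicing.

```python
import numpy as np

def truncate(stats, r_max):
    """Keep only the entries of centered 2D maps whose vector length is at
    most r_max pixels. Accepts one map (H, W) or a stack (J, H, W) and
    returns the retained entries flattened along the last axis."""
    stats = np.asarray(stats)
    cy, cx = stats.shape[-2] // 2, stats.shape[-1] // 2
    yy, xx = np.indices(stats.shape[-2:])
    disc = np.hypot(yy - cy, xx - cx) <= r_max
    return stats[..., disc]

def explained_variance_ratio(F):
    """Fraction of the total variance carried by each principal component
    of a (J, R) data matrix."""
    S = np.linalg.svd(F - F.mean(axis=0), compute_uv=False)
    return S ** 2 / np.sum(S ** 2)
```

Evaluating `explained_variance_ratio(truncate(maps, r))` for a range of `r` values gives curves of individual and cumulative PC variance against the truncation length, analogous to those discussed for Figure 8B and C.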
While removing redundancy in the statistical autocorrelations, the distribution of the SVEs of the RVE set in the PCA space was also altered, as shown in Figure S6. When $|r|_{max}$ ≥ 50 pixels, the low-dimensional features hardly change during truncation, while obvious changes can be observed when $|r|_{max}$ < 50 pixels. We quantified the distances between these low-dimensional points and their centroid for each alloy; the variance of these distances was labeled as the intraclass variance, while the interclass variance represented that between the centroids of the five alloys. Figure 9A visualizes the two variances as a function of $|r|_{max}$. When $|r|_{max}$ is reduced from 100 to 50 pixels, the intraclass variance slightly decreases and the interclass variance varies with the opposite tendency, as indicated by the dotted arrows. The discrepancy of the curves shown in Figure 9B within the same class (a certain alloy) is dominated by the external redundancy included in the autocorrelations, whereas the difference between classes (different alloys) is mainly determined by the central areas of the autocorrelations. The two variances compete with each other, resulting in little variation in the overall population (orange line in Figure 8B). When $|r|_{max}$ < 50 pixels, the valuable information in the central area starts to be eliminated, the interclass and intraclass variances both increase, and the overall variation is also intensified (orange line in Figure 8B). Therefore, these results support our hypothesis above, i.e., the critical length of $|r|_{max}$ that distinguishes valuable information from redundant information is 50 pixels for our microstructures. Three quantitative metrics were calculated from the scatter plots of the low-dimensional features, as shown in Figure S6.
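The two variances described above can be computed along the following lines. This is a sketch under assumptions: `class_variances` is a hypothetical helper, and the interclass term is taken here as the variance of the distances between the class centroids and their overall centroid, which is one possible reading of "between the centroids".

```python
import numpy as np

def class_variances(scores, labels):
    """Intraclass and interclass variance of points in PCA space.

    scores : (N, k) PC scores of all SVEs
    labels : (N,) class label (e.g., alloy) of each SVE

    Intraclass: variance of the distances between each point and the
    centroid of its own class. Interclass (assumption): variance of the
    distances between the class centroids and their overall centroid.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.array([scores[labels == c].mean(axis=0) for c in classes])
    intra = np.concatenate([
        np.linalg.norm(scores[labels == c] - centroids[k], axis=1)
        for k, c in enumerate(classes)
    ])
    inter = np.linalg.norm(centroids - centroids.mean(axis=0), axis=1)
    return float(intra.var()), float(inter.var())
```

Repeating the calculation for each truncation length gives two curves comparable in spirit to Figure 9A.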

Improvement of structure-property linkage
The tolerance factor $\varepsilon$ defines a length scale feature of the microstructure called the coherence length $r_c$ by Equation (1). However, a definite threshold of $\varepsilon$ is still unknown. An excessively large $\varepsilon$ will mislead the choice of $r_c$ and may lead to a failure of the SP linkage. This section is mainly devoted to confirming a precise threshold to improve the established SP linkage.
By calculating the pair correlation function (PCF) of the average autocorrelations in the RVE set, we rewrote the left-hand side of Equation (1) in terms of the PCF [23]. For convenience, this term is referred to below as the difference between the PCF at |r| and the squared volume fraction; it is a function of |r|. Figure 10A shows this difference as a function of |r| for the five alloys. When |r| ≥ 50 pixels, all of the curves reach a plateau, where the difference is less than a threshold of 0.005; thus, the critical coherence length was confirmed to be 50 pixels (32.52 ).
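The plateau criterion above can be expressed as a short search over the PCF curve. The sketch below is an illustration under stated assumptions: the function name `coherence_length` is hypothetical, `pcf[i]` is assumed to hold the PCF value at |r| = i pixels, and the criterion is assumed to apply to the absolute difference between the PCF and the squared volume fraction.

```python
import numpy as np

def coherence_length(pcf, volume_fraction, tol=0.005):
    """Smallest |r| (pixels) from which |pcf - vf**2| stays below tol.

    Returns None if the curve never settles below tol within the sampled range.
    """
    pcf = np.asarray(pcf, dtype=float)
    deviation = np.abs(pcf - volume_fraction ** 2)
    # Indices where the deviation still violates the tolerance.
    above = np.nonzero(deviation >= tol)[0]
    if above.size == 0:
        return 0
    r_c = int(above[-1]) + 1
    return r_c if r_c < pcf.size else None

# Synthetic PCF decaying from vf to vf**2 (illustrative data only).
r = np.arange(101)
vf = 0.3
synthetic_pcf = vf ** 2 + (vf - vf ** 2) * np.exp(-r / 10.0)
print(coherence_length(synthetic_pcf, vf))
```

Taking the last violation rather than the first crossing guards against curves that oscillate around the squared volume fraction before settling.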
Using the autocorrelations truncated at different |r| (20-100 pixels), we established several SP linkages by Ridge regression and employed LOOCV to evaluate their accuracy, as shown in Figure 10B. The MAE of the model with |r| equal to the coherence length is 9.67% lower than that with |r| = 100 pixels (no truncation), and R² is 2.62% higher. When |r| falls below the coherence length, the performance of the models starts to deteriorate. The results of the best model are highlighted by red and cyan solid points in Figure 10B, and its predictions agree well with the experiments, as shown in Figure 10C. The embedded subgraph shows that the MAE between the predicted and experimental values is within 6.28 MPa, which is 37.2% lower than the result in Figure 7D.
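The workflow in this paragraph (truncate, reduce by PCA, fit Ridge regression, score by LOOCV) can be sketched with NumPy alone. This is a minimal illustration rather than the authors' code: all function names are hypothetical, the truncation is assumed to be a square window around the centre of a centred 2-D autocorrelation, and PCA is fitted once on all samples before cross-validation for brevity.

```python
import numpy as np

def truncate_center(stats, r_max):
    """Keep the central (2*r_max+1)^2 window of centred 2-D autocorrelations."""
    n, h, w = stats.shape
    cy, cx = h // 2, w // 2
    win = stats[:, cy - r_max:cy + r_max + 1, cx - r_max:cx + r_max + 1]
    return win.reshape(n, -1)

def pca_scores(X, n_components):
    """PC scores from the SVD of column-centred data."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def loocv_ridge_mae(Z, y, alpha=1.0):
    """Leave-one-out MAE of ridge regression with an unpenalised intercept."""
    n = len(y)
    errors = []
    for i in range(n):
        train = np.arange(n) != i
        z_mean, y_mean = Z[train].mean(axis=0), y[train].mean()
        Zc = Z[train] - z_mean
        # Closed-form ridge solution on the centred training data.
        A = Zc.T @ Zc + alpha * np.eye(Z.shape[1])
        w = np.linalg.solve(A, Zc.T @ (y[train] - y_mean))
        pred = (Z[i] - z_mean) @ w + y_mean
        errors.append(abs(pred - y[i]))
    return float(np.mean(errors))

# Illustrative random data standing in for autocorrelations and yield strengths.
rng = np.random.default_rng(0)
stats = rng.random((12, 41, 41))
strength = rng.random(12) * 100.0
for r_max in (20, 10, 5):
    Z = pca_scores(truncate_center(stats, r_max), n_components=3)
    print(r_max, loocv_ridge_mae(Z, strength))
```

Looping over truncation lengths in this way would give the accuracy-versus-|r| curve of the kind shown in Figure 10B; a stricter protocol would refit the PCA inside each LOOCV fold.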
We further applied the procedure in Figure 3 to a Ni-Fe-based superalloy dataset to explore the impact of redundancy removal on the accuracy of the SP linkage and on the threshold of the tolerance factor. The dataset was collected from [54]. Note that the microstructures shown in Figure S7 differ markedly in morphology from our experimental ones shown in Figure 1. The extraction of low-dimensional features and the construction of the SP linkage on this superalloy dataset are visualized in Figures S8 and S9. The accuracy of the models as a function of |r| follows a trend similar to that in Figure 10B, and the threshold used to determine the coherence length is again less than 0.005, demonstrating the reliability and universality of this threshold in determining the coherence length of experimental microstructures and in assisting the establishment of SP linkages.

Advantages of the quantitative metrics based on PCA
Quantitative comparison between two microstructures has long been a challenging problem. To address it, Niezgoda et al. developed a metric, defined for a given size of the selected SVEs, that reflects the root-mean-square error between the two-point statistics of each SVE and the target ensemble-averaged statistics [28]. When the errors for all SVEs are small and close to each other, the amount of information included in the two-point statistics of these SVEs is saturated and independent of the size and number of the SVEs. Niezgoda et al. used this metric to successfully construct an RVE set that can be used to predict mechanical properties in a computationally economical manner [22]. Figure S10E shows how this metric varies with |r| in the autocorrelations. Clearly, it is strongly correlated with |r|. In other words, even for the same group of microstructures, the metric fails to provide a specific and valuable reference for different operators. In addition, as for two SVEs ( ) and ( ) located at different spatial positions of the same sample, the two elements ( )

Our proposed quantitative metrics based on PCA overcome the defects above. From Equations (5)-(7), the metrics are distance measurements between the low-dimensional features of the two SVEs. Generally, the first several PCs have interpretable physical meanings [29,34,43]. For the RVE set constructed in the present study, PC1 represents the volume fraction of the austenitic phase and PC2 quantifies the average size and distribution of this phase (a detailed explanation is provided in Figure S5). Thus, the metrics based on PCA can rigorously measure the degree of similarity in the physical features of microstructures. In addition, when |r| exceeds the coherence length, the metrics are independent of |r|, as demonstrated by the results in Figure S6. The reason is that the valuable information contained in the autocorrelations is compressed into the first several PCs, while the lower-ranked PCs, which contain a large amount of redundant information, are truncated, so the low-dimensional features remain invariant. Therefore, the quantitative metrics based on PCA are reliable and robust in measuring the similarity of the physical features of microstructures.
In summary, our metrics differ from the metric of Niezgoda et al. in three ways: (1) In form, their metric reflects the distance between two selected two-point statistics, while our metrics reflect the discrepancy between the low-dimensional PC features. (2) In physical meaning, their metric can only reflect the average degree of similarity in the morphological pattern of the two-point statistics, while our metrics also rigorously measure the distance in physical features (phase volume fraction, average grain size, etc.) of the microstructures. (3) In stability, their metric is strongly affected by the dimensionality of the two-point statistics, while our metrics depend only on the microstructures.

Advantages and limitations of the method of optimizing microstructural information
Optimization of microstructural information in this study includes two aspects: construction of the RVE set and removal of redundancy. There are two premises for an RVE set: (1) the size of its members is large enough to ensure statistical homogeneity in the spatial distribution of the structural features that can be mapped to macroscopic mechanical properties; and (2) the dispersion of structural features in the RVE set should match that of the entire material sample [28]. The microstructures contained in the RVE set are independent of their size and location in the samples [Figures 4 and 5], which meets the first condition. The RVE set absorbs enough structural features that its average autocorrelations converge in PCA space [Figure 6], indicating that the second condition has been met. In addition, the method was successfully applied to datasets of experimental ferrite steels, dendrite solidification of Al-Cu alloys simulated by PFM, and experimental Ni-Fe-based superalloys collected from the literature, and SP linkages with high precision were established by Ridge regression. These results demonstrate the scientific nature, reliability, and universality of the method.
The developed method is an improved version of the average approximation method for constructing an RVE set and, like that method, is not readily applicable to samples with microstructure gradients. Examples include some additively manufactured samples with coarse columnar grains, where the "average grain size" characteristic is meaningless [64], and samples with high inhomogeneity in the size or distribution of the thermodynamic phases, where averaging loses the local variation of the structure [65]. A rough solution for obtaining the statistical information of the overall sample from SVEs with local gradients is to retain all the original two-point statistics of the SVEs. However, the resulting curse of dimensionality is beyond the reach of conventional dimensionality reduction algorithms such as PCA. If one insists on extracting the two-point statistics of the sample by ensemble averaging the statistics from multiple SVEs, the long-range correlations may be missed. In other words, the SVE size used to construct an RVE set must exceed the coherence length of the microstructure when long-range order plays a significant role in the physics of the system [28]. Therefore, the construction of an RVE for samples with microstructure gradients or inhomogeneity remains a challenging task, and it is one of the areas that we will investigate in the future.
Another interesting topic in this study is the tolerance factor that is used to determine the coherence length, as expressed by Equation (1). From this equation, one can see that, once the tolerance factor is fixed, the values of the autocorrelations with |r| greater than the coherence length fluctuate within a narrow interval centered on the squared volume fraction, with a half-width equal to the tolerance factor. For a microstructure with time dependence and strongly coupled, long-range phenomena such as diffusion, the coherence length must change over time, and so does the size of the RVE [66]. However, the length of the interval above, twice the tolerance factor, is a scalar set by the error margin between the autocorrelations and the squared volume fraction when |r| is larger than the coherence length, and it is independent of time. In other words, evolution time and long-range phenomena may have a significant effect on the coherence length but not on the tolerance factor. Based on the aging microstructures of ferrite steels and the creep microstructures of the collected Ni-Fe-based superalloys, we confirmed the critical threshold of the tolerance factor to be 0.005, at which the main statistical information is retained. To verify the inference that this threshold may also be suitable for evolving microstructures with long-range diffusion, we used a case study of the coarsening of polydisperse particles. Our previous work investigated the effects of two characteristics of a particle cluster, i.e., the particle number and the particle density in the cluster, on the kinetics of transient coarsening [67]. The microstructures in Figure 3 of Wang et al. were used to analyze the problem above [67].
Figure S11 provides these microstructures (four groups with different combinations of particle number and particle density) and the variation of the PCF with |r|; the detailed calculation process is also illustrated in Figure S11. For each group of parameters, the coherence length was determined as the minimal |r| at which the PCF curve reaches a plateau. Interestingly, as shown in Figure S12, even though the coherence length changes to different degrees over the evolution time, the difference between the PCF and the squared volume fraction for |r| beyond the coherence length, i.e., the tolerance factor, remains low (< 0.005). This indicates that the coherence length is affected by solute diffusion during evolution, whereas the tolerance factor is indeed independent of evolution time or long-range diffusion. Additionally, a tolerance factor of 0.005 may be a generalized criterion for determining the coherence length of a microstructure, as supported by the results shown in Figure 10, Figure S9, and the gray shaded area in Figure S11.

Application prospects and limitations for the proposed scheme
In practical applications of the proposed scheme, flexible feature selection is allowed, such as adding other necessary factors to the low-dimensional PC features of the microstructures. These factors can be directly measured from the material samples or filtered by feature engineering. Factors that may be important to the yield strength (grain size, precipitated phase, dislocation, etc.) were indeed ignored in our study, resulting in a seemingly "capped" predictability of the final model even with the optimized parameter selection, i.e., an R² value of < 90%. Strictly speaking, these factors should be incorporated into our scheme to produce a more scientific and robust SP linkage. However, one main contribution of this research is to provide a practical computing strategy for constructing an RVE set, whose final optimization objective is the size and number of small sub-domains in the microstructures. Our scheme successfully addressed this goal, although the established SP linkage could be further improved by considering more factors. Readers who attempt to use the scheme to predict microstructure-related mechanical properties of interest are advised to conduct further characterization and/or measurements, guided by expert knowledge, to obtain sufficient inputs for the ML model. It is important to note that these supplementary factors and the low-dimensional PC features can be combined to train an ML model in the same way as in this study. The only effort required is to increase the number of input features.
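As a minimal sketch of the last point, supplementary factors can simply be appended as extra columns to the PC features before training. The function name `build_inputs` and the example values are hypothetical, and the standardization step is an assumption added here so that features on different scales are penalized comparably by a regularized model; the original study does not specify it.

```python
import numpy as np

def build_inputs(pc_scores, extra_factors):
    """Concatenate PC features with extra descriptors and standardize columns.

    Returns the standardized matrix plus the column means and standard
    deviations, so that new samples can be transformed consistently.
    """
    X = np.hstack([np.asarray(pc_scores, dtype=float),
                   np.asarray(extra_factors, dtype=float)])
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant columns unscaled
    return (X - mu) / sd, mu, sd

# Illustrative values: two PC scores plus one measured factor per sample.
pcs = np.array([[0.1, -0.3], [0.4, 0.2], [-0.5, 0.1]])
grain_size = np.array([[12.0], [15.0], [9.0]])
X_std, mu, sd = build_inputs(pcs, grain_size)
print(X_std.shape)
```

The resulting matrix can then be passed to the same regression and cross-validation routine used for the PC features alone.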
A major strength of the proposed scheme is its ability to extract reliable low-dimensional features by optimizing the structural information in an RVE set. Using these features, one can place microstructures into the correct classes by flexibly combining supervised or semi-supervised ML algorithms, in order to study the relationship between structural features and the resulting properties for a specific material system such as Ag-Al-Cu ternary eutectic alloys or superalloys with multiple strengthening patterns [40,43,68]. Therefore, the scheme can accelerate and improve the procedure of microstructure classification. In addition, our previous study showed that introducing two-point statistical information on microstructures can enhance PSP linkages, which may be further improved by the scheme in this study [7]. We believe that the scheme is capable of predicting mechanical properties for most material systems, especially in the case of long-term service in a harsh experimental environment [31][32][33]. Unfortunately, to the best of our knowledge, no effort has yet been made to apply two-point statistical information to alloy design. Molkeri et al. showed that explicit incorporation of microstructure knowledge in the materials design framework can significantly enhance the materials optimization process [8], and we previously developed an iterative strategy to search for ultra-strength martensitic stainless steels in a global-oriented manner [48]. These two studies provide confidence that our scheme [Figure 3], combined with the previously proposed iteration strategy, can be applied to rapidly discover new alloys in experiments. In conclusion, our scheme has broad application prospects in microstructure classification, property prediction, and reverse engineering for designing new materials.

CONCLUSIONS
We propose a novel scheme to construct the SP linkage by optimizing microstructural information, which is achieved by employing RI2SS, PCA, and Ridge regression. A small experimental dataset for ferrite steels was created. We designed three reliable and robust distance metrics to quantify the optimal size (20 ) and number (6) of members in the RVE set. While constructing the set, an innovative method called recursive addition was developed. The primary SP linkage achieves high accuracy (R² = 0.8680, MAE < 10 MPa).
After removing redundant information in the autocorrelations, the MAE was reduced by 37.2% (R² = 0.8941, MAE < 6.28 MPa). As another contribution of this work, the threshold of the tolerance factor that determines the coherence length in a microstructure was confirmed to be 0.005. Finally, the scientific nature, reliability, and universality of this scheme were demonstrated on two other datasets (dendrite solidification data of Al-Cu alloys simulated by PFM and experimental Ni-Fe-based superalloy data collected from the literature). More importantly, the scheme is expected to have broad application prospects in microstructure classification, property prediction, and alloy design.