^{*}Correspondence to: Prof. Gian-Marco Rignanese, Institute of Condensed Matter and Nanosciences (IMCN), UCLouvain, Chemin des Étoiles 8, Louvain-la-Neuve B-1348, Belgium. E-mail:

To improve the precision of machine-learning predictions, we investigate various techniques that combine multiple fidelity sources for the same property. In particular, focusing on the electronic band gap, we aim for the lowest error by taking advantage of all available experimental measurements and density-functional theory calculations. We show that learning on the difference between high- and low-fidelity values, treated as a correction, significantly improves the results compared to learning on the high-quality experimental data alone. As a preliminary step, we also introduce an extension of the MODNet model, which consists in using a genetic algorithm for hyperparameter optimization. Thanks to this, MODNet is shown to achieve excellent performance on the Matbench test suite.

The discovery of functional materials is at the origin of many technological advances, from batteries to optoelectronic devices^{[1]}. Given the extent of the compound space, fast and reliable screening is essential to identify new and interesting candidates. In this framework, thanks to the growing number of available experimental and theoretical data^{[2-4]}, machine learning (ML) has recently emerged as an extremely useful tool^{[5-7]}. However, obtaining reliable ML models typically requires large and high-accuracy datasets. Unfortunately, there is often an inverse correlation between quantity and quality. Large datasets are, in many cases, theoretical ones, such as those based on cheap density-functional theory (DFT) functionals, while high-accuracy experimental datasets are usually rather small. For instance, considering the electronic band gap (i.e., without excitonic effects), the Materials Project^{[8]} contains over 100,000 calculations^{[9]}, while the experimental dataset collected by Zhuo et al.^{[10]} consists of two orders of magnitude fewer measurements. DFT band gaps typically show a systematic underestimation of 30–100% with respect to the experimental results^{[11]}. Therefore, a model built on such data will present systematic errors with respect to reality. More rigorous alternatives exist, such as many-body perturbation theory within the GW approximation^{[12]}, but at a much higher computational cost.

More generally, material properties often come with different degrees of accuracy. The most straightforward case is a dataset gathering both experimental and DFT results, but it is also common to see a dataset combining calculations performed with different exchange-correlation functionals, such as PBE and the more accurate Heyd–Scuseria–Ernzerhof (HSE) functional^{[13]}.

For screening materials, one ideally wants the best estimate of the actual (i.e., experimental) value of the required property. This means that models should, in principle, only be built from scarce experimental data. In practice, it is, however, possible to gain knowledge from larger but lower-quality datasets in order to improve predictions of the experimental quantity. This idea has been investigated previously^{[14, 15]}. Kauwe et al.^{[14]} combined multiple learners, forming a so-called stacked ensemble, while Chen et al.^{[15]} built a convolutional graph neural network based on MEGNet, where the fidelity of each sample is encoded through an embedding to form an additional state feature of the crystal. The latter method has the advantage of working on sets of compounds that can be very diverse, in the sense that not all compounds have to be present in each dataset (in contrast to Kauwe's method). However, given the complexity of the graph neural network, the errors are still slightly higher than with state-of-the-art methods relying on the smaller experimental dataset only^{[16]}.
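The stacking idea can be sketched with scikit-learn's `StackingRegressor`: out-of-fold predictions of several base learners are fed to a final meta-learner. The data and model choices below are illustrative stand-ins, not the setup of ref. [14].

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Illustrative stand-in for featurized compositions and band gaps.
X = rng.normal(size=(200, 8))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

# Stacking: out-of-fold predictions of the base learners become the
# inputs of a final meta-learner (here a simple ridge regression).
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=3,
)
stack.fit(X, y)
pred = stack.predict(X[:5])
print(pred.shape)  # (5,)
```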

Another popular strategy is transfer learning. In this approach, a neural network is first fitted on a large source dataset and then fine-tuned on a smaller target dataset. The network thereby transfers knowledge (embedded in its weights) from the source to the target task. This technique has been successfully applied in several studies, covering properties from Li-ion conductivity for solid electrolytes to steel microstructure segmentation^{[17-23]}. To be effective, the source task should be closely related to the target task.
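A minimal sketch of this pretrain-then-fine-tune loop, using scikit-learn's `MLPRegressor` with `warm_start` as a lightweight stand-in for the networks used in the cited studies; the datasets are synthetic placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-ins: a large low-fidelity source set and a small high-fidelity
# target set sharing the same input features (illustrative only).
X_src = rng.normal(size=(2000, 10))
y_src = X_src @ rng.normal(size=10)            # source task
X_tgt = rng.normal(size=(100, 10))
y_tgt = X_tgt @ rng.normal(size=10) + 0.3      # related, shifted target task

# Pre-train on the large source dataset.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
model.fit(X_src, y_src)

# Fine-tune: with warm_start=True, fit() continues from the current weights,
# here for a few extra iterations on the small target set.
model.set_params(warm_start=True, max_iter=50)
model.fit(X_tgt, y_tgt)
pred = model.predict(X_tgt[:3])
print(pred.shape)
```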

In this work, we compare different techniques that combine multiple fidelity sources for the same property in order to improve the accuracy of the predictions. The property of interest is the experimental band gap. Various studies have tackled the band gap problem (experimental and simulated), from composition- or structure-specific tasks to more general approaches^{[10, 18, 24-29]}, with efforts on multi-fidelity approaches emerging only more recently^{[14, 15]}. In particular, we aim for the lowest prediction error on the experimental band gap for any structure by combining experimental measurements with both PBE and HSE DFT calculations. We show that an improvement of 17% can be achieved with respect to learning on the high-quality experimental data alone, by learning on the difference between high- and low-fidelity values. By contrast, ensembling does not appear to be particularly helpful. As a preliminary step towards improving the results further, we also introduce an extension of the Material Optimal Descriptor Network (MODNet) model, consisting in a new hyperparameter optimization procedure relying on a genetic algorithm (GA). We verify that, thanks to this extension, MODNet further improves its performance on the Matbench test suite.

MODNet is used throughout this work. It is an open-source framework for predicting materials properties from primitives such as composition or structure^{[16]}, designed to make the most efficient use of small datasets. The model relies on a feedforward neural network combined with the selection of physically meaningful features, which reduces the feature space without requiring a massive amount of data. First, to perform well at low data size, features are generated with matminer and are therefore derived from chemical, physical, and geometrical considerations. Part of the learning is thus already done, as these features exploit existing chemical knowledge, in contrast to graph networks. Second, for small datasets, a small feature space is essential to limit the curse of dimensionality. An iterative selection procedure is used, based on a relevance-redundancy criterion measured through the normalized mutual information between all pairs of features, as well as between features and targets. MODNet has been shown to be very effective in predicting various properties of solids with small datasets. The reader is referred to ref.^{[16]} for more details.
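The relevance-redundancy selection can be illustrated as a greedy loop; the sketch below is a simplified variant (unnormalized mutual information, plain mean redundancy) of the MODNet procedure, applied to synthetic stand-in data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def select_features(X, y, k):
    """Greedy relevance-redundancy selection (simplified sketch): at each
    step, pick the feature whose relevance to the target, penalized by its
    redundancy with the already-selected features, is highest."""
    relevance = mutual_info_regression(X, y, random_state=0)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best, best_score = None, -np.inf
        for j in remaining:
            if selected:
                # Mean mutual information with already-selected features.
                red = mutual_info_regression(
                    X[:, selected], X[:, j], random_state=0).mean()
            else:
                red = 0.0
            score = relevance[j] - red
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 5] = X[:, 0] + 0.01 * rng.normal(size=300)   # redundant copy of feature 0
y = X[:, 0] + X[:, 1]

picked = select_features(X, y, k=2)
print(picked)
```

The redundant twin of feature 0 is penalized once its counterpart has been selected, so the second pick favors the genuinely complementary feature.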

In the present work, we first introduce a new approach for choosing the hyperparameters of MODNet, relying on a genetic algorithm (GA). After creating an initial population of hyperparameter sets, the best individuals (selected on a validation loss) are propagated to further generations by mutation and crossover. Eventually, the model architecture with the lowest validation loss is selected. The activation function, loss, and number of layers are fixed to, respectively, an exponential linear unit, the mean absolute error, and 4. The number of neurons per layer (from 8 to 320) is optimized by the GA, together with the other searchable hyperparameters; the implementation is available in the MODNet repository^{[30]}.
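A toy version of such a GA loop is sketched below. The validation loss is replaced by a fast synthetic surrogate (a real run would train and validate a MODNet model for each individual), and the learning-rate gene is an illustrative assumption, not part of the search space described above.

```python
import random

random.seed(0)

# Search space loosely mirroring the text: neurons per layer from 8 to 320,
# plus a hypothetical learning-rate gene for illustration.
NEURONS = list(range(8, 321, 8))
LRS = [1e-2, 5e-3, 1e-3, 5e-4]

def validation_loss(ind):
    """Stand-in for training a model and measuring its validation MAE.
    A smooth synthetic surrogate with an optimum at (160, 1e-3) keeps
    the sketch fast and deterministic."""
    n, lr = ind
    return ((n - 160) / 160) ** 2 + abs(lr - 1e-3) * 100

def mutate(ind):
    n, lr = ind
    if random.random() < 0.5:
        n = random.choice(NEURONS)    # mutate one gene at random
    else:
        lr = random.choice(LRS)
    return (n, lr)

def crossover(a, b):
    return (a[0], b[1])               # swap genes between two parents

# Initial population, then elitist selection + crossover + mutation.
pop = [(random.choice(NEURONS), random.choice(LRS)) for _ in range(20)]
for _ in range(15):
    pop.sort(key=validation_loss)
    elite = pop[:5]                   # keep the best individuals
    children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                for _ in range(15)]
    pop = elite + children

best = min(pop, key=validation_loss)
print(best)
```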

The GA retains randomness while giving more weight to local optima. A satisfactory set of hyperparameters is therefore found more quickly and at a reduced computational cost compared to the standard grid or random search previously used^{[31]}. As shown below, this approach yields a relative improvement of up to 12% on the Matbench tasks compared to the previously used grid search. Moreover, the resulting neural networks remain small (four layers), which keeps training and prediction times short.

We benchmarked MODNet with GA optimization on the Matbench v0.1 test suite provided by Dunn et al.^{[25]}, following the standard test procedure (nested five-fold cross-validation). The suite contains 13 materials properties from 10 datasets ranging from 312 to 132,752 samples, representing both relatively scarce experimental data and comparatively abundant data, such as DFT formation energies. Inputs are crystal structures for computational datasets or compositions for experimental measurements. The tasks are either regression or classification. MODNet was applied to all 13 Matbench tasks.

MODNet is compared with four models (ALIGNN^{[32]}, AMME^{[25]}, CrabNet^{[33]}, and CGCNN^{[34]}), which are the current leaders for at least one of the Matbench tasks. The Atomistic Line Graph Neural Network (ALIGNN) is a graph convolution network that explicitly models two- and three-body interactions by composing two edge-gated graph convolution layers, the first applied to the atomistic line graph (representing triplet interactions) and the second applied to the atomistic bond graph (representing pair interactions)^{[32]}. Automatminer Express (AMME) is a fully automated machine-learning pipeline for predicting materials properties based on matminer^{[35]}. The Compositionally Restricted Attention-Based network (CrabNet) is a self-attention-based model, which has more recently been reported as a state-of-the-art model for predicting materials properties from composition only^{[33]}. The Crystal Graph Convolutional Neural Network (CGCNN) provides highly accurate predictions on larger datasets^{[34]}. The performance of each model is compared using the mean absolute error (MAE) for regression tasks or the area under the receiver operating characteristic curve (ROC-AUC) for classification tasks. The best score is reported in bold for each task. Furthermore, we also provide two baselines: (i) a random forest (RF) regressor using features from the Sine Coulomb Matrix and MagPie featurization algorithms; and (ii) a dummy model predicting the mean of the training set for regression tasks or randomly selecting a label in proportion to the training-set distribution for classification tasks^{[25]}.

Matbench v0.1 results for MODNet, Automatminer Express (AMME), CrabNet, CGCNN, MEGNet, a random forest (RF) regressor, and a dummy predictor. The scores are the MAE for regression (R) tasks or the ROC-AUC for classification (C) tasks. The tasks are ordered by increasing number of samples in the dataset.

Steel yield strength (MPa) | 312 | [R] | — | 97.5 | 107.3 | — | 103.5 | 229.7
Exfoliation energy (meV/atom) | 636 | [R] | 43.4 | 39.8 | 45.6 | 49.2 | 50.0 | 67.3
Freq. at last phonon PhDOS peak (cm^{-1}) | 1,265 | [R] | 34.3 | 56.2 | 55.1 | 57.8 | 67.6 | 324.0
Expt. band gap (eV) | 4,604 | [R] | — | 0.416 | 0.346 | — | 0.406 | 1.144
Refractive index | 4,767 | [R] | 0.345 | 0.315 | 0.323 | 0.599 | 0.420 | 0.809
Expt. metallicity | 4,921 | [C] | — | 0.921 | — | — | 0.917 | 0.492
Bulk metallic glass formation | 5,680 | [C] | — | 0.861 | — | — | 0.859 | 0.492
Shear modulus (log10 GPa) | 10,987 | [R] | 0.073 | 0.087 | 0.101 | 0.090 | 0.104 | 0.293
Bulk modulus (log10 GPa) | 10,987 | [R] | 0.057 | 0.065 | 0.076 | 0.071 | 0.082 | 0.290
Formation energy of perovskite cell (eV) | 18,928 | [R] | 0.091 | 0.201 | 0.406 | 0.045 | 0.236 | 0.566
MP band gap (eV) | 106,113 | [R] | 0.220 | 0.282 | 0.266 | 0.297 | 0.345 | 1.327
MP metallicity | 106,113 | [C] | 0.913 | 0.909 | — | 0.952 | 0.899 | 0.501
Formation energy (eV/atom) | 132,752 | [R] | 0.045 | 0.173 | 0.086 | 0.034 | 0.117 | 1.006

As shown in the table above, MODNet with GA optimization achieves a relative improvement of up to 12% with respect to the previous grid-search approach^{[31]}. This shows the importance of hyperparameter optimization for accurate generalization.

Three different datasets for the electronic band gap with varying levels of accuracy were used in this study: (i) DFT computational results using the PBE functional; (ii) DFT computational results using the HSE functional; and (iii) experimental measurements. They are referred to as PBE, HSE, and EXP data, respectively.

The PBE data were retrieved from Matbench v0.1, covering a total of 106,113 samples^{[25]}. Compounds containing noble gases or lying more than 150 meV above the convex hull in formation energy were removed.

The HSE data were recovered from the work by Chen et al.^{[15]}. The HSE functional typically provides more accurate results than PBE, but it is computationally more expensive. The HSE dataset contains 5987 samples.

For the experimental data, we started from the dataset gathered by Zhuo et al.^{[10]}, which covers 4604 compositions and is available through matminer^{[35]}. This dataset is referred to as EXP. Associating these compositions with known structures^{[36]} yields an EXP dataset containing a total of 2480 samples with an associated structure. These are considered the true values that the multi-fidelity ML models should predict.

Venn diagram over the structures for the different fidelity datasets used in this work. Numbers represent the number of samples in each corresponding intersection.

To test the different learning approaches, we adopted the following systematic procedure. We held out 20% of the EXP data as a test set and trained on the remaining 80%. This was repeated five times, and the final result was computed as the average over the five test folds. This outer cross-testing guarantees a fair comparison of the different models.
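The outer cross-testing loop can be sketched as follows, with a synthetic stand-in for the featurized EXP data and a simple ridge model in place of MODNet.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 5))            # stand-in for featurized EXP samples
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=250)

# Outer five-fold cross-testing: each sample is held out exactly once,
# and the reported score is the average over the five test folds.
maes = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print(round(float(np.mean(maes)), 3))
```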

In this work, various multi-fidelity techniques were compared with the standard single-fidelity MODNet approach. The single-fidelity model was trained only on the experimental data, whereas the multi-fidelity techniques also took advantage of the available knowledge from DFT calculations. The different multi-fidelity techniques investigated in this work are described below.

Schematic of the different multi-fidelity methods.

In transfer learning, a MODNet model was first trained on the dataset formed by the PBE band gaps and subsequently fine-tuned on the experimental data.

Joint (multi-target) learning can improve the accuracy of the predictions compared to training a separate model for each target. In our case, instead of training on the EXP dataset only, we also used the corresponding PBE values as an additional target. Although technically nothing prevents it, we chose not to use the corresponding HSE values, as this would have considerably reduced the size of the training set: only a small fraction of the samples is common to all three datasets.
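Joint learning can be sketched as a single multi-output network predicting both fidelities at once; below, scikit-learn's multi-output `MLPRegressor` stands in for MODNet's multi-target architecture, on synthetic data mimicking correlated EXP and PBE gaps.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-ins: features plus two correlated targets, mimicking experimental
# and PBE band gaps for the same compounds.
X = rng.normal(size=(400, 10))
gap_exp = np.abs(X[:, 0] + X[:, 1])
gap_pbe = 0.6 * gap_exp + 0.05 * rng.normal(size=400)   # systematic underestimation

# Joint learning: a single network predicts both fidelities, sharing its
# hidden representation between the two related tasks.
Y = np.column_stack([gap_exp, gap_pbe])
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=400, random_state=0)
model.fit(X, Y)

pred_exp, pred_pbe = model.predict(X[:1])[0]
print(round(float(pred_exp), 2), round(float(pred_pbe), 2))
```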

The results obtained with the different approaches are summarized in the table below.

MAE on the band gap for the different multi-fidelity learning techniques. Each row gives the method, the data used, the number of training samples, the target fidelity, the MAE (in eV), and the relative change with respect to the corresponding single-fidelity baseline.

Single-Fidelity | EXP | 2480 | EXP | 0.382 | 0%
Single-Fidelity | EXP | 4604 | EXP | 0.366 | -4%
Transfer Learning | PBE | 2480 | EXP | 0.397 | +4%
Joint Learning | PBE | 2480 | EXP | 0.368 | -4%
Stacking Ensemble Learning | PBE | 2480 | EXP | 0.367 | -4%
Deep-Stacking Ensemble Learning | PBE | 2480 | EXP | 0.370 | -3%
PBE as a feature | PBE | 2480 | EXP | 0.371 | -3%
Correction Learning (PBE) | PBE | 2480 | EXP | 0.318 | -17%
Single-Fidelity | PBE | 325 | HSE | 0.582 | 0%
Correction Learning (PBE) | PBE | 325 | HSE | 0.442 | -24%
Correction Learning (HSE) | PBE | 325 | HSE | 0.402 | -31%
Single-Fidelity | EXP | 4604 | HSE | 0.438 | 0%
Correction Learning (PBE) | PBE | 2480 | HSE | 0.356 | -19%
Correction Learning (HSE) | HSE | 325 | HSE | 0.402 | -8%

Despite being based on compositions only, the MODNet model trained on the more populated EXP dataset (4604 samples) reaches an MAE of 0.366 eV, a 4% improvement over training on the 2480 structure-matched samples. Most of the investigated learning techniques do not overcome this threshold.

Based on these findings, we further investigated the correction-learning approach with HSE as the target fidelity, first on the restricted set of samples common to all datasets and then in more realistic scenarios. In the first case, the training set was limited to the 325 common samples, allowing a direct comparison of PBE- and HSE-based correction learning (see the table above).

However, in a real situation, such higher-fidelity data (here, the HSE band gaps) are scarcer than lower-fidelity ones. Therefore, we further benchmarked the method taking into account all available data. Three realistic scenarios were considered: (i) a composition-based MODNet model trained on the full experimental dataset (4604 compositions); (ii) correction learning based on PBE, using the 2480 samples with an associated structure; and (iii) correction learning based on HSE, using the 325 samples common to all datasets.
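The correction-learning scheme itself reduces to fitting the difference between fidelities. A minimal sketch with synthetic stand-in data follows; in a real application, the low-fidelity value at prediction time would come from a DFT calculation or from a model trained on the large PBE dataset.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-ins for featurized compounds with both a PBE value and an
# experimental measurement available.
X = rng.normal(size=(500, 10))
gap_exp = np.abs(X[:, 0] + X[:, 1])
gap_pbe = 0.6 * gap_exp + 0.05 * rng.normal(size=500)   # underestimated low fidelity

# Correction learning: fit a model on the difference between the
# high- and low-fidelity values, rather than on the high-fidelity
# value itself.
correction = MLPRegressor(hidden_layer_sizes=(64,), max_iter=400, random_state=0)
correction.fit(X, gap_exp - gap_pbe)

# At prediction time, the low-fidelity value is shifted by the
# predicted correction.
gap_pred = gap_pbe[:5] + correction.predict(X[:5])
print(np.round(gap_pred, 2))
```

Because the systematic part of the low-fidelity error is smooth, the difference is an easier target than the absolute band gap, which is the intuition behind the improvement reported above.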

Finally, it is interesting to note that the same most relevant features are shared among all models (regardless of the target fidelity or strategy, such as difference learning). This can be expected, as they all approximate the same physical property. They include the element and energy associated with the highest occupied and lowest unoccupied molecular orbitals (computed from the atomic orbitals) and various elemental statistics (such as atomic weight, column number, and electronegativity). The oxidation state also plays an important role in the prediction. The 20 most relevant features are all composition-based, with the sole exception of the spacegroup number.

In this work, we briefly presented an extension of the MODNet model consisting in a new hyperparameter optimization procedure based on a genetic algorithm. This approach was shown to be more effective and computationally less expensive than grid search. Thanks to this, MODNet outperforms the current leaders on 8 of the 13 Matbench tasks, making it a leading model for materials property prediction.

Furthermore, various techniques relying on multi-fidelity data were presented to improve band gap predictions. These techniques aim to take advantage of all the available data, from limited experimental datasets to large computational ones. Among the various methods investigated, the most promising results were obtained with correction learning, i.e., learning the difference between high- and low-fidelity values, which lowers the error on the experimental band gap by 17% compared to learning on the experimental data alone.

Multi-fidelity correction learning can be applied to various other materials properties, with hopefully similar improvements as obtained here. We therefore encourage and expect that it will be used in a wider context.

The authors acknowledge UCLouvain and the F.R.S.-FNRS for financial support. Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCLouvain) and the Consortium des Équipements de Calcul Intensif en Fédération Wallonie Bruxelles (CÉCI) funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under convention 2.5020.11 and by the Walloon Region.

Prepared the different multi-fidelity approaches: De Breuck PP

Implemented and ran the different models: Heymans G

Conceptualized and supervised the work: Rignanese GM

De Breuck PP and Heymans G contributed equally to this work.

All authors contributed to the analysis of the results and the writing of the manuscript.

All the Matbench datasets are available at https://matbench.materialsproject.org. The PBE dataset for the electronic band gap is in fact one of them; it can be downloaded from the following URL: https://ml.materialsproject.org/projects/matbench_mp_gap.json.gz. The three other datasets for the electronic band gap are provided as CSV files in the Supplementary Material. The MODNet model is available on the following GitHub repository: ppdebreuck/modnet^{[30]}.

Not applicable.

All authors declared that there are no conflicts of interest.

Not applicable.

Not applicable.

© The Author(s) 2022.

MODNet v0.1.9. Available from: https://github.com/ppdebreuck/modnet.