Generative models for inverse design of inorganic solid materials

Overwhelming evidence has been accumulating that materials informatics can provide a novel solution for materials discovery. While the conventional approach to innovation relies mainly on experimentation, the generative models stemming from the field of machine learning can realize the long-held dream of inverse design, where properties are mapped to the chemical structures. In this review, we introduce the general aspects of inverse materials design and provide a brief overview of two generative models, variational autoencoder and generative adversarial network, which can be utilized to generate and optimize inorganic solid materials according to their properties. Reversible representation schemes for generative models are compared between molecular and crystalline structures, and challenges in regard to the latter are also discussed. Finally, we summarize the recent application of generative models in the exploration of chemical space with compositional and configurational degrees of freedom, and potential future directions are speculatively outlined.


INTRODUCTION
Throughout history, material innovation has been at the center of industrial revolutions and has had an enormous impact on the economy. Before the 18th century, agriculture was at the core of human communities, and agricultural activities relied heavily on bronze and iron objects [Figure 1A]. After the invention of the steam engine, mechanization began to take place, accelerating the industrialization of material production and laying the foundation of modern society. During this time, technical progress in the steel manufacturing industry made a substantive contribution to the development of railway and steamship transportation. At the end of the 19th century, electricity entered the historical scene, while the successful synthesis of organic polymers served as another stimulus for the advancement of new technologies, such as automobiles and airplanes. Since then, the chemical industry has enjoyed an era of high-speed development. Later in the 20th century came the third industrial revolution, in which semiconductor materials played a vital role. The advent of computers and electronic technologies has revolutionized industrial production globally and given rise to unprecedented opportunities for innovation in all aspects of our lives. To date, the uptake of computer science in materials research and other realms has been strongly advocated because of its great potential. In silico materials design based on artificial intelligence and big data is becoming ever more feasible and realistic, and could constitute a new paradigm in the field of materials science [1][2][3][4][5][6][7][8].
The traditional materials discovery process is forward; that is, all the candidate materials that are expected to possess the desired properties will be directly synthesized and examined, until the most promising one is found. The whole process includes three steps: conjecture, synthesis, and test [ Figure 1B]. Empirical physical and chemical rules are first employed to acquire the list of potential materials for study, which are then experimentally synthesized and characterized. The measured properties of these materials will be compared with each other, through which new knowledge is generated to refine the empirical rules [9] . This trial-and-error process is labor-intensive and time-consuming. In this context, inverse design is an appealing strategy to close the loop, which can help guide the exploratory research of materials discovery and mitigate the need for one-by-one material examination in the chemical space [10] . By definition, inverse design in materials science means that, given target functionality, the compositions and structures of potential materials are stringently optimized prior to experiments. To this end, we can rely on the advances in artificial intelligence, which can enable the long-held dream of fast inverse design in an arbitrarily large chemical space, identifying materials with a high probability of expected properties at the expense of acceptable computational complexity.
Machine learning, one of the most popular families of algorithms in artificial intelligence, has already been leveraged to achieve inverse design of both molecules and crystals [8,11-21]. This is a multidisciplinary endeavor that demands efficient structure encoding schemes and robust data management techniques. Despite being computationally efficient, this approach requires a large number of material structures and properties for training, which generally constitutes a major obstacle to its practical application. However, the fast-paced developments in high-throughput ab initio calculations and the establishment of the corresponding databases have made it possible to build machine learning models for a variety of materials. Consequently, inverse design by machine learning has risen to prominence over conventional high-throughput ab initio calculations and now stands at the frontier of in silico materials design [10].
Generative models in machine learning are especially applicable to inverse design of materials. In this review, we focus on two generative models that are most widely used in the inverse design of inorganic solid materials: variational autoencoder (VAE) [22] and generative adversarial network (GAN) [23] . The frameworks for reversible coding of molecules and solid materials are elaborated and compared, with emphasis placed on the representation schemes for the generative models. Two directions for materials inverse design are specifically discussed: compositional and structural optimization with various levels of constraints. We further highlight the major challenges and limitations faced by the inverse design of inorganic solid materials and suggest some possible solutions as well as directions worth further exploration.

INVERSE DESIGN
Inverse design in materials science requires the navigation in chemical space through calculation and simulation, for which there emerge three categories of methodologies exploited to enable material identification: (1) high-throughput screening; (2) global optimization; and (3) generative models.
High-throughput screening via ab initio calculations has been widely adopted for inverse design due to the development of ab initio calculation codes and high-performance computing hardware in the past two decades [24][25][26] . Starting from a portfolio of materials chosen on the basis of intuition, density functional theory or Hartree-Fock calculations are carried out for each of the materials to obtain the "predicted properties". By sorting these materials according to the predicted properties, candidate materials can be readily found, thus obviating the need for synthesizing the whole set of chosen materials. The manual efforts spent on experiments can be greatly reduced. We would like to note that the process of high-throughput screening is akin to the forward design process, but the former relies only on computation and can be easily automated.
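As a minimal sketch of the screening loop described above, the following Python fragment ranks a candidate list by a predicted property and keeps the best entries. The function `screen`, the lookup table `toy_results`, and the formula strings are hypothetical placeholders; in practice the predictor would be an ab initio calculation rather than a dictionary lookup.

```python
def screen(candidates, predict_property, top_k=3):
    """Rank candidates by a predicted property (higher is better) and keep top_k."""
    scored = [(predict_property(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# Hypothetical lookup table standing in for computed properties.
toy_results = {"A2B": 1.2, "AB": 0.4, "AB2": 2.1, "A3B": 0.9}
best = screen(list(toy_results), toy_results.get, top_k=2)
```

Only the ranking step is shown; the cost of the approach lies entirely in evaluating `predict_property` for every candidate, which is why the choice of chemical space matters.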
One of the most critical issues in high-throughput screening is choosing an appropriate chemical space. If the space is too large, the combinatorial number of candidate materials makes the computational cost excessive [27]; if it is too small, researchers risk excluding candidates that might actually lead to the discovery of promising materials. Expertise is therefore of paramount importance in delimiting the chemical space.
Global optimization is more efficient in navigating the chemical space than high-throughput screening. Unlike the local optimization used in traditional ab initio calculations, global optimization algorithms consider the entire feasible set and search for the globally optimal solution among the candidate optimal structures in the chemical space. Representative global optimization algorithms include simulated annealing, particle swarm optimization, genetic algorithm, and simplex method. Among them, the genetic algorithm is a popular choice for determining the optimal solution in the field of materials science. We take the global optimization of clusters [28] as an example. The atomic coordinates ("genotype") of the clusters in the "population" are perturbed ("variation" or "hybridization"), and the cluster configurations that pass a geometrical evaluation (i.e., those with good "fitness") are preserved. Iterating this process generates new cluster populations until convergence is reached, at which point the final configuration is taken as the global optimal solution.
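The genetic-algorithm loop can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the method of Ref. [28]: the fitness is taken to be a Lennard-Jones energy in reduced units (the text above mentions only a geometrical evaluation), and all function names and parameters are hypothetical.

```python
import math
import random

def lj_energy(coords):
    """Lennard-Jones energy (reduced units) of a cluster given as a list of (x, y, z)."""
    e = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = max(math.dist(coords[i], coords[j]), 1e-3)  # guard against overlap
            e += 4.0 * (r ** -12 - r ** -6)
    return e

def random_cluster(n_atoms, box=2.0):
    return [tuple(random.uniform(0, box) for _ in range(3)) for _ in range(n_atoms)]

def mutate(coords, step=0.1):
    """'Variation': displace every coordinate by a small random amount."""
    return [tuple(x + random.uniform(-step, step) for x in atom) for atom in coords]

def crossover(a, b):
    """'Hybridization': first atoms from one parent, remaining atoms from the other."""
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

def genetic_search(n_atoms=4, pop_size=20, generations=200, seed=0):
    random.seed(seed)
    population = [random_cluster(n_atoms) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lj_energy)            # low energy = high fitness
        history.append(lj_energy(population[0]))
        survivors = population[: pop_size // 2]   # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            children.append(mutate(crossover(p1, p2)))
        population = survivors + children
    population.sort(key=lj_energy)
    history.append(lj_energy(population[0]))
    return population[0], history

best, history = genetic_search()
```

Because the fitter half of each generation is carried over unchanged, the best energy recorded in `history` never increases. The sketch gives no guarantee that the global minimum is reached within a fixed number of generations.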
While global optimization is quite successful for inverse design of materials, data-driven approaches, e.g., machine learning as mentioned above, could further push the frontier of this field [29] . Material databases, such as the Inorganic Crystal Structure Database (ICSD) [30] , the Open Quantum Materials Database [31] , and the Materials Project (MP) [32] database, have provided a wealth of material data, which can remarkably facilitate the development of data-driven materials discovery. Moreover, the combination of simulated data and machine learning may lead to the conceptualization of new chemical rules [33] , thus forming a novel perspective for materials design strategies.
Generative models in machine learning can effectively learn the underlying data distribution and are therefore suitable for inverse design. Unlike discriminative models, which compute the conditional probability of the target variable given the values of the observed variables, generative models describe the joint probability of all variables. Generative models can therefore be used to simulate the distribution of any variable of interest. In the tasks of image generation [34][35][36], text generation [37][38][39], video generation [40], and audio synthesis [41][42][43], generative models have achieved remarkable performance. As the most common generative models, VAE and GAN are widely used in the field of materials science and have been extensively validated.
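The distinction can be stated compactly. Writing x for the observed variables and y for the target variable (symbols chosen here for illustration), a discriminative model represents only the first line below, whereas a generative model represents the joint distribution in the second line, from which the conditional can be recovered by normalization:

```latex
\begin{align}
\text{discriminative:}\quad & p(y \mid x), \\
\text{generative:}\quad & p(x, y) = p(x \mid y)\, p(y), \\
\text{recovered conditional:}\quad & p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')} .
\end{align}
```

For a continuous target, the sum over y' becomes an integral.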
VAE is a deep generative model based on the autoencoder [Figure 2A]. An autoencoder can be applied to problems such as dimensionality reduction and feature extraction. Its basic framework first maps samples to hidden (latent) variables in a low-dimensional space through an encoder and then restores the hidden variables to reconstructed samples through a decoder. To make the decoder capable of generating new materials rather than merely learning a one-to-one mapping, the VAE encoder is constrained so that each material is mapped to a random variable that follows an explicitly defined multivariate normal distribution. The hidden variable Z is therefore described by a probability distribution associated with the material rather than by a single point.
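Two ingredients of this construction can be written down without a deep-learning library: drawing a latent sample from the encoder's normal distribution (the reparameterization commonly used to train VAEs) and the closed-form Kullback-Leibler penalty that pulls this distribution toward a standard normal prior. The sketch assumes a diagonal covariance parameterized by `log_var`; the function names and numerical values are illustrative only.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, math.log(0.25)]   # made-up encoder outputs
z = reparameterize(mu, log_var, rng)               # one latent sample
kl = kl_to_standard_normal(mu, log_var)            # regularization term
```

In a full VAE, this penalty is added to a reconstruction loss computed by passing `z` through the decoder; the decoder itself is omitted here.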
In contrast to VAE, the probability density function learned by GAN is implicit. GAN consists of a generator G and a discriminator D [Figure 2C]. The generator receives a random variable Z and produces fake samples G(Z). The discriminator receives both the real samples X and the fake samples G(Z) and outputs the probability that its input is a real sample. These outputs are fed back to the generator to guide its training. During GAN training, the generator and the discriminator alternately update their own parameters, each minimizing its own loss. With continued iterative optimization, training ideally approaches a Nash equilibrium, at which the model is optimal.
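For reference, the adversarial game is usually written as the minimax objective of the original GAN formulation, using the same symbols as above (X for real samples, Z for the random input, with p_data and p_Z denoting their distributions):

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{X \sim p_{\mathrm{data}}}\!\left[\log D(X)\right]
  + \mathbb{E}_{Z \sim p_{Z}}\!\left[\log\bigl(1 - D(G(Z))\bigr)\right]
```

The discriminator maximizes V by assigning high probability to real samples and low probability to generated ones, while the generator minimizes V by making G(Z) indistinguishable from X.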

REPRESENTATION
Despite the fruitful achievements in inverse design of molecules via generative models [20,44-46], their application in inorganic solid materials remains an outstanding challenge due to the difficulty in encoding the structures. In the following, we briefly outline some of the most typical structural representation schemes for generative models.
The wide application of generative models to organic molecules [46,47] rests on the excellent representation schemes available for molecules, which offer both reversibility and symmetry invariance. Reversibility means that the encoded representation is mapped bijectively to real molecules, while symmetry invariance means that the representations obtained after rotation, translation, and permutation are identified as the same molecule as before these operations. Simplified molecular input line entry system (SMILES) strings [48,49] and molecular graphs [50] are among the most renowned representation schemes [Figure 3A].
SMILES is a sequence-based text representation. Its power lies in the uniqueness of SMILES representation, namely, the standard SMILES representation can ensure that each chemical molecule has only one SMILES string, and it is a real language structure rather than just a computer data structure, which offers a natural advantage in using machine learning language models. For example, recurrent neural networks containing long short-term memory have been trained as generative models for molecules [45,55] , which can generate valid SMILES strings with high accuracy. Gómez-Bombarelli et al. [19] reported a VAE model using SMILES to build multidimensional continuous molecule representation. Adversarial networks using SMILES representation are also investigated [56,57] . Because of its unique reversible mapping and clear meaning, SMILES has been most frequently applied in inverse molecular design [20,58,59] .
Figure 3. [...] [51]. (C) Matrix representation for Mg-Mn-O ternary materials. This figure is quoted with permission from Kim et al. [52]. (D) Voxel grid representation of crystal with information from CIF file. This figure is quoted with permission from Hoffmann et al. [53]. (E) Image representation of atomic position by Gaussian kernel distribution. This figure is quoted with permission from Noh et al. [54].
In the molecular graph G = (V, E), atoms are represented as nodes v_i ∈ V and chemical bonds as edges (v_i, v_j) ∈ E. Nodes and edges are assigned labels according to the types of atoms and chemical bonds. Many attempts have been made to construct generative models with the graph representation of molecules. Li et al. [60] proposed a new de novo molecular design framework based on a type of sequential graph generator, which demonstrates good accuracy. Novel generative models based on graph neural networks have also been reported, capable of representing synthetic graphs with certain topological properties and molecules [61]. It is worth noting that the training set appears to affect the way the model generates the molecular graph. The visualization of the molecular generation processes indicates that a model trained with canonically ordered graphs prefers to generate the molecular graph node by node, while a model trained with randomly ordered graphs prefers to generate the graph piece by piece. To avoid the difficulty of optimizing gradients on discrete graph structures, GraphVAE can be employed to directly output a probabilistic fully connected graph of a predefined maximum size at once. The model has been successfully applied to the generation of small molecular graphs [62].
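The labeled-graph picture and the permutation invariance required of it can be made concrete with a small sketch. The example encodes the heavy-atom skeleton of ethanol; the dictionary layout and the helper names `relabel` and `signature` are illustrative choices, not part of any cited framework.

```python
from collections import Counter

# Heavy-atom graph of ethanol: nodes labelled by element, edges by bond order.
nodes = {0: "C", 1: "C", 2: "O"}
edges = {(0, 1): 1, (1, 2): 1}

def relabel(nodes, edges, perm):
    """Apply an atom-index permutation; the molecule itself is unchanged."""
    new_nodes = {perm[i]: el for i, el in nodes.items()}
    new_edges = {tuple(sorted((perm[i], perm[j]))): order
                 for (i, j), order in edges.items()}
    return new_nodes, new_edges

def signature(nodes, edges):
    """Permutation-invariant summary: multiset of (element, element, bond order)."""
    return Counter(tuple(sorted((nodes[i], nodes[j]))) + (order,)
                   for (i, j), order in edges.items())

perm = {0: 2, 1: 0, 2: 1}
nodes2, edges2 = relabel(nodes, edges, perm)
```

After relabeling, the raw edge dictionaries differ while the signatures agree. Note that this signature is only a necessary condition for two graphs to be identical; it is not a full graph-isomorphism test.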
Reversibility and symmetry invariance for the above representations of molecules come from the definite identification of chemical bonds, which determines the number of atomic connection (saturability) and orbital overlap direction (directivity). However, for inorganic solid materials, the chemical bonding is much more complicated. Therefore, saturability and directivity are often unsatisfied, leading to the failure of structure representation solely in terms of the connection between atoms. Although the crystal representation based on graph theory has been developed [63] , it is not feasible to reconstruct the crystal structure from the graph, which restricts its application in generative models.
Periodicity is another critical issue for crystal representations. Properties of crystals depend not only on the atomic arrangement in the structure unit, but also on the periodic repetition of the unit in the space. This would raise a new problem that is difficult to deal with: there are various choices for selecting the cell when representing one crystal structure and different choices would lead to large or even unacceptable errors in encoding crystals with similar or identical structures. Such a question is still not fully resolved in recent inverse design studies.
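The ambiguity in cell choice is easy to reproduce. In the sketch below, a two-atom cubic cell is described in fractional coordinates and the origin is then shifted; the coordinate lists differ although the crystal is the same, whereas the interatomic separation under periodic boundary conditions is unchanged. The cell contents and function names are made up for illustration.

```python
def wrap(frac):
    """Map fractional coordinates back into the [0, 1) cell."""
    return tuple(x % 1.0 for x in frac)

def shift_origin(frac_coords, shift):
    """Describe the same crystal with a different choice of cell origin."""
    return [wrap(tuple(x + s for x, s in zip(atom, shift))) for atom in frac_coords]

def min_image_delta(a, b):
    """Fractional separation a - b under periodic boundaries, each component in [-0.5, 0.5)."""
    return tuple(((x - y + 0.5) % 1.0) - 0.5 for x, y in zip(a, b))

original = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
shifted = shift_origin(original, (0.25, 0.25, 0.25))
```

A representation built directly on `original` and `shifted` would treat them as two different inputs, which is the encoding error discussed above. Supercells and alternative lattice vectors introduce the same kind of ambiguity.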
Various general descriptors for solid materials based on specific tasks have been developed, and several studies have summarized these descriptors [5,64-67]. For conventional machine learning tasks, a good representation should not only satisfy the key criteria of uniqueness, universality, and efficiency [68], but also be explicitly tailored to specific subfields (e.g., batteries [64,69], catalysts [70], or photovoltaics [71]). For the tasks of inverse design via generative models, however, the reversibility, symmetry invariance, and periodicity mentioned above are also indispensable, although it is quite difficult to fulfill all criteria at the same time. Here, we list some of the representative structure encoding schemes that have been used for inverse design of inorganic solid materials.
Bag-of-atoms representation is one of the efforts to encode inorganic crystals [ Figure 3B] [51] , which considers a single structure and is only designed for optimization of compositions. Some generative models use the lattice vector and atomic position matrix in the crystallographic information files (CIFs) for structure representation [ Figure 3C]. For example, Nouira et al. [72] trained a GAN model to generate novel ternary metal hydrides from observed binary structures. Ren et al. [73] embedded solid-state physics knowledge to construct descriptors which combine both the real space and the momentum space, and display lower error in predicting the formation energy and bandgap. The real space matrix was primarily constructed by the lattice vectors and the atomic coordinates, while the momentum space matrix involved the representative crystal planes that describe symmetry and periodicity. Other works try to convert the atomic position matrix into the density matrix [53,74,75] [ Figure 3D], where the locations of atoms can be reconstructed by augmenting the training dataset via rotating and expanding the single unit cell. In their work, an error of 0.5 Å in atomic position between the predicted and real structures was guaranteed for nearly 99% of the atoms. Another approach is to let the neural network learn the encoding requirements by itself. Noh et al. [54] developed a 3D image representation [ Figure 3E], which was similar in essence to the density matrix representation but relies on Gaussian kernel distribution of atoms. We note that data augmentation is applicable to solve the problem of symmetry invariance to a certain extent, yet it would significantly increase the computational burden. Efficient representation of inorganic solid materials for generative models is still a direction that requires more exploration.
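A Gaussian-kernel image of the kind mentioned above can be sketched as follows. The code assumes a cubic cell in fractional units, a coarse grid, and a crude decoding step that simply picks the brightest voxel; these choices, like the function names, are illustrative and are not taken from Refs. [53] or [54].

```python
import math

def voxelize(frac_coords, grid=8, sigma=0.1):
    """Return a grid x grid x grid nested list of Gaussian-smeared atomic density."""
    image = [[[0.0] * grid for _ in range(grid)] for _ in range(grid)]
    for i in range(grid):
        for j in range(grid):
            for k in range(grid):
                point = (i / grid, j / grid, k / grid)
                for atom in frac_coords:
                    d2 = 0.0
                    for p, a in zip(point, atom):
                        d = (p - a + 0.5) % 1.0 - 0.5   # periodic minimum image
                        d2 += d * d
                    image[i][j][k] += math.exp(-d2 / (2 * sigma ** 2))
    return image

def brightest_voxel(image):
    """Crude decoding: index of the voxel with the highest density."""
    grid = len(image)
    return max(((i, j, k) for i in range(grid) for j in range(grid) for k in range(grid)),
               key=lambda idx: image[idx[0]][idx[1]][idx[2]])

image = voxelize([(0.5, 0.5, 0.5)])
peak = brightest_voxel(image)
```

The grid spacing bounds how precisely atomic positions can be recovered, which is one reason reconstruction errors are reported in such schemes. Encoding several atoms or element types would require one channel per species and a proper peak-finding step.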

TASK OF INVERSE DESIGN
Current research of inverse design in materials science is mainly centered on two topics: compositional optimization and structure prediction. These correspond to the exploration of chemical space by focusing on the compositional and configurational degrees of freedom.
A prohibitive computational cost would be expected if we tried to explore the whole compositional space. For a four-element compound, a survey of the first 103 elements of the periodic table would result in a total on the order of 10^12 different compounds through permutation and combination. The number would be reduced to the order of 10^10 if charge neutrality and electronegativity balance are taken into consideration [76]. In this case, generative models in machine learning can provide an affordable means to navigate the compositional space by implicitly learning the underlying chemical rules. Dan et al. [77] trained a GAN model with materials from the ICSD database and used it to generate two million chemical compositions. Overall, 84.5% of the generated materials are charge-neutral and electronegativity-balanced, even though no chemical rules were explicitly enforced. It is worth noting that inverse design of compositions is not restricted to representation schemes that encode the atomic structures; that is, features not containing any structural information could also be utilized as input. On the other hand, Nguyen et al. [78] developed a hybrid generative-discriminative model with partial phase diagrams as the feature, which holds great promise for speeding up the compositional design of aluminum alloys by over 100,000 times.
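The charge-neutrality filter can be illustrated with a short sketch. The oxidation-state table below is a deliberately abbreviated, hypothetical stand-in, and the function name is our own; the sketch also counts only the unordered choices of four elements, not the stoichiometries that multiply this number toward the totals quoted above.

```python
from itertools import product
from math import comb

# Unordered choices of four distinct elements out of 103 (before stoichiometry).
n_element_sets = comb(103, 4)

# Hypothetical, abbreviated oxidation-state table for illustration only.
OXIDATION_STATES = {"Li": [1], "Mg": [2], "Mn": [2, 3, 4], "O": [-2], "F": [-1]}

def is_charge_neutral(composition):
    """composition maps element -> integer count. Return True if some assignment
    of one oxidation state per element gives zero total charge."""
    elements = list(composition)
    for states in product(*(OXIDATION_STATES[el] for el in elements)):
        if sum(q * composition[el] for q, el in zip(states, elements)) == 0:
            return True
    return False

neutral = is_charge_neutral({"Mg": 1, "Mn": 2, "O": 4})   # Mg(+2), 2 Mn(+3), 4 O(-2)
charged = is_charge_neutral({"Li": 1, "O": 1})            # +1 - 2 cannot balance
```

A generative model that has implicitly learned such rules would produce mostly compositions passing this check, which is how the 84.5% figure above can be interpreted. An electronegativity-balance test would be a separate, additional filter.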
As compared to compositional degrees of freedom in materials, the problem with configurational degrees of freedom is much more complex. Many challenges remain to be addressed for inverse design of crystal structures. Learning from the atomic positions in the training dataset, the generative models can estimate the probability of a specific atom occupying a certain position in the space. We highlight one such example, the image-based materials generator (iMatGen) framework designed by Noh et al. [54] The iMatGen first encodes the vanadium oxide CIFs into a 3D image and uses continuous representations as well as formation energies as inputs to the VAE for the construction of latent space for V-O binary compounds. By decoding the sampled points back to crystal structures, 26 out of the 31 previously known vanadium oxides were rediscovered, and over 40 entirely new V-O binary compounds with relatively high stability were generated.
It is generally anticipated that, for large spatial freedom, a huge training set would be needed to guarantee that the model can effectively learn the distribution laws of atoms in space. Court et al. [74] trained a VAE and UNet on 78,750 ternary perovskites, binary alloys, and Heusler compounds. The average mean absolute error (MAE) on validation dataset for the lattice parameters is as low as 0.06 Å, and the average earth-mover distance of atomic coordinates between encoding and decoding is 0.09 Å. In another work, 24,785 ternary compounds were screened from the MP database for VAE training [73] . The MAE for the atomic sites is 0.001, and the percentage mean absolute errors for the length of lattice constant and the lattice angle are 7.41% and 3.99%, respectively. After careful calculations for 27 predicted crystals, two of them exhibited power factors comparable to the best thermoelectric materials.
Given that crystals composed of the same elements often have similar structures, it is generally more feasible to train models with elemental constraints, which could reduce the reliance on large dataset. For metal organic frameworks [79] , zeolites [75] , and two-dimensional graphene/h-BN hybrids [80] , the constituent elements are rather restricted, while there are a large amount of data for use in training the generative models. Nevertheless, a situation of high elemental diversity and low structural diversity is often encountered for inorganic materials datasets, which underlines the crucial role of element substitution [21,52,54] , few-shot learning [73,81] , and transfer learning [82,83] in tackling the inverse design tasks.
The inclusion of properties during inverse design is still challenging, whether for the exploration of composition or structure. Generally, the constraints of the networks can only ensure that GAN and VAE reconstruct materials whose composition and structure are close to those of the real materials. For an extremely large chemical space, approaches based on the basic GAN or VAE are uncontrollable, and the predicted results could be meaningless. To solve this problem, conditional generative models, such as conditional VAE [84] and conditional GAN [85], can be leveraged to generate materials with desired properties [Figure 2B and D]. Specifically, the conditional generative models are trained on samples from the real joint distribution P(x, y) of the material representation x (unit cell parameters, atomic positions, etc.) and the properties y (formation energy, band gap, etc.). A distribution P'(x, y) is learned such that it resembles P(x, y) as closely as possible. Then, x can be sampled from the learned conditional probability distribution P'(x|y) given the expected properties y.
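In the notation of this paragraph, the conditional distribution used for generation follows from the learned joint distribution by normalization over x (first line). For a conditional VAE, training is commonly phrased as maximizing a lower bound on the conditional log-likelihood (second line), where the latent variable z, the encoder q_φ, and the decoder p_θ are symbols introduced here for illustration:

```latex
\begin{align}
P'(x \mid y) &= \frac{P'(x, y)}{\int P'(x', y)\, \mathrm{d}x'}, \\
\log p_{\theta}(x \mid y) &\ge
  \mathbb{E}_{q_{\phi}(z \mid x, y)}\!\left[\log p_{\theta}(x \mid z, y)\right]
  - \mathrm{KL}\!\left(q_{\phi}(z \mid x, y) \,\|\, p_{\theta}(z \mid y)\right).
\end{align}
```

At generation time, the target property y is fixed, z is drawn from the prior, and the decoder produces a material representation x.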
The conditional deep-feature-consistent VAE was constructed recently, which employed electron-density maps as input and the formation energy per atom as the condition [74]. A clustering effect related to the formation energy was observed in the latent space. To perform both regression and conditional tasks in GAN, Dong et al. [80] developed an improved conditional GAN (CGAN) called regressional and conditional GAN, which contains a regression convolutional neural network between the generator and the discriminator to predict the bandgap as well as to output the latent features from material structures [Figure 4A]. In addition to the use of constraints in the generative model, Pathak et al. [51] suggested that additional predictive networks can be used to further filter materials with specific properties. They proposed a deep learning based inorganic material generator architecture consisting of a generator and a predictor [Figure 4B]. Long et al. [21] further integrated a constraint network into GAN as a back propagator, without embedding properties in the input, to realize automated optimization of the generator and predictor. A constrained crystals deep convolutional GAN was thus established [Figure 4C], which is more efficient than traditional GAN in generating stable structures.
For GAN frameworks, in addition to the difficulty of material representation mentioned above, another critical issue is that the training process can be very difficult to converge. Vanishing gradients [86] and mode collapse [87] could seriously hinder the application of GAN in material generation. Vanishing gradients arise when, during training by back propagation, the gradient that reaches the shallow layers of the network is too small to change their parameters appreciably, which eventually leads to slow convergence or even failure to converge. In particular, an accurate discriminator is more likely to aggravate the vanishing-gradient issue [88]. Mode collapse refers to the situation in which GAN cannot generate diversified materials; that is, only materials that strongly resemble or even belong to the real samples can be derived from the model. This issue is a great handicap to data augmentation tasks. The main reason for vanishing gradients and mode collapse in GAN training is that the Jensen-Shannon (JS) divergence is used to measure the distance between two distributions. One solution is to adopt Wasserstein GAN [89], which relies on the Wasserstein distance instead of the JS divergence and performs better when there is negligible overlap between the two distributions. There are also variants such as Laplacian pyramid of GAN [90], boundary equilibrium GAN [91], and BigGAN [92], which differ in the cost function. Although the pros and cons of these models in image generation have been frequently discussed, the influence of cost functions remains an open and interesting topic in material generation.
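The difference between the two distance measures is visible already in a toy case. The sketch below compares a "real" and a "generated" distribution that are both point masses on a one-dimensional grid: the JS divergence stays at log 2 however far apart they are, so it gives no indication of which direction to move, whereas the Wasserstein (earth-mover) distance grows with the separation. The grid, the function names, and the point-mass setup are illustrative assumptions.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence for discrete distributions (terms with p_i = 0 vanish)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Earth-mover distance between histograms on the unit-spaced grid 0, 1, 2, ..."""
    total, cum = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi          # mass that still has to be carried to the right
        total += abs(cum)
    return total

def point_mass(position, size=6):
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0)
for shift in (1, 3, 5):
    fake = point_mass(shift)
    print(shift, round(js_divergence(real, fake), 4), wasserstein_1d(real, fake))
```

The loop prints a constant JS value for every shift together with a Wasserstein distance equal to the shift, which illustrates why the latter provides a more useful training signal when the two distributions barely overlap.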
In the latent space of VAE, materials are represented as continuous and differentiable vectors; generating materials with a target property, however, is a separate matter that depends on how this space is explored. To explore the unknown area in the latent space, a routine sampling strategy is to decode random vectors near the known materials in the latent space. Evidently, the shape of the latent space and the sampling strategy will significantly affect the generation results. To enhance the sampling efficiency, iMatGen applied a formation energy classification to the latent vectors. The crystals with formation energy lower than 0.5 eV/atom were regarded as stable materials [54], resulting in the separation of the latent space into two regions [Figure 4D]. By executing the sampling strategy only in the stable region, the formation energy of the generated vanadium oxides can be effectively constrained. Another effective approach to impose constraints when navigating the latent space would be to train neural networks for property prediction from the latent vector [19], after which a gradient-based optimization algorithm could be applied in the latent space.
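Gradient-based navigation of the latent space can be sketched with a toy surrogate. Here `predicted_property` is a hypothetical stand-in for a trained property-prediction network, chosen as a simple quadratic so that its optimum is known; the gradient is estimated by finite differences rather than by automatic differentiation.

```python
def predicted_property(z):
    """Hypothetical surrogate f(z) with its maximum at z = (1.0, -0.5)."""
    return -((z[0] - 1.0) ** 2 + (z[1] + 0.5) ** 2)

def numerical_gradient(f, z, h=1e-5):
    """Central finite-difference estimate of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def ascend(f, z0, lr=0.1, steps=100):
    """Plain gradient ascent in the latent space."""
    z = list(z0)
    for _ in range(steps):
        g = numerical_gradient(f, z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z

z_start = [0.0, 0.0]                      # latent vector of a known material (made up)
z_opt = ascend(predicted_property, z_start)
```

In an actual workflow, `z_opt` would be passed through the decoder to obtain a candidate structure, and the prediction would still need to be verified, since the surrogate may be unreliable far from the training data.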
Another noteworthy issue is the visualization of the latent space in VAE. In a standard visualization process, the encoder encodes high-dimensional material descriptors into low-dimensional latent vectors, and a dimensionality reduction algorithm is then applied to map these vectors to two-dimensional space for visualization. Common dimensionality reduction algorithms include principal component analysis [93], multidimensional scaling [94], t-distributed stochastic neighbor embedding [95], sketch-map [96], etc. As the input of the dimensionality reduction algorithm is a latent vector without manual selection of features, we are incapable of predicting the final performance of dimensionality reduction in advance, and hence the choice of algorithm generally depends on the characteristics of the dataset.
Figure 4. [...] [81]. (B) Deep learning based inorganic material generator (DING) framework consisting of a generator module and a predictor module. This figure is quoted with permission from Pathak et al. [51]. (C) Constrained crystals deep convolutional GAN (CCDCGAN) framework. This figure is quoted with permission from Long et al. [21]. (D) Materials generator module in iMatGen and the visualization of the latent space. This figure is quoted with permission from Noh et al. [54].
The most critical step in forward design is experimental synthesis. In a recent study, a machine learning model was constructed to quantify the probability of synthesis for virtual materials [97] , in which an 87.5% true positive prediction accuracy was attained. Actually, even if we know the synthesizability of the materials, it is still hoped that the condition for material synthesis can also be predicted. As far as we know, there is no generative model available in this research direction, whereas most of the previous studies have been done using optimization methods [98][99][100] . One possible challenge is to obtain sufficient experimental data with small error. The establishment of generative models to predict the experimental condition for material synthesis is still an open challenge.

OUTLOOK
Inverse design is an important path for rapid materials discovery in the future. With the development of high-throughput computation and material databases, data-driven generative models promise to be powerful tools for inverse materials design. Although generative models have begun to show promise in the field of inverse design for organic molecules, their application in inorganic solid materials is still in the infancy period. Here, we proceed by listing some of the critical issues that require consideration.
First, the performance of the generative model is dictated by the size and quality of the training set. Although high-throughput ab initio calculations can be applied for the generation of a portfolio of materials with element substitution, it is notoriously time-consuming when complicated structures are involved. We note that few-shot learning and transfer learning may improve the performance of inverse design of inorganic solid materials, but they are still far from satisfactory. For transfer learning, the inconsistent distribution of atomic positions in different types of crystals may be the key problem to be solved.
Secondly, the representation and encoding of inorganic solid materials could influence the sampling in latent space and the generation results. Periodicity is an important characteristic that distinguishes inorganic solid materials from molecules. Relying solely on atomic coordinates and unit cell parameters as inputs of the material generation model is therefore not adequate. Although such inputs meet the requirement of reversibility, they provide neither symmetry invariance nor a unique treatment of periodicity, in which case two cells representing the same crystal can be distinct from each other in latent space. This could jeopardize the predictive power of the model. To solve this problem, knowledge from mathematics and solid-state physics is indispensable [73].
Finally, generative networks themselves are an issue worth exploring. Currently, almost all results obtained from generation models are not directly usable and require post-processing, such as atomic trade-offs [54,75] and ab initio structural optimizations [52] . It is worth tackling how networks can be trained to produce materials that are usable with minimal optimization. In addition, reinforcement learning has been exercised in molecular generation [20,56,101] . It would be fruitful to evaluate its application in the generative models for inverse design of inorganic solid materials.

Authors' contributions
Made substantial contributions to conception and design of the study and performed data analysis and interpretation: Chen LT, Li SN, Pan F
Performed data acquisition: Chen LT, Zhang WT, Nie ZW
Performed administrative support: Pan F

Availability of data and materials
Not applicable.

Financial support and sponsorship
This work was financially supported by the National Key R&D Program of China (2016YFB0700600, 2020YFB0704500), the Shenzhen Science and Technology Research Grant (No. JCYJ20200109140416788), and the Chemistry and Chemical Engineering Guangdong Laboratory (Grant No. 1922018).