^{1}China-UK Low Carbon College, Shanghai Jiao Tong University, Shanghai, 201306, China.

^{2}School of Material Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China.

^{3}Materials Genome Initiative Center, Shanghai Jiao Tong University, Shanghai 200240, China.

^{4}Sino-Precious Metals Holding Co., Ltd., Kunming 650106, Yunnan, China.

Materials genome engineering databases represent fundamental infrastructures for data-driven materials design, in which the data resources should satisfy the FAIR (Findable, Accessible, Interoperable and Reusable) principles. However, a variety of challenges, such as data standardization, veracity and longevity, still impede the progress of data-driven materials science, including both high-throughput experiments and simulations. In this work, we propose a metadata schema for lattice thermal conductivity from first-principles calculations. The calculation workflow for lattice thermal conductivity includes structural optimization, the calculation of interatomic force constants and the calculation of lattice thermal conductivity. The data generated in these three steps correspond to the virtual sample information, virtual source data and processed data, respectively, as specified in the

To keep pace with the growing demands of materials science and industry, scientists and engineers hope to design novel functional materials on demand at low cost and within a short period. With great improvements in data generation efficiency by employing both high-throughput experimental and computational tools, there has been an explosion of materials databases. By combining these materials databases with data science and artificial intelligence, materials research has transformed from trial-and-error approaches to the data-driven paradigm^{[1-3]}. Data-driven materials research has been considered as the fourth paradigm of materials research in addition to traditional theoretical, experimental and simulation approaches.

Benefiting from the various open or commercial materials simulation tools, computational materials science is developing vigorously in the design of functional materials and the search for new materials through high-throughput screening and computation^{[4-7]}. In particular, calculation tools based on density functional theory (DFT) enable researchers in materials science, physics and chemistry to understand the electronic structure of many-body systems, including atoms, molecules and condensed phases^{[8,9]}. The full inputs and outputs of a single DFT calculation may exceed ten or even hundreds of megabytes. However, only a very small amount of data, subjectively selected from the output files, is generally presented in published figures or tables. Typically, the small subsets of data published in a paper are those directly relevant to its specific topic, so most of the data produced by high-throughput approaches remain stored on the local workstations of researchers. Moreover, publications only present basic calculation details, such as the type of code, exchange-correlation functional, pseudopotential, plane-wave cutoff energy,

In recent years, various materials databases have been well developed. The Novel Materials Discovery (NOMAD) Repository^{[10]} has been built to satisfy the increasing demand for storing and sharing materials science data. It offers the codes used in computational materials science and contains all the original inputs and outputs of data in the Materials Project database^{[11]}, Open Quantum Materials Database (OQMD)^{[12]} and Automatic FLOW (AFLOW)^{[13]}. However, the differences in terminology and representation inevitably lead to the data being heterogeneous and difficult for data analytics in data-driven modes^{[14]}. As a result, the quality, consistency and comprehensiveness of data should be further improved to simplify data sharing, reduce the cost and increase the speed of the exchange of scientific information among researchers. Metadata provides information that helps establish relationships between data items and is defined by the National Information Standards Organization (NISO) as “the information we create, store and share to describe things, allows us to interact with these things to obtain the knowledge we need”. A metadata schema is a high-level document that establishes a common method for structuring and understanding data, and it includes the principles and implementation issues for utilizing the schema. Naturally, such metadata could be made into a standard when a consensus is reached in the professional community.

The FAIR (Findable, Accessible, Interoperable and Reusable) principles have been proposed to guide the optimal sharing and reuse of data^{[15]} and thus serve as the guiding principles for scientific data. In general, a FAIR data infrastructure requires a detailed description of how the data were obtained, addressing metadata, ontologies and workflows^{[16]}. However, only the researchers performing the experiments or calculations have the knowledge to provide this detailed critical information.

At present, microbiome metadata standards have been developed and reported successively by the microbiome research community in an effort to make microbiome data truly FAIR^{[17]}. In medicine^{[18]} and low-carbon energy research^{[19]}, challenges remain even though data stakeholders are collaborating to advance FAIR metadata schemas in their respective fields. Likewise, developing a metadata schema is essential for the widespread adoption of the data-driven model in materials science. However, materials science currently lacks such a model^{[20]}.

For a long time, materials data have lacked unified management rules. Data from different sources vary significantly in content and format, so data quality cannot be guaranteed. At the same time, the data are not interoperable, which makes data integration and analysis extremely difficult. These challenges significantly hinder the combined use of materials genome engineering data and the construction of a data-driven ecosystem. Standards organizations, such as the International Organization for Standardization [Available from: ^{[16]}. The ^{[21] }(i.e., the

As shown in ^{[22]}. Under the guidance of the

Category and content of materials data following the ^{[21]}.

Thermal conductivity is a fundamental transport property that indicates the thermal transport ability of a material. Heat in a solid is mainly carried by electrons and atomic vibrations. Electronic thermal conductivity (κ_{e}) is directly related to electrical conductivity (σ) through the Wiedemann-Franz law (κ_{e} = LσT, where L is the Lorenz number and T is the absolute temperature)^{[23,24]}.

In this work, we propose a complete metadata schema for lattice thermal conductivity from first-principles calculations. From the top five DFT codes (Gaussian^{[25]}, VASP^{[26]}, QUANTUM ESPRESSO^{[27]}, CASTEP^{[28]} and ORCA^{[29]}) ranked by the number of citations [Available from: ^{[30]} and Thirdorder^{[31] }codes. Many packages, including ALAMODE^{[32]}, almaBTE^{[33]}, phono3py^{[34]} and ShengBTE^{[35]}, can predict phonon thermal conductivity using force constants from DFT calculations. The open-source package, ShengBTE, is used to calculate the final lattice thermal conductivity based on the iterative solution to the PBTE in this work. Following the

Workflow for lattice thermal conductivity calculations from first-principles calculations. Orange and blue boxes represent the steps of the calculations and the results of each step, respectively. The obtained data corresponding to the steps are displayed in the right panel.

By solving the many-body Schrödinger equation or the Kohn-Sham equation, first-principles calculations enable us to understand the electronic structure of crystals and their derived physical or chemical properties at the atomic level. VASP performs ab initio quantum mechanical calculations using either Vanderbilt pseudopotentials or the projector augmented wave (PAW) method with a plane-wave basis set. In general, structural optimization is the necessary first step of first-principles calculations, yielding a fully relaxed crystal structure under a given convergence criterion and set of calculation parameters. It provides a relatively stable input structure for the subsequent calculations of crystal properties. Although the calculation parameter settings in different cases are subjective, the workflow and the files needed to run VASP are deterministic, thereby providing good conditions for the development of a metadata schema.
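Because the set of files needed to run VASP is fixed, a completeness check on a run directory is straightforward to automate before any metadata are recorded. The following is a minimal sketch, assuming the four standard VASP input files (INCAR, POSCAR, KPOINTS and POTCAR); the function names and record keys are hypothetical and are not the element names of the proposed schema.

```python
from pathlib import Path

# The four standard VASP input files. The record keys used below are
# hypothetical and are not the element names of the proposed schema.
VASP_INPUT_FILES = ("INCAR", "POSCAR", "KPOINTS", "POTCAR")


def missing_vasp_inputs(present_names):
    """Return the standard VASP input files absent from `present_names`."""
    present = set(present_names)
    return [name for name in VASP_INPUT_FILES if name not in present]


def describe_run(run_dir):
    """Build a minimal, illustrative record for one VASP run directory."""
    run_path = Path(run_dir)
    present = [p.name for p in run_path.iterdir() if p.is_file()]
    return {
        "run_directory": str(run_path),
        "input_files": [n for n in VASP_INPUT_FILES if n in present],
        "missing_inputs": missing_vasp_inputs(present),
    }
```

A record with a non-empty `missing_inputs` list could be rejected before it enters a database, which is one simple way to keep incomplete calculations out of a curated dataset.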

We formulate the basic framework of the metadata schema and summarize the necessary elements for structural optimization via VASP, as shown in

(A) Workflow for structural optimization. The metadata element and its number of adjustments in each step are listed. (B) Metadata structure for structural optimization.

The computational workflow, as well as the metadata elements, is shown in

Each metadata subset is individual and contains the complete metadata element. The relationship between metadata subset, entity and element is constructed by the Unified Modeling Language (UML), a general-purpose and developmental modeling language in the field of software engineering that is intended to provide a standard method to visualize the design of a system. As shown in

Metadata schema of structural optimization constructed by UML.

The metadata schema defines the description convention from both semantics and syntax. The six attributes, namely, Name, Definition, Data type, Range, Restriction and Maximum number of occurrences, are given in the data dictionary in
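The six attributes can be pictured as the fields of one data dictionary entry. The sketch below is illustrative only: the class, the field names, the "mandatory"/"optional" vocabulary for Restriction and the example entry are assumptions made for this example, not definitions taken from the schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DictionaryEntry:
    """One data dictionary entry carrying the six attributes."""
    name: str
    definition: str
    data_type: type
    value_range: Optional[Tuple[float, float]]  # inclusive (min, max), or None
    restriction: str  # "mandatory" or "optional" (assumed vocabulary)
    max_occurrences: int


def validate(entry, values):
    """Return a list of problems found when checking `values` against `entry`."""
    problems = []
    if entry.restriction == "mandatory" and len(values) == 0:
        problems.append(f"{entry.name}: mandatory element is missing")
    if len(values) > entry.max_occurrences:
        problems.append(
            f"{entry.name}: occurs {len(values)} times, "
            f"maximum is {entry.max_occurrences}"
        )
    for value in values:
        if not isinstance(value, entry.data_type):
            problems.append(f"{entry.name}: {value!r} has the wrong data type")
        elif entry.value_range is not None:
            low, high = entry.value_range
            if not (low <= value <= high):
                problems.append(f"{entry.name}: {value!r} is out of range")
    return problems


# Hypothetical entry; the name, range and restriction are illustrative only.
cutoff_energy = DictionaryEntry(
    name="cutoff_energy",
    definition="Plane-wave cutoff energy in eV",
    data_type=float,
    value_range=(0.0, 2000.0),
    restriction="mandatory",
    max_occurrences=1,
)
```

Calling `validate(cutoff_energy, [520.0])` returns an empty list, whereas an empty value list, a repeated value or an out-of-range value each produce one reported problem.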

After obtaining the relatively stable crystal structure, we move on to the calculation of the force constants. FC2 and FC3 are the second- and third-order interatomic force constants, i.e., the second- and third-order derivatives of the potential energy with respect to atomic displacements, which appear as coefficients in the Taylor expansion of the potential energy around the equilibrium structure. FC2 is used to perform lattice dynamics to derive the harmonic phonon properties, including phonon dispersions, group velocity

For the metadata schema of force constant calculations via VASP, we formulate the workflow in

(A) Workflow of force constant calculation. The metadata element and its number of adjustments in each step are listed. (B) Metadata structure of force constant calculation.

The computational workflow of the force constant calculation is shown in

The metadata schema of the force constant calculation is shown in

Metadata schema of force constant calculation constructed by UML.

The energy and force information obtained from first-principles calculations enables us to further analyze the crystal properties. In general, users pay more attention to the data obtained by analyzing and processing the existing source data. In the framework of thermal conductivity calculations discussed herein, the PBTE is iteratively solved using ShengBTE, which takes FC2 and FC3 as inputs.
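For orientation, the quantity delivered by the iterative PBTE solution can be written as below. This is the expression commonly given for ShengBTE-type solvers and is not stated in the text above; the notation is an assumption introduced here, with Ω the unit cell volume, N the number of q-points, f_0 the Bose-Einstein distribution, and ω_λ and v_λ the frequency and group velocity of phonon mode λ.

```latex
\kappa_{l}^{\alpha\beta}
  = \frac{1}{k_{B} T^{2} \Omega N}
    \sum_{\lambda} f_{0}\,(f_{0}+1)\,(\hbar\omega_{\lambda})^{2}\,
    v_{\lambda}^{\alpha} F_{\lambda}^{\beta},
\qquad
\mathbf{F}_{\lambda} = \tau_{\lambda}^{0}\,
    \left(\mathbf{v}_{\lambda} + \boldsymbol{\Delta}_{\lambda}\right)
```

In this form, FC2 determines ω_λ and v_λ, while FC3 determines the three-phonon scattering rates that enter the relaxation time τ_λ^0 and the correction term Δ_λ. Setting Δ_λ to zero recovers the relaxation time approximation, which typically serves as the starting point of the iteration.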

In the metadata schema, the organized metadata structure for the lattice thermal conductivity calculation shown in

(A) Workflow of lattice thermal conductivity calculation. The metadata element and its number of adjustments in each step are listed. (B) Metadata structure of lattice thermal conductivity calculation.

Metadata schema of lattice thermal conductivity calculation constructed by UML.

Under the guidance of the proposed metadata schema, we take the well-studied silicon as an example. Three metadata example tables, covering structural optimization, second- and third-order force constant calculation and lattice thermal conductivity calculation, are presented in ^{[36]}. At 300 K, the calculated value is 131 W/(m K), slightly lower than 146 W/(m K)^{[37]} and 140 W/(m K)^{[38]} due to the sparse

Lattice thermal conductivity of bulk silicon as a function of temperature.

The cornerstone of data-driven materials research is the availability of massive AI-ready datasets suitable for artificial intelligence techniques. The standardization of data is a key part of ensuring data quality. The ^{[21]} is a pioneering effort to provide basic rules for the content of data, which specifies that materials data are generally divided into three classes: sample information (the material model generated by calculation is considered the virtual sample); source data (the unprocessed material data generated by characterization or measurement, or the virtual source data generated by calculation); and processed data. Each individual data entry should cover only one action event (sample preparation, sample characterization or data analysis) and collect the information related to that action as completely as possible.
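The three data classes and the one-action-per-entry rule can be sketched as a small data model. The example below is a hedged illustration: the class and field names, the `derived_from` link and the pairing of each calculation step with an action event are assumptions made for this sketch, while the mapping of the three steps to the three data classes follows the description in this work.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataClass(Enum):
    SAMPLE_INFORMATION = "sample information"
    SOURCE_DATA = "source data"
    PROCESSED_DATA = "processed data"


class ActionEvent(Enum):
    SAMPLE_PREPARATION = "sample preparation"
    SAMPLE_CHARACTERIZATION = "sample characterization"
    DATA_ANALYSIS = "data analysis"


@dataclass(frozen=True)
class DataEntry:
    """One data entry covering exactly one action event."""
    entry_id: str
    data_class: DataClass
    action: ActionEvent
    derived_from: Optional[str] = None  # hypothetical link to the parent entry


def build_workflow_entries(prefix):
    """Create one entry per calculation step (step/action pairing is assumed)."""
    return [
        DataEntry(f"{prefix}-opt", DataClass.SAMPLE_INFORMATION,
                  ActionEvent.SAMPLE_PREPARATION),
        DataEntry(f"{prefix}-fc", DataClass.SOURCE_DATA,
                  ActionEvent.SAMPLE_CHARACTERIZATION, f"{prefix}-opt"),
        DataEntry(f"{prefix}-kappa", DataClass.PROCESSED_DATA,
                  ActionEvent.DATA_ANALYSIS, f"{prefix}-fc"),
    ]


def provenance_chain(entries, entry_id):
    """Follow `derived_from` links from `entry_id` back to the virtual sample."""
    by_id = {entry.entry_id: entry for entry in entries}
    chain = []
    current = entry_id
    while current is not None:
        chain.append(current)
        current = by_id[current].derived_from
    return chain
```

With `entries = build_workflow_entries("Si")`, the call `provenance_chain(entries, "Si-kappa")` walks from the thermal conductivity entry through the force constant entry back to the structural optimization entry, which is the kind of traceability that separate, linked entries are meant to provide.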

Motivated by the urgent demand in the materials science community for sharing and exchanging data, we have proposed a full metadata schema for lattice thermal conductivity from first-principles calculations. The calculation of lattice thermal conductivity is divided into three consecutive processes, namely, structural optimization, force constant calculation and lattice thermal conductivity calculation. The data generated during the three processes correspond to the virtual sample information, virtual source data and processed data, respectively, as specified in the

This study provides an exemplary use case when applying the

The metadata and the metadata schema we propose are free to use. Notably, the proposed metadata schema aims to help data generated by computational and experimental tools conform to the FAIR principles and to greatly promote the interoperability and integration of data. The commercial computation codes used to generate the data described by the proposed metadata schema must, of course, be properly licensed.

Performed the research and drafted the manuscript: Rao Y

Designed the study, performed data analysis and interpretation, revised, and finalized the manuscript: Rao Y, Lu Y, Zhang L, Ju S, Yu N, Zhang A, Chen L, Wang H

The data dictionary and metadata example tables are available in the Supplementary Materials, and all input and output files of the Si calculation are attached in

This work was supported by the National Key R&D Program of China (2021YFB3702300), National Natural Science Foundation of China (No. 52006134), and Shanghai Pujiang Program (No. 20PJ1407500), and the Major Science and Technology Project of Yunnan Province “Genome Engineering of Rare and Precious Metal Materials in Yunnan Province (Phase One 2020)” (No. 202002AB080001-1). The computations in this paper were run on the

All authors declared that there are no conflicts of interest.

Not applicable.

Not applicable.

© The Author(s) 2022.

Supplementary Materials
