.. _tutorial_krr_julia:

Kernel ridge regression in Julia
================================

Here we show how to use kernel ridge regression (KRR) in MLatom to train and use your model. This functionality is based on our implementation in `Julia `_.

Prerequisites
-------------

- ``MLatom 3.18.0`` or later
- ``Julia 1.10.2`` (any Julia v1.4+ should work)
- ``julia 0.6.2`` (this is the PyJulia Python package, not Julia itself)

.. warning::

   As of June 28th, 2025, PyCall.jl in Julia does not support Python 3.12; support may be added later. You can use Python 3.11 instead.

Naturally, you should have Julia installed in your local environment. You can download it easily from `its website `_ and install it following the `instructions `_. After you have successfully downloaded and configured Julia, you can enter ``julia`` in the command line to start an interactive session:

.. code-block::

                  _
      _       _ _(_)_     |  Documentation: https://docs.julialang.org
     (_)     | (_) (_)    |
      _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
     | | | | | | |/ _` |  |
     | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
    _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
   |__/                   |

   julia>

.. note::

   As you can see, we are using Julia 1.10.2, but any Julia v1.4+ is sufficient for PyJulia to work.

If you have no knowledge of Julia, that is totally fine, because we call Julia functions from Python. Before that, we need to install the PyJulia bridge in Python:

.. code-block:: bash

   pip install julia==0.6.2

Then, in Python, run the following code:

.. code-block:: python

   import julia
   julia.install()

One caveat remains: if your Python interpreter is statically linked to libpython, PyJulia does not fully support it. We therefore recommend using ``python-jl`` to run your script:

.. code-block:: bash

   python-jl your_script.py

If everything works, you should be able to run the following code:
.. code-block:: python

   from julia import Base
   print(Base.sind(90))

Data and models
---------------

Here you can download the dataset and models that will be used in the following tutorials.

- :download:`E_FCI_451.dat `
- :download:`E_UHF_451.dat `
- :download:`R_451.dat `

ML database
-----------

We introduce a new data class called ``ml_database``, in which you can store x and y values. ``ml.models.krr`` uses ``ml_database.x`` and ``ml_database.y`` to train the model. Here is an example of how to work with this class:

.. code-block:: python

   import mlatom as ml

   mlDB = ml.data.ml_database()
   mlDB.read_from_x_file(filename="R_451.dat")  # This will be saved in mlDB.x
   mlDB.read_from_y_file(filename="E_FCI_451.dat", property_name="fci")
   mlDB.read_from_y_file(filename="E_UHF_451.dat", property_name="uhf")
   mlDB.y = mlDB.get_property("fci")

Train a KRR model
-----------------

After loading the ML database, we can train a model:

.. code-block:: python

   # The model will be saved in E_FCI.npz; the Gaussian function is used as the kernel function.
   model = ml.models.krr(model_file="E_FCI.npz", kernel_function='Gaussian', ml_program='mlatom.jl')
   model.hyperparameters['sigma'].value = 4.0
   model.hyperparameters['lambda'].value = 0.00001
   # Set the prior value: it is subtracted from each y value before training
   # (and added back after prediction).
   model.train(ml_database=mlDB, prior=-1.0)
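Under the hood, KRR with the Gaussian kernel amounts to solving a regularized linear system. The following is a minimal, purely illustrative NumPy sketch of the idea — it is not MLatom's Julia implementation, and the helper names (``gaussian_kernel``, ``krr_train``, ``krr_predict``) and toy data are invented here:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    # k(x1, x2) = exp(-|x1 - x2|^2 / (2 sigma^2))
    d2 = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_train(x, y, sigma, lam, prior=0.0):
    # Solve (K + lambda * I) alpha = y - prior
    K = gaussian_kernel(x, x, sigma)
    return np.linalg.solve(K + lam * np.eye(len(x)), y - prior)

def krr_predict(x_train, alpha, x_new, sigma, prior=0.0):
    # y_est = K(x_new, x_train) @ alpha + prior
    return gaussian_kernel(x_new, x_train, sigma) @ alpha + prior

# Toy 1D example, loosely mimicking a dissociation-like energy curve
x = np.linspace(0.5, 3.0, 15).reshape(-1, 1)
y = 1.0 / x[:, 0] - 1.0
alpha = krr_train(x, y, sigma=0.5, lam=1e-8, prior=-1.0)
yest = krr_predict(x, alpha, x, sigma=0.5, prior=-1.0)
```

With a small ``lam``, the model essentially interpolates the training points, which is why the prior is recovered on prediction.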
Predictions can then be made with the trained model:

.. code-block:: python

   # Predict values with the model
   model.predict(ml_database=mlDB, property_to_predict='fci_yest')

Several kernel functions are currently supported:

- Gaussian kernel function: :math:`k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert ^{2} / (2\sigma ^{2}))`
- Periodic Gaussian kernel function: :math:`k(\textbf{x}_1,\textbf{x}_2) = \exp(-2\sin ^2 (\pi \lvert \textbf{x}_1-\textbf{x}_2 \rvert) / \sigma ^2)`
- Decaying periodic Gaussian kernel function: :math:`k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert ^{2} / (2\sigma ^{2}) - 2\sin ^2 (\pi \lvert \textbf{x}_1-\textbf{x}_2 \rvert) / \sigma_{p} ^2)`
- Matern kernel function: :math:`k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert / \sigma) \sum\limits_{k=0}^{n} \frac{(n+k)!}{(2n)!} \binom{n}{k} (2\lvert \textbf{x}_1-\textbf{x}_2 \rvert / \sigma)^{n-k}`

Hyperparameters of these kernel functions:

.. table::
   :align: center

   ==========================  ================================  =================================
   kernel function             how to call in MLatom             hyperparameters
   ==========================  ================================  =================================
   Gaussian                    ``'Gaussian'``                    ``sigma``
   Periodic Gaussian           ``'periodic_Gaussian'``           ``sigma``, ``period``
   Decaying periodic Gaussian  ``'decaying_periodic_Gaussian'``  ``sigma``, ``period``, ``sigmap``
   Matern                      ``'Matern'``                      ``sigma``, ``nn``
   ==========================  ================================  =================================

Default values of the hyperparameters:
.. table::
   :align: center

   ==============  =============  ================================================
   hyperparameter  default value  description
   ==============  =============  ================================================
   ``sigma``       ``1.0``        kernel width
   ``period``      ``2E-35``      period of the periodic kernels
   ``sigmap``      ``100.0``      :math:`\sigma_p` of the decaying periodic kernel
   ``nn``          ``1.0``        order :math:`n` of the Matern kernel
   ``lambda``      ``2``          regularization parameter of KRR
   ==============  =============  ================================================

Load a KRR model
----------------

If the model file exists, it is loaded automatically during the initialization of the model instance:

.. code-block:: python

   import mlatom as ml

   mlDB = ml.data.ml_database()
   mlDB.read_from_x_file(filename="R_451.dat")
   model = ml.models.krr(model_file='E_FCI.npz', ml_program='mlatom.jl')
   model.predict(ml_database=mlDB, property_to_predict='fci_yest')

Optimize hyperparameters
------------------------

Below, we show how to optimize the hyperparameters using a grid search:

.. code-block:: python

   import mlatom as ml

   mlDB = ml.data.ml_database()
   mlDB.read_from_x_file(filename="R_451.dat")  # This will be saved in mlDB.x
   mlDB.read_from_y_file(filename="E_FCI_451.dat", property_name="fci")
   mlDB.y = mlDB.get_property("fci")

   # Optimize hyperparameters
   model = ml.models.krr(model_file="E_FCI_grid_search.npz", kernel_function='Gaussian', ml_program='mlatom.jl')
   model.hyperparameters['sigma'].minval = 2**-5    # Set the minimum value of sigma
   model.hyperparameters['sigma'].maxval = 2**9     # Set the maximum value of sigma
   model.hyperparameters['sigma'].optimization_space = 'log'
   model.hyperparameters['lambda'].minval = 2**-35  # Set the minimum value of lambda
   model.hyperparameters['lambda'].maxval = 1.0     # Set the maximum value of lambda
   model.hyperparameters['lambda'].optimization_space = 'log'
   sub, val = mlDB.split(number_of_splits=2, fraction_of_points_in_splits=[0.9, 0.1], sampling='none')
   model.optimize_hyperparameters(subtraining_ml_database=sub, validation_ml_database=val,
                                  optimization_algorithm='grid',
                                  hyperparameters=['lambda', 'sigma'],
                                  training_kwargs={'prior': -1.0},
                                  prediction_kwargs={'property_to_predict': 'yest'},
                                  debug=False)
   model.predict(ml_database=mlDB, property_to_predict='fci_yest')

Multi-outputs
-------------

The model can be trained on multiple y values at the same time. Below, we show how to train a model on both FCI and UHF energies:

.. code-block:: python

   import numpy as np

   import mlatom as ml

   mlDB = ml.data.ml_database()
   mlDB.read_from_x_file(filename="R_451.dat")
   mlDB.read_from_y_file(filename="E_FCI_451.dat", property_name="fci")
   mlDB.read_from_y_file(filename="E_UHF_451.dat", property_name="uhf")
   # Stack the two properties column-wise so that y has shape (Ntrain, 2)
   Ntrain = len(mlDB.y)
   mlDB.y = np.concatenate((mlDB.get_property('fci').reshape((Ntrain, 1)),
                            mlDB.get_property('uhf').reshape((Ntrain, 1))), axis=1)

   model = ml.models.krr(model_file="E_FCI_multi_output.npz", kernel_function='Gaussian', ml_program='mlatom.jl')
   # Optimize hyperparameters
   model.hyperparameters['sigma'].minval = 2**-5    # Set the minimum value of sigma
   model.hyperparameters['sigma'].maxval = 2**9     # Set the maximum value of sigma
   model.hyperparameters['sigma'].optimization_space = 'log'
   model.hyperparameters['lambda'].minval = 2**-35  # Set the minimum value of lambda
   model.hyperparameters['lambda'].maxval = 1.0     # Set the maximum value of lambda
   model.hyperparameters['lambda'].optimization_space = 'log'
   sub, val = mlDB.split(number_of_splits=2, fraction_of_points_in_splits=[0.9, 0.1], sampling='none')
   model.optimize_hyperparameters(subtraining_ml_database=sub, validation_ml_database=val,
                                  optimization_algorithm='grid',
                                  hyperparameters=['lambda', 'sigma'],
                                  training_kwargs={'prior': -1.0},
                                  prediction_kwargs={'property_to_predict': 'yest'},
                                  debug=False)
   model.predict(ml_database=mlDB, property_to_predict='fci_uhf_yest')
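Conceptually, multi-output KRR is cheap: the same kernel matrix is built once and one coefficient column is obtained per property, which is what stacking the FCI and UHF energies column-wise amounts to. The following is an illustrative plain-NumPy sketch with invented toy targets — not MLatom's Julia implementation:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    # k(x1, x2) = exp(-|x1 - x2|^2 / (2 sigma^2))
    d2 = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

# Two smooth toy targets standing in for the FCI and UHF curves
x = np.linspace(0.5, 3.0, 15).reshape(-1, 1)
Y = np.column_stack([np.sin(x[:, 0]), np.cos(x[:, 0])])  # shape (15, 2)

sigma, lam = 1.0, 1e-8
K = gaussian_kernel(x, x, sigma)
# One regression-coefficient column per output property:
# solve (K + lambda * I) A = Y for A of shape (15, 2)
A = np.linalg.solve(K + lam * np.eye(len(x)), Y)
Yest = K @ A  # predictions for both properties at once
```

Because ``np.linalg.solve`` accepts a matrix right-hand side, both properties are fitted in a single factorization of the kernel matrix.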