.. _tutorial_krr_julia:
Kernel ridge regression in Julia
================================
Here we show how to train and use a kernel ridge regression (KRR) model in MLatom. This functionality is based on our implementation in `Julia `_.
Prerequisites
---------------
- ``MLatom 3.18.0`` or later
- ``Julia 1.10.2`` (any Julia v1.4+ should work)
- ``julia 0.6.2`` (this is PyJulia, a Python package; not Julia itself)
.. warning::
PyCall.jl in Julia does not support Python 3.12 (as of June 28th, 2025); support may come later. Use Python 3.11 instead.
First of all, you should have Julia installed in your local environment. You can download it from `its website `_ and install it following the `instructions `_.
After you have successfully downloaded and configured Julia, you can enter ``julia`` in the command line to open an interactive session.
.. code-block:: text
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.10.2 (2024-03-01)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
julia>
.. note::
As you can see, we are using Julia 1.10.2, but any Julia v1.4+ should work with PyJulia.
If you have no knowledge of Julia, that is totally fine, because we call Julia functions from Python. Before doing that, we need to install the bridge package in Python:
.. code-block:: bash
pip install julia==0.6.2
Then, in Python, run the following code:
.. code-block:: python
import julia
julia.install()
But we are not done yet: if your Python interpreter is statically linked to libpython, PyJulia will not fully support it. In that case, we recommend using ``python-jl`` to run your script:
.. code-block:: bash
python-jl your_script.py
If everything is set up correctly, you will be able to run the following code:
.. code-block:: python
from julia import Base
print(Base.sind(90))  # prints 1.0
Data and models
-------------------
Here you can download the data files that will be used in the following tutorial:
- :download:`E_FCI_451.dat `
- :download:`E_UHF_451.dat `
- :download:`R_451.dat `
ML database
--------------
We introduce a new data class called ``ml_database``, in which you can store x and y values. ``ml.models.krr`` uses ``ml_database.x`` and ``ml_database.y`` to train the model.
Here is an example of how to work with this class.
.. code-block:: python
import mlatom as ml
mlDB = ml.data.ml_database()
mlDB.read_from_x_file(filename="R_451.dat") # This will be saved in mlDB.x
mlDB.read_from_y_file(filename="E_FCI_451.dat",property_name="fci")
mlDB.read_from_y_file(filename="E_UHF_451.dat",property_name="uhf")
mlDB.y = mlDB.get_property("fci")
Train a KRR model
-------------------
After loading the ML database, we can train a model.
.. code-block:: python
model = ml.models.krr(model_file="E_FCI.npz",kernel_function='Gaussian',ml_program='mlatom.jl') # The model will be saved in E_FCI.npz. Gaussian function is used as the kernel function.
model.hyperparameters['sigma'].value = 4.0
model.hyperparameters['lambda'].value = 0.00001
model.train(ml_database=mlDB,prior=-1.0) # Set the prior value. The prior will be subtracted from each y value before training (and added back after prediction).
# Predict values with the model
model.predict(ml_database=mlDB,property_to_predict='fci_yest')
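Under the hood, KRR with a prior amounts to solving a regularized linear system: the prior is subtracted from the labels before solving and added back at prediction time. A minimal NumPy sketch of this logic (with toy data standing in for the tutorial files, independent of MLatom's Julia implementation):

```python
import numpy as np

# Toy stand-ins for the tutorial data (hypothetical values, not R_451.dat)
X = np.linspace(0.6, 4.0, 20).reshape(-1, 1)   # 1-D "geometries"
y = np.exp(-X[:, 0]) - 1.0                     # 1-D "energies"
sigma, lam, prior = 0.5, 1e-5, -1.0

def kernel_matrix(A, B, sigma):
    # Gaussian kernel k(x1, x2) = exp(-|x1 - x2|^2 / (2 sigma^2))
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

K = kernel_matrix(X, X, sigma)
# Training: subtract the prior, then solve (K + lambda*I) alpha = y - prior
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y - prior)
# Prediction: evaluate the kernel expansion and add the prior back
y_pred = kernel_matrix(X, X, sigma) @ alpha + prior
```

This is only a conceptual sketch; MLatom's ``mlatom.jl`` backend performs the equivalent computation in Julia.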
Several kernel functions are currently supported:
- Gaussian kernel function: :math:`k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert ^{2} / (2\sigma ^{2}))`
- Periodic Gaussian kernel function: :math:`k(\textbf{x}_1,\textbf{x}_2) = \exp(-2\sin^2 (\pi \lvert \textbf{x}_1-\textbf{x}_2 \rvert / p) / \sigma ^2)`, where :math:`p` is the period
- Decaying periodic Gaussian kernel function: :math:`k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert ^{2} / (2\sigma ^{2}) - 2\sin^2 (\pi \lvert \textbf{x}_1-\textbf{x}_2 \rvert / p) / \sigma_{p} ^2)`
- Matern kernel function: :math:`k(\textbf{x}_1,\textbf{x}_2) = \exp(-\lvert \textbf{x}_1-\textbf{x}_2 \rvert / \sigma) \sum \limits _{k=0} ^{n} \frac {(n+k)!} {(2n)!} \begin{pmatrix} n \\ k \end{pmatrix} (2\lvert \textbf{x}_1-\textbf{x}_2 \rvert / \sigma)^{n-k}`
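To make the Gaussian formula concrete, it can be evaluated directly for two 1-D points (a plain NumPy sketch, independent of MLatom's Julia implementation):

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    # k(x1, x2) = exp(-|x1 - x2|^2 / (2 sigma^2))
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

x1, x2 = np.array([0.7]), np.array([0.9])
print(gaussian_kernel(x1, x1))             # identical points give exactly 1.0
print(gaussian_kernel(x1, x2, sigma=4.0))  # nearby points give values close to 1.0
```

Larger ``sigma`` makes the kernel decay more slowly with distance, i.e. points further apart are still considered similar.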
Hyperparameters of these kernel functions:
.. table::
:align: center
============================= ================================== ====================================
kernel function how to call in MLatom hyperparameters
============================= ================================== ====================================
Gaussian ``'Gaussian'`` ``sigma``
Periodic Gaussian ``'periodic_Gaussian'`` ``sigma``, ``period``
Decaying periodic Gaussian ``'decaying_periodic_Gaussian'`` ``sigma``, ``period``, ``sigmap``
Matern ``'Matern'`` ``sigma``, ``nn``
============================= ================================== ====================================
Default values of hyperparameters:
.. table::
:align: center
================== =============== ======================================================
hyperparameters    default value   description
================== =============== ======================================================
``sigma``          ``1.0``         kernel width (length scale)
``period``         ``2E-35``       period of the periodic kernels
``sigmap``         ``100.0``       width of the periodic part (decaying periodic kernel)
``nn``             ``1.0``         order of the Matern kernel
``lambda``         ``2``           regularization parameter of KRR
================== =============== ======================================================
Load a KRR model
------------------
If the model file exists, it will be loaded automatically during the initialization of the instance.
.. code-block:: python
import mlatom as ml
mlDB = ml.data.ml_database()
mlDB.read_from_x_file(filename="R_451.dat")
model = ml.models.krr(model_file='E_FCI.npz',ml_program='mlatom.jl')
model.predict(ml_database=mlDB,property_to_predict='fci_yest')
Optimize hyperparameters
-------------------------
Below, we show how to optimize hyperparameters using grid search.
.. code-block:: python
import mlatom as ml
mlDB = ml.data.ml_database()
mlDB.read_from_x_file(filename="R_451.dat") # This will be saved in mlDB.x
mlDB.read_from_y_file(filename="E_FCI_451.dat",property_name="fci")
mlDB.y = mlDB.get_property("fci")
# Optimize hyperparameters
model = ml.models.krr(model_file="E_FCI_grid_search.npz",kernel_function='Gaussian',ml_program='mlatom.jl')
model.hyperparameters['sigma'].minval = 2**-5 # Set the minimum value of sigma
model.hyperparameters['sigma'].maxval = 2**9 # Set the maximum value of sigma
model.hyperparameters['sigma'].optimization_space = 'log'
model.hyperparameters['lambda'].minval = 2**-35 # Set the minimum value of lambda
model.hyperparameters['lambda'].maxval = 1.0 # Set the maximum value of lambda
model.hyperparameters['lambda'].optimization_space = 'log'
sub,val = mlDB.split(number_of_splits=2, fraction_of_points_in_splits=[0.9, 0.1], sampling='none')
model.optimize_hyperparameters(subtraining_ml_database=sub,validation_ml_database=val,
optimization_algorithm='grid',
hyperparameters=['lambda','sigma'],
training_kwargs={'prior':-1.0},
prediction_kwargs={'property_to_predict':'yest'},debug=False)
model.predict(ml_database=mlDB,property_to_predict='fci_yest')
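The grid search above scans ``sigma`` and ``lambda`` on logarithmic grids and keeps the combination with the lowest validation error. A self-contained NumPy sketch of the same idea, using toy data and a 90/10 split without shuffling (as with ``sampling='none'``):

```python
import numpy as np

# Toy stand-ins for the tutorial data (hypothetical values)
X = np.linspace(0.6, 4.0, 40).reshape(-1, 1)
y = np.exp(-X[:, 0]) - 1.0

def K(A, B, sigma):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# 90/10 subtraining/validation split without shuffling
n_sub = int(0.9 * len(X))
Xs, ys, Xv, yv = X[:n_sub], y[:n_sub], X[n_sub:], y[n_sub:]

best = (None, None, np.inf)  # (sigma, lambda, validation RMSE)
for log2s in range(-5, 10):          # sigma on a log grid: 2^-5 .. 2^9
    for log2l in range(-35, 1, 5):   # lambda on a log grid: 2^-35 .. 1
        sigma, lam = 2.0 ** log2s, 2.0 ** log2l
        alpha = np.linalg.solve(K(Xs, Xs, sigma) + lam * np.eye(n_sub), ys)
        rmse = np.sqrt(np.mean((K(Xv, Xs, sigma) @ alpha - yv) ** 2))
        if rmse < best[2]:
            best = (sigma, lam, rmse)
print(best)
```

MLatom's grid search additionally handles the prior and model saving for you; this sketch only illustrates the log-space scan itself.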
Multiple outputs
----------------
The model can be trained on multiple y values at the same time. Below we show how to train a model on both FCI and UHF energies.
.. code-block:: python
import mlatom as ml
import numpy as np
mlDB = ml.data.ml_database()
mlDB.read_from_x_file(filename="R_451.dat")
mlDB.read_from_y_file(filename="E_FCI_451.dat",property_name="fci")
mlDB.read_from_y_file(filename="E_UHF_451.dat",property_name="uhf")
Ntrain = len(mlDB.get_property('fci'))
mlDB.y = np.concatenate((mlDB.get_property('fci').reshape((Ntrain,1)),mlDB.get_property('uhf').reshape((Ntrain,1))),axis=1)
model = ml.models.krr(model_file="E_FCI_multi_output.npz",kernel_function='Gaussian',ml_program='mlatom.jl')
# Optimize hyperparameters
model.hyperparameters['sigma'].minval = 2**-5 # Set the minimum value of sigma
model.hyperparameters['sigma'].maxval = 2**9 # Set the maximum value of sigma
model.hyperparameters['sigma'].optimization_space = 'log'
model.hyperparameters['lambda'].minval = 2**-35 # Set the minimum value of lambda
model.hyperparameters['lambda'].maxval = 1.0 # Set the maximum value of lambda
model.hyperparameters['lambda'].optimization_space = 'log'
sub,val = mlDB.split(number_of_splits=2, fraction_of_points_in_splits=[0.9, 0.1], sampling='none')
model.optimize_hyperparameters(subtraining_ml_database=sub,validation_ml_database=val,
optimization_algorithm='grid',
hyperparameters=['lambda','sigma'],
training_kwargs={'prior':-1.0},
prediction_kwargs={'property_to_predict':'yest'},debug=False)
model.predict(ml_database=mlDB,property_to_predict='fci_uhf_yest')
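With multiple outputs, the same kernel system is solved once with a matrix of right-hand sides, yielding one column of regression coefficients per property. A toy NumPy sketch of this idea (hypothetical values standing in for the FCI and UHF energies):

```python
import numpy as np

X = np.linspace(0.6, 4.0, 20).reshape(-1, 1)
# Two toy property columns standing in for FCI and UHF energies
Y = np.column_stack([np.exp(-X[:, 0]) - 1.0,
                     np.exp(-0.5 * X[:, 0]) - 1.0])

d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-d2 / (2.0 * 0.5 ** 2))  # Gaussian kernel, sigma = 0.5
# One linear solve handles both columns of Y at once
Alpha = np.linalg.solve(K + 1e-5 * np.eye(len(X)), Y)
Y_pred = K @ Alpha
print(Alpha.shape, Y_pred.shape)  # (20, 2) (20, 2)
```

Training on both properties together therefore costs essentially the same as training on one, since the expensive kernel factorization is shared.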