.. _tutorial_tsh_adl:

Active delta-learning for TSH
=============================

This tutorial demonstrates how to run **active delta-learning (ADL)** for nonadiabatic molecular dynamics in MLatom. In delta-learning, the ML model predicts the *difference* between a high-level target method and a cheap baseline, rather than the full potential energy surface. Combined with active learning (AL), this yields dramatically faster convergence and better data efficiency than pure ML or transfer learning approaches — particularly for challenging systems with large conformational spaces.

For the full methodology and benchmarks, see the ADL paper and its GitHub repository:

- Mikołaj Martyka, Joanna Jankowska*, Pavlo O. Dral*. Active Delta-Learning for Surface Hopping: Efficient Fitting of Potential Energy Surfaces. 2026. Preprint on ChemRxiv: https://doi.org/10.26434/chemrxiv.15002308/v1 (2026.04.22).
- https://github.com/dralgroup/adl-namd.

For the standard (non-delta) active learning tutorials, see:

- :ref:`ATL-NAMD: Active transfer learning for NAMD via fine-tuning of OMNI-P2x <omnip2x_atl_namd>` — fine-tuning a pre-trained universal potential (recommended for non-ADL AL).
- :ref:`AL-NAMD <tutorial_tsh_al>` — active learning from scratch with a pure ML model.

In this tutorial, we use fulvene as the worked example, targeting CASSCF(6,6) accuracy with ODM2/MRCI as the baseline. The same procedure applies to any target/baseline combination available in MLatom.

Background
----------

Standard active learning for ML-NAMD trains a neural network to predict the full excited-state PES from scratch (or from pre-trained weights like OMNI-P2x). While this works well for simple molecules, it can struggle with systems that have large conformational spaces or multiple competing reaction channels — the ML model must learn the entire complexity of the PES, which requires extensive sampling.

Delta-learning addresses this by changing the learning target. Instead of predicting the full energy and gradients, the model learns the correction between a fast baseline method and the expensive target:

:math:`\Delta E = E_\text{target} - E_\text{baseline}`

At inference, the baseline calculation is performed at each time step, and the ML correction is added on top. This has two key advantages: (1) the correction surface is typically much smoother than the full PES, requiring fewer training points, and (2) the baseline provides a physically reasonable fallback — even where the ML model extrapolates, the prediction stays anchored to a qualitatively correct PES rather than diverging to unphysical values.

For fulvene, ADL achieves convergence in ~5 iterations (1500 training points), compared to ~11 iterations for fine-tuning from OMNI-P2x and ~19 for training from scratch. For the more challenging PSB4 system, pure ML active learning fails to converge even after 50 iterations, while ADL converges in just 2.

The trade-off is that each time step during dynamics requires a baseline QM calculation (here ODM2/MRCI), making individual trajectories somewhat slower. However, the overall wall-clock time is reduced because far fewer AL iterations are needed.

Prerequisites
-------------

- **MLatom** ≥ 3.22 (``pip install --upgrade mlatom``)
- **PyTorch** with CUDA support (recommended)
- A target reference method interfaced with MLatom (here: COLUMBUS for CASSCF)
- A baseline method interfaced with MLatom (here: MNDO for ODM2/MRCI)
- An equilibrium geometry saved as ``eqmol.json``
- Input files for both methods (here: ``columbus/`` directory for CASSCF, ``mndokw`` for ODM2)

Input Script
------------

Below is the complete input script. We walk through each section afterward.

.. code-block:: python

    import mlatom as ml
    import torch
    import os

    torch.cuda.empty_cache()
    root_dir = os.getcwd()

    # 1. Target method (high-level)
    col = ml.models.methods(
        program='columbus',
        command_line_arguments=['-m', '1700'],
        directory_with_input_files=f'{root_dir}/columbus',
        save_files_in_current_directory=False,
    )
    col_kwargs = {
        'nstates': 2,
        'current_state': 0,
        'calculate_energy_gradients': [True] * 2,
    }

    # 2. Baseline method (cheap)
    odm2 = ml.models.methods(
        method='ODM2',
        save_files_in_current_directory=False,
        read_keywords_from_file='mndokw',
    )
    odm2_kwargs = {
        'nstates': 2,
        'calculate_energy_gradients': [True, True],
    }

    # 3. Load equilibrium molecule
    eqmol = ml.molecule.load('eqmol.json', format='json')

    # 4. Sampler settings
    init_sampler_kwargs = {
        'eqmol': eqmol,
        'initial_temperature': 0,
        'number_of_initial_conditions': 50,
    }

    sampler_kwargs = {
        'maximum_propagation_time': 60.0,
        'time_step': 0.1,
        'nstates': 2,
        'initial_state': 1,
        'prevent_back_hop': False,
        'dump_trajectory_interval': 10,
        'reduce_memory_usage': True,
        'min_surfuq_points': 2,
        'ngapmd_trajs': 50,
        'max_hopprobuq_points': 15,
        'prob_uq_threshold': 0.1,
        'plot_populations': False,
        'stop_uncertain_md': True,
        'initcond_sampler_kwargs': init_sampler_kwargs,
        'proceed_on_error': True,
    }

    # 5. Initial dataset sampling
    init_dataset_sampler_kwargs = {
        'molecule': eqmol,
        'initial_temperature': 0,
        'number_of_initial_conditions': 300,
    }

    # 6. ML model settings — delta-learning
    ml_model_kwargs = {
        'nstates': 2,
        'gap_weight': 0.0,
        'baseline': odm2,
        'baseline_kwargs': odm2_kwargs,
    }

    # 7. Run active delta-learning
    ml.al(
        molecule=eqmol,
        reference_method=col,
        ml_model='delta_msani',
        baseline_method=odm2,
        baseline_method_kwargs=odm2_kwargs,
        sampler='lzsh',
        initial_points_refinement='one-shot',
        initdata_sampler='wigner',
        initdata_sampler_kwargs=init_dataset_sampler_kwargs,
        sampler_kwargs=sampler_kwargs,
        refmethod_kwargs=col_kwargs,
        nthreads=12,
        ml_model_kwargs=ml_model_kwargs,
        new_points=300,
    )

Walkthrough
-----------

1–2. Target and baseline methods
++++++++++++++++++++++++++++++++

The key difference from the standard AL tutorial is that ADL requires **two** electronic structure methods:

- **Target** (``col``): The expensive, accurate method you want to emulate — here CASSCF(6,6) via COLUMBUS. Training data is labeled with this method.
- **Baseline** (``odm2``): A fast, approximate method — here ODM2/MRCI. This is called at every time step during dynamics propagation, and the ML model learns the correction on top of it.

The baseline should be qualitatively correct (right number of states, reasonable topology near conical intersections) but need not be quantitatively accurate. Semiempirical MRCI methods like ODM2/MRCI are ideal baselines because they are fast, handle multireference character, and provide analytical gradients.

3–5. Molecule, sampler, and initial dataset
++++++++++++++++++++++++++++++++++++++++++++

These are identical to the standard AL tutorial. The initial dataset is generated via iterative Wigner sampling with a statistical convergence criterion (see the :ref:`ATL-NAMD tutorial <tutorial_tsh_al>` for details).

6. ML model settings — what's different
++++++++++++++++++++++++++++++++++++++++

The ``ml_model_kwargs`` configure the delta-learning model:

- ``'baseline': odm2`` and ``'baseline_kwargs': odm2_kwargs`` — tell the model to compute ΔE = E_target − E_baseline during training, and to add the baseline prediction back at inference.
- ``'gap_weight': 0.0`` — the gap loss is set to zero. This is because the delta-learning model already benefits from the baseline's correct gap ordering; adding a gap penalty on top of the correction is unnecessary and can hurt convergence.

7. The ``ml.al()`` call — key differences
+++++++++++++++++++++++++++++++++++++++++++

Three arguments distinguish ADL from standard AL:

- ``ml_model='delta_msani'`` — uses the delta-learning variant of the MS-ANI architecture, instead of ``'vecmsani'`` (pure ML) or the fine-tuning variant.
- ``baseline_method=odm2`` and ``baseline_method_kwargs=odm2_kwargs`` — passed to ``ml.al()`` so the AL loop knows to use the baseline for trajectory propagation.
- ``initial_points_refinement='one-shot'`` — uses a single-pass initial dataset generation instead of the iterative refinement used in standard AL. This is appropriate for delta-learning because the correction surface is smoother and less sensitive to the initial dataset composition.

Understanding the Output
------------------------

:download:`Download the archive with all input and output files <tutorial_files/tutorial_namd/adl_run.tar.gz>` for this tutorial. Extract it to your working directory:

- Input script ``al.py`` (see also below).
- Sample output log ``train.out`` from a run of the script above.
- An equilibrium geometry ``eqmol.json`` in MLatom's JSON format.
- Initial conditions database ``init_cond_db.json`` for production NAMD runs.
- Active learning info ``al_info.json``.
- Input files for the reference CASSCF method (``columbus/`` directory with COLUMBUS inputs).
- Input keywords for the baseline ODM2 method (``mndokw``).
- Iteration snapshots (``iterations/``) containing trained models (``mlmodel.pt``, ``aux_mlmodel.pt``), training and labeling databases (``training_db.json``, ``labeled_db.json``, ``db_to_label.json``), and diagnostic plots (``mlmodel_energies.png``, ``mlmodel_energy_gradients.png``, ``aux_mlmodel_energies.png``) for each active-learning iteration 0–8.

The output log follows the same structure as the standard AL tutorial, with one notable difference in the threshold reporting.

Thresholds from pure ML
++++++++++++++++++++++++

At the start of each iteration, the thresholds for uncertainty quantification are reported:

.. code-block::

    New threshold (pure ML): 0.01464623147919774
    New threshold (pure ML): 0.010255664895847439

The label "pure ML" is important: the UQ thresholds are derived from a pure-ML model trained on the initial data pool, *not* from the delta-ML model. Using delta-ML for thresholding would produce thresholds that are too tight (because the corrections are small), causing the AL loop to stop trajectories too aggressively and preventing convergence.

Convergence behavior
++++++++++++++++++++

In this fulvene example, ADL converged in 9 iterations (iterations 0–8), using 2700 CASSCF-labeled geometries. The convergence progression:

+-------------+-------------------+-------------------+
| Iteration   | Training points   | Max convergence   |
+=============+===================+===================+
| 0           | 300               | 2%                |
+-------------+-------------------+-------------------+
| 1           | 600               | 4%                |
+-------------+-------------------+-------------------+
| 2           | 900               | 28%               |
+-------------+-------------------+-------------------+
| 3           | 1200              | 74%               |
+-------------+-------------------+-------------------+
| 4           | 1500              | 84%               |
+-------------+-------------------+-------------------+
| 5           | 1800              | 70%               |
+-------------+-------------------+-------------------+
| 6           | 2100              | 68%               |
+-------------+-------------------+-------------------+
| 7           | 2400              | 82%               |
+-------------+-------------------+-------------------+
| 8           | 2700              | 98% → converged   |
+-------------+-------------------+-------------------+

Note the more monotonic convergence compared to pure ML (which showed dramatic drops between iterations in the :ref:`ATL-NAMD tutorial <tutorial_tsh_al>`). This is a characteristic advantage of delta-learning: the baseline anchors the PES, preventing the catastrophic extrapolation failures that cause convergence regressions in pure ML.

Timing
++++++

The total wall-clock time was approximately 25 hours. The breakdown per iteration shows that **sampling time dominates** in ADL (unlike pure ML where training dominates), because each trajectory propagation step requires a baseline ODM2/MRCI calculation:

.. code-block::

    Iteration 8 takes 16173.67 s
      Labeling time: 1353.63 s
      Training time: 5208.44 s
      Sampling time: 9611.61 s

Analyzing the output
--------------------

The same analysis script from the :ref:`ATL-NAMD tutorial <tutorial_tsh_al>` works here. The only change is the number of trajectories (``ntraj=50``) and the maximum propagation time (60 fs):

.. code-block:: python

    """
    Parse an ADL output log and plot convergence diagnostics.
    Usage: python plot_al_convergence.py train_combined.txt
    """
    import re, sys, matplotlib.pyplot as plt
    import numpy as np

    def parse_al_log(path, ntraj=50, max_time=60.0):
        with open(path) as f:
            text = f.read()
        bounds = [(int(m.group(1)), m.start())
                  for m in re.finditer(r'Active learning iteration (\d+)', text)]
        bounds.append((-1, len(text)))

        iters, max_convs, avg_stops = [], [], []
        for i in range(len(bounds) - 1):
            block = text[bounds[i][1]:bounds[i+1][1]]
            convs = [float(m.group(1))
                     for m in re.finditer(r'The model is ([\d.]+) % converged', block)]
            st_blocks = re.findall(
                r'The surface stopping times are:\n(.*?)The model is', block, re.DOTALL)
            check_avgs = []
            for sb in st_blocks:
                uncertain = [float(x) for x in sb.strip().split('\n') if x.strip()]
                n_uncertain = len(uncertain)
                n_converged = ntraj - n_uncertain
                all_times = uncertain + [max_time] * n_converged
                check_avgs.append(sum(all_times) / len(all_times))
            iters.append(bounds[i][0])
            max_convs.append(max(convs) if convs else 0)
            avg_stops.append(sum(check_avgs) / len(check_avgs) if check_avgs else 0)
        return iters, max_convs, avg_stops

    if __name__ == '__main__':
        path = sys.argv[1] if len(sys.argv) > 1 else 'train_combined.txt'
        iters, max_convs, avg_stops = parse_al_log(path)

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        ax1.plot(iters, max_convs, 'o-', color='#2563eb')
        ax1.set_xlabel('AL iteration')
        ax1.set_ylabel('Max convergence (%)')
        ax1.set_ylim(0, 105)
        ax1.set_title('Convergence rate per iteration')

        ax2.plot(iters, avg_stops, 'o-', color='#dc2626')
        ax2.set_xlabel('AL iteration')
        ax2.set_ylabel('Avg. surface stopping time (fs)')
        ax2.set_title('Average surface stopping time')
        ax2.axhline(60, ls='--', color='gray', label='max propagation time')
        ax2.legend()

        plt.tight_layout()
        plt.savefig('adl_convergence.png', dpi=150)
        plt.show()

When to use delta-learning vs. transfer learning
-------------------------------------------------

+--------------------------------------+------------------------------------------+------------------------------------------+
|                                      | Delta-learning (ADL)                     | Transfer learning (ATL)                  |
+======================================+==========================================+==========================================+
| **Requires baseline method**         | Yes — called at every time step          | No                                       |
+--------------------------------------+------------------------------------------+------------------------------------------+
| **Best for**                         | Large conformational spaces, reactive    | Small/medium molecules, non-reactive     |
|                                      | systems, difficult convergence           | dynamics                                 |
+--------------------------------------+------------------------------------------+------------------------------------------+
| **Data efficiency**                  | Highest (~2× better than ATL)            | Good (~3× better than pure ML)           |
+--------------------------------------+------------------------------------------+------------------------------------------+
| **Wall-clock per trajectory**        | Slower (baseline QM at each step)        | Faster (pure NN inference)               |
+--------------------------------------+------------------------------------------+------------------------------------------+
| **Convergence stability**            | Most stable (monotonic)                  | Good, but can regress                    |
+--------------------------------------+------------------------------------------+------------------------------------------+
| **Pre-trained weights needed**       | No                                       | Yes (OMNI-P2x)                           |
+--------------------------------------+------------------------------------------+------------------------------------------+

**Rule of thumb**: If pure ML or fine-tuning from OMNI-P2x converges within a reasonable number of iterations, use them — the resulting model is faster at inference. If convergence is slow or fails, switch to ADL. For systems with multiple competing reaction channels (like PSB4's four torsional modes), ADL is likely the only viable automated option.

Tips
----

- **Baseline quality matters**: The baseline should  produce qualitatively reasonable PES topology. ODM2/MRCI and AIQM1/MRCI are good choices for organic molecules. A baseline that misses a conical intersection entirely, like TD-DFTB will not help.
- **UQ thresholds from pure ML**: The code automatically computes thresholds from a pure-ML model. If you see the AL loop stopping trajectories too aggressively (all trajectories crash at step 1), the thresholds may be too tight — this is a sign that delta-ML thresholds leaked in.
- **``gap_weight=0.0``**: Unlike pure ML where gap weighting is important, delta-learning works best without it. The baseline already provides the correct gap ordering.
- **``proceed_on_error=True``**: For baseline methods that occasionally fail (SCF convergence), this flag lets the AL loop skip failed points rather than crashing.

Running NAMD with the trained delta model
-----------------------------------------

Once active learning has converged, the final-iteration model files (``mlmodel.pt`` and ``aux_mlmodel.pt``) sit under ``iterations/<N>/``. Copy or rename the main model to something memorable — e.g. ``fulvene_delta_model.pt`` — and you are ready to run production trajectories.

Because the AL loop trains the MS-ANI model on the correction surface, the saved model is loaded just like any standard MS-ANI model. The baseline must be added back at every step by a small wrapper that exposes the same ``predict`` interface that ``surface_hopping_md`` expects.

.. code-block:: python

    import mlatom as ml

    # Baseline (same method and keywords as used during AL)
    odm2 = ml.models.methods(method='ODM2', read_keywords_from_file='mndokw')

    # Trained delta model (main MS-ANI head from the converged AL iteration)
    model = ml.models.msani(model_file='fulvene_delta_model.pt', nstates=2)


    class delta_model_namd():
        def __init__(self, qm_method, ml_model):
            self.qm_method = qm_method
            self.ml_model = ml_model

        def predict(self, molecule=None, nstates=5, current_state=0,
                    calculate_energy=True, calculate_energy_gradients=True):
            # 1. ML prediction of ΔE and ΔE-gradients (stored in molecule.electronic_states)
            self.ml_model.predict(
                molecule=molecule,
                calculate_energy=True,
                calculate_energy_gradients=[True] * nstates,
                current_state=current_state,
                nstates=nstates,
            )
            # 2. Baseline QM calculation on a fresh copy of the geometry
            tempmol = molecule.copy(atomic_labels=['xyz_coordinates'])
            self.qm_method.predict(
                molecule=tempmol,
                calculate_energy=True,
                calculate_energy_gradients=[True] * nstates,
                nstates=nstates,
                current_state=0,
            )
            # 3. Add baseline back to recover absolute energies and gradients
            for istate in range(nstates):
                molecule.electronic_states[istate].energy += tempmol.electronic_states[istate].energy
                molecule.electronic_states[istate].energy_gradients += tempmol.electronic_states[istate].energy_gradients

            molecule.energy = molecule.electronic_states[current_state].energy
            molecule.energy_gradients = molecule.electronic_states[current_state].energy_gradients


    delta_model = delta_model_namd(model, odm2)

    # Initial conditions — typically a larger Wigner ensemble than the AL init pool
    init_cond_db = ml.data.molecular_database.load('init_cond_db_fulvene.json', format='json')

    namd_kwargs = {
        'model': delta_model,
        'time_step': 0.1,                  # fs
        'maximum_propagation_time': 60,    # fs
        'dump_trajectory_interval': 30,
        'filename': 'traj.h5',
        'format': 'h5md',
        'hopping_algorithm': 'LZBL',
        'initial_state': 1,
        'nstates': 2,
        'reduce_kinetic_energy': True,
        'reduce_memory_usage': True,
    }

    dyns = ml.simulations.run_in_parallel(
        molecular_database=init_cond_db[:1000],
        task=ml.namd.surface_hopping_md,
        task_kwargs=namd_kwargs,
        create_and_keep_temp_directories=True,
        proceed_on_error=True,
    )

Notes
+++++

- **Match the baseline exactly.** The keywords file (``mndokw``), method, and active space passed to ``ml.models.methods`` must be identical to what was used during AL labeling, otherwise the correction the model learned no longer matches the baseline.
- **Initial conditions.** The 300-point Wigner pool used during AL is for *training*, not for production trajectories. Generate a fresh, larger ensemble (typically 500–1000 points) and save it to ``init_cond_db_fulvene.json`` before launching the swarm. See the NAMD tutorial for examples.
- **Cost.** Each MD step now performs one ODM2/MRCI call in addition to the ML inference, so trajectories run slower than with a pure-ML model. Use ``run_in_parallel`` and ``proceed_on_error=True`` — occasional baseline SCF failures should not bring down the whole job.
- **Trajectory output.** Trajectories are dumped to HDF5 (``traj.h5``, h5md format) and can be analyzed with the standard MLatom NAMD analysis tools (population plots, hopping geometries, etc.).

References
----------

- Mikołaj Martyka, Joanna Jankowska*, Pavlo O. Dral*. Active Delta-Learning for Surface Hopping: Efficient Fitting of Potential Energy Surfaces. 2026. Preprint on ChemRxiv: https://doi.org/10.26434/chemrxiv.15002308/v1 (2026.04.22). Data and code: `github.com/dralgroup/adl-namd <https://github.com/dralgroup/adl-namd>`__.
- Active learning for NAMD: Martyka, M. et al. *Charting electronic-state manifolds across molecules with multi-state learning and gap-driven dynamics via efficient and robust active learning.* npj Comput. Mater. 11, 132 (2025). `DOI: 10.1038/s41524-025-01636-z <https://doi.org/10.1038/s41524-025-01636-z>`__.
- Delta-learning: Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. *Big Data Meets Quantum Chemistry Approximations: The Δ-Machine Learning Approach.* J. Chem. Theory Comput. 11, 2087–2096 (2015).
- OMNI-P2x: Martyka, M.; Tong, X.-Y.; Jankowska, J.; Dral, P. O. *OMNI-P2x universal neural network potential for excited-state simulations.* Nat. Commun. (2026).