Conversation

@akoumpa (Contributor) commented Dec 3, 2025

No description provided.

@copy-pr-bot (bot) commented Dec 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa changed the title from "deltalake support" to "feat: deltalake dataset support" on Dec 3, 2025
@akoumpa akoumpa linked an issue Dec 3, 2025 that may be closed by this pull request
@floraxhuang floraxhuang commented Dec 10, 2025

Thanks for the PR! I tested it on a Databricks cluster and hit a DeltaProtocolError. The deltalake reader used here isn't compatible with tables that use Deletion Vectors or Column Mapping (both now enabled by default on Databricks). Relevant discussion: delta-io/delta-rs#1094

- Databricks runtime: 15.4 LTS
- deltalake version: 1.2.1
- Error trace:

DeltaProtocolError: The table has set these reader features: {'deletionVectors'} but these are not yet supported by the deltalake reader.
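
For reference, a caller can probe the table's protocol up front and fail with a clearer message. A minimal sketch, assuming a recent deltalake release where `DeltaTable.protocol()` exposes `reader_features` (attribute names may vary by version, and the path is hypothetical):

```python
# Sketch: detect reader features (e.g. deletion vectors, column mapping,
# both on by default in recent Databricks runtimes) before reading, so the
# failure is explicit instead of a DeltaProtocolError mid-load.
from deltalake import DeltaTable

table = DeltaTable("/path/to/delta/table")  # hypothetical table path
features = set(table.protocol().reader_features or [])
unsupported = features & {"deletionVectors", "columnMapping"}
if unsupported:
    raise RuntimeError(f"Unsupported Delta reader features: {unsupported}")
```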

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 3df865b to 0936f94 (January 15, 2026 13:43)
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 99fb5e1 to b990139 (January 15, 2026 15:28)
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 1c75572 to b8efd47 (January 15, 2026 15:58)
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from dd66497 to caa8a6c (January 15, 2026 16:42)
akoumpa and others added 2 commits January 15, 2026 16:43
@akoumpa akoumpa changed the title from "feat: deltalake dataset support" to "feat: databricks deltalake dataset support" on Jan 21, 2026
@akoumpa (Contributor, Author) commented Jan 21, 2026

/ok to test 937cf25

@akoumpa (Contributor, Author) commented Jan 21, 2026

/ok to test 7ca3800

@akoumpa (Contributor, Author) commented Jan 22, 2026

/ok to test 2ce5288

@akoumpa akoumpa marked this pull request as ready for review January 22, 2026 20:01
@akoumpa akoumpa requested review from a team, adil-a and jgerh as code owners January 22, 2026 20:01
@akoumpa akoumpa enabled auto-merge (squash) January 22, 2026 20:01
global _DELTALAKE_AVAILABLE
if _DELTALAKE_AVAILABLE is None:
    try:
        import deltalake  # noqa: F401
Collaborator

Can we clean up the noqa stuff?

Contributor Author

The noqa is necessary; otherwise ruff flags the import as unused (F401), since it exists only to probe availability.
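
For context, the full guarded-import pattern looks roughly like this; a sketch only, the helper name and exact structure in the PR may differ:

```python
_DELTALAKE_AVAILABLE = None


def _deltalake_available() -> bool:
    """Lazily check whether the optional deltalake dependency is installed."""
    global _DELTALAKE_AVAILABLE
    if _DELTALAKE_AVAILABLE is None:
        try:
            import deltalake  # noqa: F401  # imported only to probe availability

            _DELTALAKE_AVAILABLE = True
        except ImportError:
            _DELTALAKE_AVAILABLE = False
    return _DELTALAKE_AVAILABLE
```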

)


class HFDeltaLakeDataset:
Collaborator

This class doesn't implement a select method (same for DeltaLakeDataset). Without it, the user cannot limit the number of data samples that are loaded.
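
A minimal sketch of the requested select(), mirroring `datasets.Dataset.select()` semantics; the class and attribute names here are illustrative, not the PR's implementation:

```python
import copy
from typing import Iterable


class _SelectableDataset:
    """Toy stand-in for HFDeltaLakeDataset, illustrating select() semantics."""

    def __init__(self, rows):
        self._rows = list(rows)

    def select(self, indices: Iterable[int]) -> "_SelectableDataset":
        # Return a shallow copy restricted to the requested sample indices,
        # so callers can cap how many samples get loaded.
        new = copy.copy(self)
        new._rows = [self._rows[i] for i in indices]
        return new

    def __len__(self) -> int:
        return len(self._rows)


assert len(_SelectableDataset(range(100)).select(range(8))) == 8
```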

        self._shard_info = (num_shards, index)
        return self

    def shuffle(self, buffer_size: int = 1000, seed: Optional[int] = None) -> "HFDeltaLakeDataset":
Collaborator

I don't see shuffle being used in the iterator. How is shuffling happening?
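
For reference, streaming datasets usually honor `shuffle(buffer_size=..., seed=...)` with a buffered shuffle inside the iterator, along these lines (a sketch of the standard technique, not this PR's code):

```python
import random
from typing import Iterator, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    source: Iterator[T], buffer_size: int = 1000, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield items in approximately random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            # Swap a random buffered item out for the incoming one.
            idx = rng.randrange(buffer_size)
            buffer[idx], item = item, buffer[idx]
            yield item
    rng.shuffle(buffer)
    yield from buffer
```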

@jgerh jgerh (Contributor) left a comment


Completed tech pubs review and provided some copyedits and suggestions.

@@ -1,3 +1,3 @@
# Dataset Overview: LLM and VLM Datasets in NeMo Automodel

This page summarizes the datasets supported in NeMo Automodel for LLM and VLM and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.

Suggested change
This page summarizes the datasets supported in NeMo Automodel for LLMs and VLMs and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.


- **HellaSwag (completion SFT)**
- Wrapper: `nemo_automodel.components.datasets.llm.hellaswag.HellaSwag`
- Use case: single-turn completion style SFT where a prompt (ctx) is followed by a gold continuation (ending)

Suggested change
- Use case: single-turn, completion-style SFT where a prompt (context) is followed by a gold continuation

- Use case: multi-turn conversations and tool calling in OpenAI chat format
- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Key args:
- `path_or_dataset_id`: path to local file(s) or HuggingFace dataset ID

Suggested change
- `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID

- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Key args:
- `path_or_dataset_id`: path to local file(s) or HuggingFace dataset ID
- `tokenizer`: tokenizer instance (required. Must have chat template support)

Suggested change
- `tokenizer`: tokenizer instance (required; must have chat template support)

```yaml
packed_sequence_size: 8192  # > 0 enables packing
split_across_pack: false
```
Use a collater that pads to an FP8-friendly multiple when training with FP8:

Suggested change
Use a collate function that pads to an FP8-friendly multiple when training with FP8:
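
As an illustration of the suggestion, an FP8-friendly collate function pads the batch's sequence length up to a fixed multiple (16 here). A sketch with assumed field names, not NeMo Automodel's actual collate:

```python
import torch


def fp8_pad_collate(batch, pad_token_id: int = 0, multiple: int = 16):
    """Pad each example's input_ids so the batch length is a multiple of `multiple`."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    padded_len = ((max_len + multiple - 1) // multiple) * multiple
    input_ids = torch.full((len(batch), padded_len), pad_token_id, dtype=torch.long)
    for i, ex in enumerate(batch):
        seq = torch.as_tensor(ex["input_ids"], dtype=torch.long)
        input_ids[i, : seq.numel()] = seq
    return {"input_ids": input_ids}
```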

Comment on lines 7 to 9
* Quick prototyping across diverse instruction datasets
* Schema flexibility without needing codebase changes
* Consistent field names for training loops, regardless of dataset source

Suggested change
* Consistent field names for training loops, regardless of dataset source
- Quick prototyping across diverse instruction datasets
- Schema flexibility without requiring code changes
- Consistent field names for training loops, regardless of dataset source

```python
    break
```

:::note

Suggested change
:::note
:::{note}


### Multi-Node Slurm Configuration

:::note

Suggested change
:::{note}

### Multi-Node Slurm Configuration

:::note
**Note for Multi-Node Training**: When using Hugging Face datasets in multi-node setups, you need shared storage accessible by all nodes. Set the `HF_DATASETS_CACHE` environment variable to point to a shared directory (e.g., `HF_DATASETS_CACHE=/shared/hf_cache`) in the yaml file as shown, to ensure all nodes can access the cached datasets.

Suggested change
**Multi-Node Training**: When using Hugging Face datasets in multi-node setups, you need shared storage accessible by all nodes. Set the `HF_DATASETS_CACHE` environment variable to point to a shared directory (e.g., `HF_DATASETS_CACHE=/shared/hf_cache`) in the yaml file as shown, to ensure all nodes can access the cached datasets.
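
A hypothetical YAML fragment illustrating the note; the key name is an assumption, so consult the recipe schema for the exact spelling:

```yaml
# Point the Hugging Face datasets cache at storage visible to every node.
env_vars:
  HF_DATASETS_CACHE: /shared/hf_cache
```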

    answer_only_loss_mask=False,  # compute loss over full sequence
)

print(remote_ds[0].keys())  # {'context', 'question', 'answer'}

Agent flagged this line: remote_ds is undefined in this local example, so the print would fail.

Suggested change
print(local_ds[0].keys()) # {'question', 'answer'}

Successfully merging this pull request may close this issue: Support Databricks deltatable streaming data.