Conversation

@akoumpa (Contributor) commented Dec 3, 2025

No description provided.

@copy-pr-bot (bot) commented Dec 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa changed the title from "deltalake support" to "feat: deltalake dataset support" on Dec 3, 2025
@akoumpa akoumpa linked an issue Dec 3, 2025 that may be closed by this pull request
@floraxhuang floraxhuang commented Dec 10, 2025

Thanks for the PR! I tested it on a Databricks cluster and hit a DeltaProtocolError. The deltalake reader used here isn't compatible with tables that use Deletion Vectors or Column Mapping (both now enabled by default on Databricks). Relevant discussion: delta-io/delta-rs#1094

- Databricks runtime: 15.4 LTS
- deltalake version: 1.2.1
- Error trace:

DeltaProtocolError: The table has set these reader features: {'deletionVectors'} but these are not yet supported by the deltalake reader.
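
For reference, a caller can probe the table's protocol up front and fail with a clearer message. A minimal sketch, assuming a recent deltalake release where `DeltaTable.protocol()` exposes `reader_features` (attribute names may vary by version, and the path is hypothetical):

```python
# Sketch: detect reader features (e.g. deletion vectors, column mapping,
# both on by default in recent Databricks runtimes) before reading, so the
# failure is explicit instead of a DeltaProtocolError mid-load.
from deltalake import DeltaTable

table = DeltaTable("/path/to/delta/table")  # hypothetical table path
features = set(table.protocol().reader_features or [])
unsupported = features & {"deletionVectors", "columnMapping"}
if unsupported:
    raise RuntimeError(f"Unsupported Delta reader features: {unsupported}")
```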

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 3df865b to 0936f94 (January 15, 2026 13:43)
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 99fb5e1 to b990139 (January 15, 2026 15:28)
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from 1c75572 to b8efd47 (January 15, 2026 15:58)
@akoumpa akoumpa force-pushed the akoumparouli/feat_streaming_deltalake branch from dd66497 to caa8a6c (January 15, 2026 16:42)
akoumpa and others added 2 commits January 15, 2026 16:43
@akoumpa akoumpa changed the title from "feat: deltalake dataset support" to "feat: databricks deltalake dataset support" on Jan 21, 2026
@akoumpa (Contributor, Author) commented Jan 21, 2026

/ok to test 937cf25

@akoumpa (Contributor, Author) commented Jan 21, 2026

/ok to test 7ca3800

@akoumpa (Contributor, Author) commented Jan 22, 2026

/ok to test 2ce5288

@akoumpa akoumpa marked this pull request as ready for review January 22, 2026 20:01
@akoumpa akoumpa requested review from a team, adil-a and jgerh as code owners January 22, 2026 20:01
@akoumpa akoumpa enabled auto-merge (squash) January 22, 2026 20:01
global _DELTALAKE_AVAILABLE
if _DELTALAKE_AVAILABLE is None:
    try:
        import deltalake  # noqa: F401
Collaborator

Can we clean up the noqa stuff?

Contributor Author

The noqa is necessary; otherwise ruff flags the import as unused (F401), since it exists only to probe availability.
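
For context, the full guarded-import pattern looks roughly like this; a sketch only, the helper name and exact structure in the PR may differ:

```python
_DELTALAKE_AVAILABLE = None


def _deltalake_available() -> bool:
    """Lazily check whether the optional deltalake dependency is installed."""
    global _DELTALAKE_AVAILABLE
    if _DELTALAKE_AVAILABLE is None:
        try:
            import deltalake  # noqa: F401  # imported only to probe availability

            _DELTALAKE_AVAILABLE = True
        except ImportError:
            _DELTALAKE_AVAILABLE = False
    return _DELTALAKE_AVAILABLE
```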

)


class HFDeltaLakeDataset:
Collaborator

This class doesn't implement a select method (same for DeltaLakeDataset). Without it, the user cannot limit the number of data samples that are loaded.
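
A minimal sketch of the requested select(), mirroring `datasets.Dataset.select()` semantics; the class and attribute names here are illustrative, not the PR's implementation:

```python
import copy
from typing import Iterable


class _SelectableDataset:
    """Toy stand-in for HFDeltaLakeDataset, illustrating select() semantics."""

    def __init__(self, rows):
        self._rows = list(rows)

    def select(self, indices: Iterable[int]) -> "_SelectableDataset":
        # Return a shallow copy restricted to the requested sample indices,
        # so callers can cap how many samples get loaded.
        new = copy.copy(self)
        new._rows = [self._rows[i] for i in indices]
        return new

    def __len__(self) -> int:
        return len(self._rows)


assert len(_SelectableDataset(range(100)).select(range(8))) == 8
```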

        self._shard_info = (num_shards, index)
        return self

    def shuffle(self, buffer_size: int = 1000, seed: Optional[int] = None) -> "HFDeltaLakeDataset":
Collaborator

I don't see shuffle being used in the iterator. How is shuffling happening?
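
For reference, streaming datasets usually honor `shuffle(buffer_size=..., seed=...)` with a buffered shuffle inside the iterator, along these lines (a sketch of the standard technique, not this PR's code):

```python
import random
from typing import Iterator, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    source: Iterator[T], buffer_size: int = 1000, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield items in approximately random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            # Swap a random buffered item out for the incoming one.
            idx = rng.randrange(buffer_size)
            buffer[idx], item = item, buffer[idx]
            yield item
    rng.shuffle(buffer)
    yield from buffer
```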

@jgerh jgerh (Contributor) left a comment


Completed tech pubs review and provided some copyedits and suggestions.

@@ -1,3 +1,3 @@
# Dataset Overview: LLM and VLM Datasets in NeMo Automodel

This page summarizes the datasets supported in NeMo Automodel for LLM and VLM and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.

Suggested change
This page summarizes the datasets supported in NeMo Automodel for LLMs and VLMs and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.


- **HellaSwag (completion SFT)**
- Wrapper: `nemo_automodel.components.datasets.llm.hellaswag.HellaSwag`
- Use case: single-turn completion style SFT where a prompt (ctx) is followed by a gold continuation (ending)

Suggested change
- Use case: single-turn, completion-style SFT where a prompt (context) is followed by a gold continuation

- Use case: multi-turn conversations and tool calling in OpenAI chat format
- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Key args:
- `path_or_dataset_id`: path to local file(s) or HuggingFace dataset ID

Suggested change
- `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID

- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Key args:
- `path_or_dataset_id`: path to local file(s) or HuggingFace dataset ID
- `tokenizer`: tokenizer instance (required. Must have chat template support)

Suggested change
- `tokenizer`: tokenizer instance (required; must have chat template support)

```yaml
packed_sequence_size: 8192  # > 0 enables packing
split_across_pack: false
```
Use a collater that pads to an FP8-friendly multiple when training with FP8:

Suggested change
Use a collate function that pads to an FP8-friendly multiple when training with FP8:
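
As an illustration of the suggestion, an FP8-friendly collate function pads the batch's sequence length up to a fixed multiple (16 here). A sketch with assumed field names, not NeMo Automodel's actual collate:

```python
import torch


def fp8_pad_collate(batch, pad_token_id: int = 0, multiple: int = 16):
    """Pad each example's input_ids so the batch length is a multiple of `multiple`."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    padded_len = ((max_len + multiple - 1) // multiple) * multiple
    input_ids = torch.full((len(batch), padded_len), pad_token_id, dtype=torch.long)
    for i, ex in enumerate(batch):
        seq = torch.as_tensor(ex["input_ids"], dtype=torch.long)
        input_ids[i, : seq.numel()] = seq
    return {"input_ids": input_ids}
```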

Comment on lines 7 to 9
* Quick prototyping across diverse instruction datasets
* Schema flexibility without needing codebase changes
* Consistent field names for training loops, regardless of dataset source

Suggested change
* Consistent field names for training loops, regardless of dataset source
- Quick prototyping across diverse instruction datasets
- Schema flexibility without requiring code changes
- Consistent field names for training loops, regardless of dataset source

```python
    break
```

:::note

Suggested change
:::note
:::{note}


### Multi-Node Slurm Configuration

:::note

Suggested change
:::{note}

### Multi-Node Slurm Configuration

:::note
**Note for Multi-Node Training**: When using Hugging Face datasets in multi-node setups, you need shared storage accessible by all nodes. Set the `HF_DATASETS_CACHE` environment variable to point to a shared directory (e.g., `HF_DATASETS_CACHE=/shared/hf_cache`) in the yaml file as shown, to ensure all nodes can access the cached datasets.

Suggested change
**Multi-Node Training**: When using Hugging Face datasets in multi-node setups, you need shared storage accessible by all nodes. Set the `HF_DATASETS_CACHE` environment variable to point to a shared directory (e.g., `HF_DATASETS_CACHE=/shared/hf_cache`) in the yaml file as shown, to ensure all nodes can access the cached datasets.
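
A hypothetical YAML fragment illustrating the note; the key name is an assumption, so consult the recipe schema for the exact spelling:

```yaml
# Point the Hugging Face datasets cache at storage visible to every node.
env_vars:
  HF_DATASETS_CACHE: /shared/hf_cache
```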

    answer_only_loss_mask=False,  # compute loss over full sequence
)

print(remote_ds[0].keys())  # {'context', 'question', 'answer'}

Agent flagged this line: remote_ds is undefined in this local example, so the print would fail.

Suggested change
print(local_ds[0].keys()) # {'question', 'answer'}

Successfully merging this pull request may close this issue: Support Databricks deltatable streaming data.