feat: databricks deltalake dataset support #920
base: main
Conversation
Thanks for the PR! I tested it on a Databricks cluster and encountered a `DeltaProtocolError`. It appears the deltalake reader used here isn't compatible with tables that use Deletion Vectors or Column Mapping (both now enabled by default on Databricks). Relevant discussion: delta-io/delta-rs#1094
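For illustration, a minimal sketch of guarding against this failure mode; the table path is hypothetical, and the assumption is that `DeltaProtocolError` is importable from `deltalake.exceptions` (true for recent deltalake releases):

```python
from deltalake import DeltaTable
from deltalake.exceptions import DeltaProtocolError  # assumption: exception lives here

try:
    table = DeltaTable("/dbfs/path/to/table")  # hypothetical table path
    dataset = table.to_pyarrow_dataset()
except DeltaProtocolError as err:
    # Raised when the table uses reader features the Rust reader doesn't
    # support, e.g. Deletion Vectors or Column Mapping.
    raise RuntimeError(
        "Table uses unsupported Delta reader features; consider disabling "
        "Deletion Vectors / Column Mapping on the table before training"
    ) from err
```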
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Force-pushed 3df865b to 0936f94
Force-pushed 99fb5e1 to b990139
Force-pushed 1c75572 to b8efd47
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Force-pushed dd66497 to caa8a6c
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
/ok to test 937cf25
/ok to test 7ca3800
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
/ok to test 2ce5288
```python
global _DELTALAKE_AVAILABLE
if _DELTALAKE_AVAILABLE is None:
    try:
        import deltalake  # noqa: F401
```
Can we clean up the `noqa` stuff?
The `noqa` is necessary; otherwise ruff will complain about the unused import.
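For context, the import exists only to probe whether the package is installed, which is why ruff's F401 (unused import) fires. A minimal sketch of the full pattern the snippet above implies; the wrapper function name `_have_deltalake` is an assumption, not the PR's API:

```python
_DELTALAKE_AVAILABLE = None  # tri-state: None = not yet probed

def _have_deltalake() -> bool:
    global _DELTALAKE_AVAILABLE
    if _DELTALAKE_AVAILABLE is None:
        try:
            import deltalake  # noqa: F401  (imported only to probe availability)
        except ImportError:
            _DELTALAKE_AVAILABLE = False
        else:
            _DELTALAKE_AVAILABLE = True
    return _DELTALAKE_AVAILABLE
```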
```python
class HFDeltaLakeDataset:
```
This class doesn't implement the `select` method; the same applies to `DeltaLakeDataset`. Without it, the user cannot limit the number of data samples that are loaded.
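For illustration, a minimal sketch of the kind of `select` being asked for, modeled on `datasets.Dataset.select`; everything beyond the class name is hypothetical:

```python
from typing import Iterable, List, Optional

class HFDeltaLakeDataset:
    def __init__(self) -> None:
        self._selected_indices: Optional[List[int]] = None

    def select(self, indices: Iterable[int]) -> "HFDeltaLakeDataset":
        # Restrict iteration to the given sample indices, e.g.
        # ds.select(range(1000)) to cap how many samples get loaded.
        self._selected_indices = list(indices)
        return self
```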
```python
        self._shard_info = (num_shards, index)
        return self

    def shuffle(self, buffer_size: int = 1000, seed: Optional[int] = None) -> "HFDeltaLakeDataset":
```
I don't see `shuffle` being used in the iterator. How is shuffling happening?
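For reference, streaming datasets usually wire the shuffle buffer into the iterator itself; a sketch under the assumption that `__iter__` yields records lazily (the helper name is hypothetical):

```python
import random
from typing import Iterator, Optional

def _shuffled(records: Iterator[dict], buffer_size: int = 1000,
              seed: Optional[int] = None) -> Iterator[dict]:
    # Classic streaming buffer shuffle: fill a buffer, then for each new
    # record emit a random buffered element and replace it, so shuffling
    # actually happens during iteration.
    rng = random.Random(seed)
    buffer: list = []
    for record in records:
        if len(buffer) < buffer_size:
            buffer.append(record)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = record
    rng.shuffle(buffer)  # drain the remainder in random order
    yield from buffer
```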
jgerh left a comment
Completed tech pubs review and provided some copyedits and suggestions.
```
@@ -1,3 +1,3 @@
# Dataset Overview: LLM and VLM Datasets in NeMo Automodel

This page summarizes the datasets supported in NeMo Automodel for LLM and VLM and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
```
Suggested change:
```
This page summarizes the datasets supported in NeMo Automodel for LLMs and VLMs and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
```
```
- **HellaSwag (completion SFT)**
  - Wrapper: `nemo_automodel.components.datasets.llm.hellaswag.HellaSwag`
  - Use case: single-turn completion style SFT where a prompt (ctx) is followed by a gold continuation (ending)
```
Suggested change:
```
- Use case: single-turn, completion-style SFT where a prompt (context) is followed by a gold continuation
```
```
- Use case: multi-turn conversations and tool calling in OpenAI chat format
- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Key args:
  - `path_or_dataset_id`: path to local file(s) or HuggingFace dataset ID
```
Suggested change:
```
- `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID
```
```
- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Key args:
  - `path_or_dataset_id`: path to local file(s) or HuggingFace dataset ID
  - `tokenizer`: tokenizer instance (required. Must have chat template support)
```
Suggested change:
```
- `tokenizer`: tokenizer instance (required; must have chat template support)
```
```yaml
packed_sequence_size: 8192 # > 0 enables packing
split_across_pack: false
```
Use a collater that pads to an FP8-friendly multiple when training with FP8:
Suggested change:
```
Use a collate function that pads to an FP8-friendly multiple when training with FP8:
```
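To make the FP8 note concrete, a minimal sketch of such a collate function; the function name and padding value are assumptions, and the multiple of 16 reflects the common rule of thumb that FP8 GEMMs want dimensions divisible by 16:

```python
import torch
import torch.nn.functional as F

def pad_to_fp8_multiple(batch, multiple: int = 16, pad_id: int = 0) -> torch.Tensor:
    # Pad every 1-D token tensor in `batch` to the smallest shared length
    # that is divisible by `multiple`, then stack into one batch tensor.
    longest = max(t.size(0) for t in batch)
    target = ((longest + multiple - 1) // multiple) * multiple
    return torch.stack(
        [F.pad(t, (0, target - t.size(0)), value=pad_id) for t in batch]
    )
```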
```
* Quick prototyping across diverse instruction datasets
* Schema flexibility without needing codebase changes
* Consistent field names for training loops, regardless of dataset source
```
Suggested change:
```
- Quick prototyping across diverse instruction datasets
- Schema flexibility without requiring code changes
- Consistent field names for training loops, regardless of dataset source
```
```
    break
```

:::note
Suggested change:
```
:::{note}
```
```
### Multi-Node Slurm Configuration

:::note
```
Suggested change:
```
:::{note}
```
```
### Multi-Node Slurm Configuration

:::note
**Note for Multi-Node Training**: When using Hugging Face datasets in multi-node setups, you need shared storage accessible by all nodes. Set the `HF_DATASETS_CACHE` environment variable to point to a shared directory (e.g., `HF_DATASETS_CACHE=/shared/hf_cache`) in the yaml file as shown, to ensure all nodes can access the cached datasets.
```
Suggested change:
```
**Multi-Node Training**: When using Hugging Face datasets in multi-node setups, you need shared storage accessible by all nodes. Set the `HF_DATASETS_CACHE` environment variable to point to a shared directory (e.g., `HF_DATASETS_CACHE=/shared/hf_cache`) in the yaml file as shown, to ensure all nodes can access the cached datasets.
```
```python
    answer_only_loss_mask=False,  # compute loss over full sequence
)

print(remote_ds[0].keys())  # {'context', 'question', 'answer'}
```
Agent flagged this line: `remote_ds` is undefined in this local example, so the `print` would fail.
Suggested change:
```python
print(local_ds[0].keys())  # {'question', 'answer'}
```