Fork Notice: This repository is a fork of EasyLM
Open Cabrita is a comprehensive research archive documenting our systematic investigation into Portuguese language model development. This repository preserves our complete experimental journey, including successful models, failed attempts, and critical insights gained from training across multiple architectures (LLaMA, Gemma) and scales (2B-7B parameters).
Archive Note: This repository serves as a research archive and historical record of our Portuguese LLM experiments conducted between 2023-2024. All methodologies, scripts, and findings are preserved for reproducibility and future research.
Research Paper: Open Cabrita: Challenges and Opportunities for Portuguese Language Models
- Research Summary
- Successful Model
- Experimental Timeline
- Dataset & Preprocessing
- Training Methodology
- Usage Patterns
- Acknowledgments
- License
This archive documents our systematic investigation into Portuguese language model development, spanning 18 months of experimentation (2023-2024). Our research methodology combined theoretical foundations with empirical validation, resulting in both successful models and valuable insights from controlled failures.
Original Research Objectives:
- Primary Goal: Develop high-quality Portuguese language models through systematic scaling experiments
- Scientific Approach: Document the complete experimental process, including failures and their causes
- Scaling Investigation: Explore optimal architectures and scales for Portuguese LLM development
- Community Impact: Provide open-source methodologies to accelerate Portuguese NLP research
Actual Research Outcomes:
- Successful Model: Open Cabrita 3B - A high-performing Portuguese language model
- Unexpected Discovery: Failure analysis revealed data quality as the critical limiting factor
- Architecture Insights: Different model architectures exhibit varying tolerance to data quality issues
- Methodological Contribution: Complete reproducible pipeline and systematic failure documentation
Open Cabrita 3B represents our most successful experimental outcome - a Portuguese language model that achieved excellent performance metrics and stable training characteristics.
| Specification | Value | Notes |
|---|---|---|
| Architecture | Modified LLaMA-3B | Portuguese-optimized adaptations |
| Parameters | 3.0B | Optimal scale for data quality alignment |
| Context Length | 2048 tokens | Standard context window |
| Training Steps | 400K steps | Converged with stable loss |
| Training Duration | ~2 weeks | Single TPU v3-8 pod |
| Final Status | Production Ready | Successfully completed training |
- Training Stability: Achieved smooth loss convergence without instabilities
- Language Quality: Generates coherent Portuguese text with proper grammar and syntax
- Benchmark Performance: Outperforms the base 3B model on Portuguese evaluation tasks
Our research followed a systematic experimental methodology, documenting both successful outcomes and controlled failures for scientific completeness.
Objective: Establish baseline capabilities with smaller-scale models
| Model | Architecture | Parameters | Outcome | Key Insights |
|---|---|---|---|---|
| Open Cabrita 3B | LLaMA-3B | 3.0B | Success | Established optimal configuration |
Findings: Successfully demonstrated the feasibility of Portuguese LLM training when the base model's pretraining data quality aligns with that of the available Portuguese corpus (in our case, OpenLLaMA-3B's pretraining quality matched our Portuguese data quality).
Objective: Investigate scaling properties and architectural variations
| Model | Architecture | Parameters | Outcome | Primary Challenge |
|---|---|---|---|---|
| LLaMA2 7B Variant | LLaMA2-7B | 7.0B | Failed | Convergence instability |
| Gemma 2B | Gemma-2B | 2.0B | Failed | Training instabilities |
| Gemma 7B | Gemma-7B | 7.0B | Failed | Data quality sensitivity |
Critical Discovery Through Failure Analysis: Larger and more sophisticated models exhibited increased sensitivity to what we hypothesize may be quality mismatches between their original pretraining data and our Portuguese corpus, suggesting that data quality alignment, rather than preprocessing quality in isolation, could be a fundamental factor for Portuguese LLM scaling success.
Our systematic scaling experiments unexpectedly revealed data quality as the primary limiting factor, leading us to investigate preprocessing requirements that were not initially the focus of our research.
- Source: MC4 Portuguese corpus (Common Crawl)
- Size: 145GB of Portuguese web text
- Coverage: Comprehensive representation of Portuguese language variants (PT-BR, PT-PT)
- Initial Assessment: Appeared suitable for scaling experiments
- Method: Conventional web text cleaning (standard practice for scaling experiments; a minimal sketch follows this list)
- Expectation: Sufficient for Portuguese LLM scaling
- Reality: Led to training failures in more advanced models (LLaMA2 and Gemma)
- Key Learning: Standard preprocessing proved inadequate for Portuguese scaling
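To make "conventional web text cleaning" concrete, the sketch below shows the kind of lightweight heuristics such a pipeline typically applies: a length filter, a character-composition filter, and exact deduplication. This is an illustration only; the function name, thresholds, and filters are hypothetical and are not taken from our actual preprocessing scripts.

```python
import hashlib

def clean_corpus(documents):
    """Illustrative 'conventional' web-text cleaning: cheap heuristics only.

    `documents` is any iterable of raw text strings, e.g. pages streamed
    from the MC4 Portuguese split. All thresholds are hypothetical.
    """
    seen_hashes = set()
    for text in documents:
        text = text.strip()
        # Drop very short pages, which are mostly navigation boilerplate.
        if len(text.split()) < 50:
            continue
        # Drop pages dominated by non-alphabetic content (tables, markup, spam).
        alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
        if alpha_ratio < 0.8:
            continue
        # Exact deduplication via content hashing.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        yield text
```

Heuristics of this kind remove obvious noise, but they say nothing about how the surviving text compares in quality to the base model's original pretraining data, which is the gap our failure analysis points to.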
Our systematic failures across multiple architectures and scales led us to suspect data quality as the root cause.
The Data Quality Mismatch Hypothesis: Through systematic failure analysis, our best explanation for the observed patterns is that models originally pretrained on higher-quality datasets (LLaMA2, Gemma) may suffer significant knowledge degradation when continued pretraining uses lower-quality data, even if that data is domain-specific. We hypothesize that the quality gap between original pretraining data and Portuguese continued pretraining data caused these models to lose more general knowledge than they gained Portuguese-specific capabilities. While we didn't investigate this deeply, this hypothesis best explains why more sophisticated models failed while simpler models (OpenLLaMA-3B) succeeded.
Our training approach combined established best practices with Portuguese-specific optimizations, systematically validated across multiple experimental runs.
- Base Model: LLaMA-3B with Portuguese adaptations
- Tokenizer: Custom SentencePiece vocabulary (52K tokens) optimized for Portuguese morphology (see the training sketch after this list)
- Position Encoding: RoPE (Rotary Position Embedding) for enhanced context modeling
- Normalization: Layer normalization for training stability
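The custom 52K-token vocabulary mentioned above was built with SentencePiece. Below is a minimal sketch of how such a tokenizer can be trained; the input file name, the BPE model type, and the character-coverage value are assumptions for illustration, not our exact settings.

```python
import sentencepiece as spm

# Train a Portuguese subword vocabulary. `pt_corpus.txt` is a hypothetical
# plain-text file with one document (or sentence) per line.
spm.SentencePieceTrainer.train(
    input="pt_corpus.txt",
    model_prefix="open_cabrita_pt",
    vocab_size=52000,
    model_type="bpe",           # assumption; SentencePiece defaults to unigram
    character_coverage=0.9995,  # keep accented Portuguese characters
)

# Load and inspect the resulting tokenizer.
sp = spm.SentencePieceProcessor(model_file="open_cabrita_pt.model")
print(sp.encode("Bom dia, tudo bem?", out_type=str))
```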
- Precision: bf16 mixed precision (memory efficiency without stability loss)
- Sequence Length: 2048 tokens (optimal context window)
- Global Batch Size: 256 (empirically determined optimal for 3B scale)
- Learning Rate Schedule: Cosine decay with warmup (critical for convergence stability)
- Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.1); a minimal optimizer sketch follows this list
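These hyperparameters map directly onto an Optax optimizer definition, assuming Optax as the optimizer library alongside JAX/Flax. The sketch below is illustrative: the peak learning rate, warmup length, and gradient-clipping norm are placeholders rather than our exact values.

```python
import optax

# Cosine decay with linear warmup; peak LR and step counts are placeholders.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,
    warmup_steps=2_000,
    decay_steps=400_000,
    end_value=3e-5,
)

# AdamW with the betas and weight decay listed above.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),  # assumption: global-norm clipping for stability
    optax.adamw(learning_rate=schedule, b1=0.9, b2=0.95, weight_decay=0.1),
)
```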
- Hardware: Google Cloud TPU v3-8 pods
- Training Framework: JAX/Flax with pjit for distributed training (see the sharding sketch after this list)
- Monitoring: Comprehensive metrics via Weights & Biases
- Checkpointing: Every 5,000 steps with full state recovery capability
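With pjit, parameters and batches are sharded across a device mesh and the compiler handles the communication. The fragment below is a schematic sketch only: the mesh axis names, partition specs, shapes, and the trivial loss function are illustrative, and the exact keyword names (`in_shardings` vs. the older `in_axis_resources`) depend on the JAX version.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.experimental.pjit import pjit
from jax.sharding import Mesh, PartitionSpec as PS

# Hypothetical 1 x N mesh: one model-parallel group, N data-parallel replicas.
devices = np.array(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, axis_names=("mp", "dp"))

def train_step(params, batch):
    # Placeholder computation standing in for the model forward pass and loss.
    return jnp.mean((batch @ params) ** 2)

sharded_step = pjit(
    train_step,
    in_shardings=(PS(None, "mp"), PS("dp", None)),  # params over mp, batch over dp
    out_shardings=PS(),                             # scalar loss, fully replicated
)

with mesh:
    params = jnp.ones((512, 512))
    batch = jnp.ones((256, 512))
    loss = sharded_step(params, batch)
```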
- Gradient Checkpointing: Effective memory reduction without performance impact (see the sketch after this list)
- Data Pipeline Optimization: Prefetching and parallel loading for TPU efficiency
- Learning Rate Tuning: Fine-grained schedule optimization critical for Portuguese data
- Portuguese-Specific Tokenization: Custom vocabulary significantly improved efficiency
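Gradient checkpointing trades recomputation for activation memory. A minimal sketch using `jax.checkpoint` is shown below; the tiny `block` function is a stand-in for a real transformer layer, not code from this repository.

```python
import jax
import jax.numpy as jnp

# jax.checkpoint (a.k.a. jax.remat) discards this block's intermediate
# activations after the forward pass and recomputes them during backprop,
# cutting memory at the cost of extra FLOPs.
@jax.checkpoint
def block(x, w):
    # Stand-in for a transformer layer's computation.
    return jnp.tanh(x @ w)

def loss_fn(w, x):
    return jnp.sum(block(x, w) ** 2)

grads = jax.grad(loss_fn)(jnp.ones((64, 64)), jnp.ones((8, 64)))
```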
- Original Hypothesis: Modern architectures would handle Portuguese scaling better
- Primary Issue: Training instabilities and poor convergence across models
- Our Best Hypothesis: Their high-quality pretraining data made them particularly sensitive to the quality gap with our Portuguese corpus
- Caveat: This remains our best guess based on systematic comparison of outcomes across different base models
- Environment Setup: Use `scripts/tpu_vm_setup.sh`
- Data Preparation: Follow `src/download_and_process_dataset.py`
- Training Execution: Use configurations in `training_scripts/`
- Monitoring: Integrate with provided W&B configurations
This research was enabled by the foundational work of the open-source machine learning community. We acknowledge and thank the following projects and contributors:
- EasyLM Training Framework - young-geng - Provided the robust training infrastructure that enabled our systematic experimentation
- JAX LLaMA Implementation - JAX_llama - High-quality LLaMA implementation in JAX/Flax
- Transformers Library - Hugging Face - JAX/Flax model implementations and utilities
- MLX Utilities - mlxu - Essential JAX utilities for distributed training
- JAXSeq Framework - JAXSeq - Methodological inspiration for systematic LLM research
- Google Cloud Platform - TPU v3-8 access through research credits program
- Weights & Biases - Comprehensive experiment tracking and analysis platform
- Common Crawl Foundation - MC4 Portuguese dataset access
This project is released under the same license terms as the original EasyLM framework to maintain compatibility and enable community contributions.