diff --git a/agents/azure-infrastructure-architect.agent.md b/agents/azure-infrastructure-architect.agent.md new file mode 100644 index 00000000..ef062cdb --- /dev/null +++ b/agents/azure-infrastructure-architect.agent.md @@ -0,0 +1,78 @@ +--- +name: Azure Infrastructure Architect +description: "Expert Azure infrastructure architect that designs, reviews, and implements Azure solutions following Well-Architected Framework and Cloud Adoption Framework best practices." +tools: + - codebase + - runCommand + - editFiles +--- + +You are an expert Azure Infrastructure Architect with deep knowledge of: +- Azure Landing Zones and Cloud Adoption Framework (CAF) +- Well-Architected Framework (WAF) - all 5 pillars +- Infrastructure as Code (Bicep, Terraform, Azure Verified Modules) +- CI/CD pipelines for Azure deployments + +## Your Capabilities + +### 1. Design Azure Landing Zones +When asked to design infrastructure: +- Propose management group hierarchy +- Design subscription organization (platform vs. landing zones) +- Recommend networking topology (hub-spoke or Virtual WAN) +- Define identity and governance strategy + +### 2. Review Architectures +When asked to review: +- Assess against all 5 WAF pillars (Reliability, Security, Cost, Operations, Performance) +- Classify findings by severity (Critical, High, Medium, Low) +- Provide specific remediation recommendations +- Prioritize improvements + +### 3. Implement Infrastructure +When asked to implement or deploy: +- Generate Bicep code (preferred) or Terraform +- Use Azure Verified Modules (AVM) where available +- Follow naming conventions and security best practices +- Create deployment pipelines (GitHub Actions or Azure DevOps) + +### 4. Deploy Applications +When asked to deploy an application: +1. **Analyze** the application (language, framework, dependencies) +2. **Design** the target architecture +3. **Generate** IaC for required resources +4. **Create** CI/CD pipeline +5. 
**Provide** deployment commands + +## Security Requirements (Non-Negotiable) +- Never hardcode credentials or secrets +- Always use managed identities for service-to-service auth +- Enable Key Vault with purge protection +- Use private endpoints for PaaS services +- Enforce TLS 1.2 minimum +- Disable storage/Cosmos key access (use RBAC) + +## Deployment Workflow +Always follow this pattern: +1. Preview changes with `az deployment group what-if` +2. Validate before deploying +3. Deploy with explicit resource group targeting +4. Provide Azure Portal links after deployment + +## Response Style +- Be concise and actionable +- Provide code, not just descriptions +- Use tables for comparisons +- Include commands that can be run directly +- Warn about security risks prominently + +## Example Interaction + +**User**: "Deploy a Python Flask API with PostgreSQL to Azure" + +**Your approach**: +1. Confirm requirements (region, environment, scale needs) +2. Generate architecture: App Service + Azure Database for PostgreSQL + Key Vault +3. Create Bicep files with proper security (managed identity, private endpoints) +4. Create GitHub Actions workflow +5. Provide step-by-step deployment commands diff --git a/skills/azure-infra-patterns/SKILL.md b/skills/azure-infra-patterns/SKILL.md new file mode 100644 index 00000000..42af7cc9 --- /dev/null +++ b/skills/azure-infra-patterns/SKILL.md @@ -0,0 +1,202 @@ +--- +name: azure-infra-patterns +description: | + Implementation patterns for Azure infrastructure using Bicep, Terraform, and Azure Verified Modules. 
+ Use when: + (1) Implementing infrastructure-as-code for Azure resources + (2) Choosing between Bicep and Terraform for a project + (3) Using Azure Verified Modules (AVM) or Azure Landing Zone (ALZ) modules + (4) Setting up CI/CD pipelines for infrastructure deployment + (5) Converting architecture designs to deployable code + (6) Implementing security-hardened resource configurations + Triggers: Bicep, Terraform, IaC, infrastructure code, AVM, Azure Verified Modules, + ALZ, Azure Landing Zones, ARM template, HCL, deployment +--- + +# Azure Infrastructure Implementation Patterns + +Transform architecture designs into secure, repeatable infrastructure code. + +## Tool Selection + +| Factor | Bicep | Terraform | +|--------|-------|-----------| +| Azure-native | ✅ First-class | Good (AzureRM/AzAPI) | +| Multi-cloud | ❌ | ✅ | +| State management | Azure handles | Backend required | +| Module ecosystem | AVM | AVM + Registry | +| Learning curve | Lower | Medium | +| Team skills | Azure-focused | Platform engineers | + +**Default choice**: Bicep for Azure-only projects, Terraform for multi-cloud or existing Terraform expertise. + +## Project Structure + +### Bicep Projects +``` +project/ +├── infra/ +│ ├── main.bicep # Entry point +│ ├── main.bicepparam # Parameters +│ ├── modules/ # Custom modules +│ │ ├── networking/ +│ │ ├── compute/ +│ │ └── data/ +│ └── environments/ +│ ├── dev.bicepparam +│ └── prod.bicepparam +``` + +### Terraform Projects +``` +project/ +├── terraform/ +│ ├── main.tf # Root module +│ ├── variables.tf # Input variables +│ ├── outputs.tf # Outputs +│ ├── versions.tf # Provider constraints +│ ├── backend.tf # State backend +│ └── modules/ +│ ├── networking/ +│ └── compute/ +``` + +## Azure Verified Modules (AVM) + +Prefer AVM over custom implementations for production workloads. 
+ +### Bicep AVM Usage +```bicep +module storageAccount 'br/public:avm/res/storage/storage-account:0.9.0' = { + name: 'storage-deployment' + params: { + name: storageAccountName + location: location + skuName: 'Standard_LRS' + kind: 'StorageV2' + publicNetworkAccess: 'Disabled' + networkAcls: { + defaultAction: 'Deny' + } + } +} + +module keyVault 'br/public:avm/res/key-vault/vault:0.6.0' = { + name: 'keyvault-deployment' + params: { + name: keyVaultName + location: location + enableRbacAuthorization: true + enablePurgeProtection: true + } +} +``` + +### Terraform AVM Usage +```hcl +module "storage_account" { + source = "Azure/avm-res-storage-storageaccount/azurerm" + version = "0.1.0" + + name = var.storage_account_name + resource_group_name = azurerm_resource_group.main.name + location = var.location + + public_network_access_enabled = false + network_rules = { + default_action = "Deny" + } +} +``` + +## Security Requirements (Non-Negotiable) + +Every resource must implement: + +| Requirement | Implementation | +|-------------|----------------| +| No hardcoded credentials | Key Vault references | +| Managed identities | System or user-assigned | +| Encryption at rest | Platform or CMK | +| TLS 1.2 minimum | `minTlsVersion: 'TLS1_2'` | +| Private networking | Private endpoints | +| RBAC authorization | `enableRbacAuthorization: true` | + +### Critical Security Settings +```bicep +// Storage - NEVER allow public access +resource storage 'Microsoft.Storage/storageAccounts@2023-01-01' = { + properties: { + allowBlobPublicAccess: false + allowSharedKeyAccess: false // Use RBAC + minimumTlsVersion: 'TLS1_2' + supportsHttpsTrafficOnly: true + } +} + +// Key Vault - NEVER disable purge protection +resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = { + properties: { + enablePurgeProtection: true // NEVER false + enableRbacAuthorization: true + publicNetworkAccess: 'Disabled' + } +} + +// Container Registry - NEVER enable anonymous pull +resource acr 
'Microsoft.ContainerRegistry/registries@2023-07-01' = { + properties: { + anonymousPullEnabled: false + adminUserEnabled: false + } +} +``` + +## Deployment Workflow + +### Bicep Deployment +```bash +# 1. Validate +az bicep build --file infra/main.bicep + +# 2. Preview (ALWAYS before deploy) +az deployment group what-if \ + --resource-group rg-prod \ + --template-file infra/main.bicep \ + --parameters infra/environments/prod.bicepparam + +# 3. Deploy +az deployment group create \ + --resource-group rg-prod \ + --template-file infra/main.bicep \ + --parameters infra/environments/prod.bicepparam +``` + +### Terraform Deployment +```bash +# 1. Format and validate +terraform fmt -recursive +terraform validate + +# 2. Plan (ALWAYS before apply) +terraform plan -out=tfplan + +# 3. Apply +terraform apply tfplan +``` + +### Azure Developer CLI (azd) +```bash +# Preview +azd provision --preview + +# Deploy infrastructure and application +azd up +``` + +## References + +- **Bicep patterns**: See [references/bicep.md](references/bicep.md) +- **Terraform patterns**: See [references/terraform.md](references/terraform.md) +- **CI/CD pipelines**: See [references/cicd.md](references/cicd.md) +- **Naming conventions**: See [references/naming.md](references/naming.md) diff --git a/skills/azure-infra-patterns/references/bicep.md b/skills/azure-infra-patterns/references/bicep.md new file mode 100644 index 00000000..00ee1d4a --- /dev/null +++ b/skills/azure-infra-patterns/references/bicep.md @@ -0,0 +1,306 @@ +# Bicep Implementation Patterns + +## Table of Contents +1. [Module Patterns](#module-patterns) +2. [Parameter Patterns](#parameter-patterns) +3. [Conditional Deployment](#conditional-deployment) +4. [Loops and Collections](#loops-and-collections) +5. [Cross-Resource References](#cross-resource-references) +6. 
[Common Resource Patterns](#common-resource-patterns) + +## Module Patterns + +### Module Definition +```bicep +// modules/storage/storageAccount.bicep +@description('Storage account name') +param storageAccountName string + +@description('Location for resources') +param location string = resourceGroup().location + +@allowed(['Standard_LRS', 'Standard_GRS', 'Standard_ZRS']) +param sku string = 'Standard_LRS' + +@description('Tags to apply') +param tags object = {} + +resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' = { + name: storageAccountName + location: location + tags: tags + sku: { name: sku } + kind: 'StorageV2' + properties: { + minimumTlsVersion: 'TLS1_2' + supportsHttpsTrafficOnly: true + allowBlobPublicAccess: false + allowSharedKeyAccess: false + networkAcls: { + defaultAction: 'Deny' + bypass: 'AzureServices' + } + } +} + +output storageAccountId string = storageAccount.id +output primaryEndpoints object = storageAccount.properties.primaryEndpoints +``` + +### Module Consumption +```bicep +// main.bicep +module storage 'modules/storage/storageAccount.bicep' = { + name: 'storage-deployment' + params: { + storageAccountName: 'st${uniqueString(resourceGroup().id)}' + location: location + sku: 'Standard_ZRS' + tags: tags + } +} + +// Use output +output blobEndpoint string = storage.outputs.primaryEndpoints.blob +``` + +### Azure Verified Modules +```bicep +// Prefer AVM over custom modules +module storageAccount 'br/public:avm/res/storage/storage-account:0.9.0' = { + name: 'storage-deployment' + params: { + name: storageAccountName + location: location + } +} +``` + +## Parameter Patterns + +### Parameter File (.bicepparam) +```bicep +// environments/prod.bicepparam +using '../main.bicep' + +param environment = 'prod' +param location = 'eastus' +param tags = { + Environment: 'Production' + CostCenter: 'CC-12345' + Owner: 'platform@company.com' +} +``` + +### Secure Parameters +```bicep +@secure() +param adminPassword string + +// 
Reference from Key Vault in parameter file +param adminPassword = az.getSecret('', '', '', '') +``` + +### Parameter Validation +```bicep +@minLength(3) +@maxLength(24) +param storageAccountName string + +@allowed(['dev', 'staging', 'prod']) +param environment string + +@minValue(1) +@maxValue(10) +param instanceCount int = 2 +``` + +## Conditional Deployment + +### Resource Conditions +```bicep +param deployAppInsights bool = true + +resource appInsights 'Microsoft.Insights/components@2020-02-02' = if (deployAppInsights) { + name: 'appi-${workload}' + location: location + kind: 'web' + properties: { + Application_Type: 'web' + } +} + +// Conditional output +output appInsightsKey string = deployAppInsights ? appInsights.properties.InstrumentationKey : '' +``` + +### Environment-Based Configuration +```bicep +var skuConfig = { + dev: { name: 'B1', capacity: 1 } + staging: { name: 'S1', capacity: 2 } + prod: { name: 'P1v3', capacity: 3 } +} + +resource appServicePlan 'Microsoft.Web/serverfarms@2023-01-01' = { + name: 'asp-${workload}-${environment}' + location: location + sku: skuConfig[environment] +} +``` + +## Loops and Collections + +### Array Loop +```bicep +param storageAccounts array = [ + { name: 'stdata', sku: 'Standard_LRS' } + { name: 'stlogs', sku: 'Standard_GRS' } +] + +resource storage 'Microsoft.Storage/storageAccounts@2023-01-01' = [for account in storageAccounts: { + name: '${account.name}${uniqueString(resourceGroup().id)}' + location: location + sku: { name: account.sku } + kind: 'StorageV2' +}] +``` + +### Index Loop +```bicep +param subnetCount int = 3 + +resource subnets 'Microsoft.Network/virtualNetworks/subnets@2023-05-01' = [for i in range(0, subnetCount): { + name: 'snet-${i}' + properties: { + addressPrefix: '10.0.${i}.0/24' + } +}] +``` + +### Module Loop +```bicep +param webApps array = ['api', 'web', 'admin'] + +module apps 'modules/webapp.bicep' = [for app in webApps: { + name: 'deploy-${app}' + params: { + appName: 
'app-${workload}-${app}-${environment}' + location: location + } +}] +``` + +## Cross-Resource References + +### Same Template +```bicep +resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = { + name: keyVaultName + // ... +} + +resource secret 'Microsoft.KeyVault/vaults/secrets@2023-07-01' = { + parent: keyVault + name: 'mySecret' + properties: { + value: secretValue + } +} +``` + +### Existing Resources +```bicep +// Same resource group +resource existingStorage 'Microsoft.Storage/storageAccounts@2023-01-01' existing = { + name: 'stexisting' +} + +// Different resource group +resource existingVnet 'Microsoft.Network/virtualNetworks@2023-05-01' existing = { + name: 'vnet-shared' + scope: resourceGroup('shared-networking-rg') +} + +// Different subscription +resource existingKeyVault 'Microsoft.KeyVault/vaults@2023-07-01' existing = { + name: 'kv-central' + scope: resourceGroup('sub-id', 'central-rg') +} +``` + +## Common Resource Patterns + +### Web App with Managed Identity +```bicep +resource webApp 'Microsoft.Web/sites@2023-01-01' = { + name: webAppName + location: location + identity: { type: 'SystemAssigned' } + properties: { + serverFarmId: appServicePlan.id + httpsOnly: true + siteConfig: { + minTlsVersion: '1.2' + ftpsState: 'Disabled' + alwaysOn: true + } + } +} +``` + +### Private Endpoint +```bicep +resource privateEndpoint 'Microsoft.Network/privateEndpoints@2023-05-01' = { + name: 'pep-${resourceName}' + location: location + properties: { + subnet: { id: subnetId } + privateLinkServiceConnections: [ + { + name: 'plsc-${resourceName}' + properties: { + privateLinkServiceId: targetResourceId + groupIds: ['blob'] + } + } + ] + } +} +``` + +### RBAC Role Assignment +```bicep +resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = { + name: guid(resourceGroup().id, principalId, roleDefinitionId) + scope: targetResource + properties: { + roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 
roleDefinitionId) + principalId: principalId + principalType: 'ServicePrincipal' + } +} +``` + +### Diagnostic Settings +```bicep +resource diagnosticSettings 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = { + name: 'diag-${resourceName}' + scope: targetResource + properties: { + workspaceId: logAnalyticsWorkspaceId + logs: [ + { + categoryGroup: 'allLogs' + enabled: true + } + ] + metrics: [ + { + category: 'AllMetrics' + enabled: true + } + ] + } +} +``` diff --git a/skills/azure-infra-patterns/references/cicd.md b/skills/azure-infra-patterns/references/cicd.md new file mode 100644 index 00000000..4606fdae --- /dev/null +++ b/skills/azure-infra-patterns/references/cicd.md @@ -0,0 +1,408 @@ +# CI/CD Pipeline Patterns + +## Table of Contents +1. [GitHub Actions](#github-actions) +2. [Azure DevOps Pipelines](#azure-devops-pipelines) +3. [Authentication Patterns](#authentication-patterns) +4. [Deployment Strategies](#deployment-strategies) + +## GitHub Actions + +### Bicep Deployment Workflow +```yaml +name: Deploy Infrastructure + +on: + push: + branches: [main] + paths: ['infra/**'] + pull_request: + branches: [main] + paths: ['infra/**'] + +permissions: + id-token: write + contents: read + pull-requests: write + +env: + RESOURCE_GROUP: rg-myapp-prod + +jobs: + validate: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Azure Login + uses: azure/login@v2 + with: + client-id: ${{ vars.AZURE_CLIENT_ID }} + tenant-id: ${{ vars.AZURE_TENANT_ID }} + subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }} + + - name: Validate Bicep + run: az bicep build --file infra/main.bicep + + - name: What-If (PR only) + if: github.event_name == 'pull_request' + run: | + az deployment group what-if \ + --resource-group ${{ env.RESOURCE_GROUP }} \ + --template-file infra/main.bicep \ + --parameters infra/main.bicepparam + + deploy: + needs: validate + if: github.ref == 'refs/heads/main' + runs-on: ubuntu-latest + environment: production + + steps: + 
- uses: actions/checkout@v4 + + - name: Azure Login + uses: azure/login@v2 + with: + client-id: ${{ vars.AZURE_CLIENT_ID }} + tenant-id: ${{ vars.AZURE_TENANT_ID }} + subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }} + + - name: Deploy + run: | + az deployment group create \ + --resource-group ${{ env.RESOURCE_GROUP }} \ + --template-file infra/main.bicep \ + --parameters infra/main.bicepparam +``` + +### Terraform Deployment Workflow +```yaml +name: Terraform Deploy + +on: + push: + branches: [main] + paths: ['terraform/**'] + pull_request: + branches: [main] + paths: ['terraform/**'] + +permissions: + id-token: write + contents: read + pull-requests: write + +env: + TF_VERSION: '1.6.0' + WORKING_DIR: './terraform' + +jobs: + plan: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Setup Terraform + uses: hashicorp/setup-terraform@v3 + with: + terraform_version: ${{ env.TF_VERSION }} + + - name: Azure Login + uses: azure/login@v2 + with: + client-id: ${{ vars.AZURE_CLIENT_ID }} + tenant-id: ${{ vars.AZURE_TENANT_ID }} + subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }} + + - name: Terraform Init + working-directory: ${{ env.WORKING_DIR }} + run: terraform init + + - name: Terraform Validate + working-directory: ${{ env.WORKING_DIR }} + run: terraform validate + + - name: Terraform Plan + working-directory: ${{ env.WORKING_DIR }} + run: terraform plan -out=tfplan + + - name: Upload Plan + uses: actions/upload-artifact@v4 + with: + name: tfplan + path: ${{ env.WORKING_DIR }}/tfplan + + apply: + needs: plan + if: github.ref == 'refs/heads/main' + runs-on: ubuntu-latest + environment: production + + steps: + - uses: actions/checkout@v4 + + - name: Setup Terraform + uses: hashicorp/setup-terraform@v3 + with: + terraform_version: ${{ env.TF_VERSION }} + + - name: Azure Login + uses: azure/login@v2 + with: + client-id: ${{ vars.AZURE_CLIENT_ID }} + tenant-id: ${{ vars.AZURE_TENANT_ID }} + subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }} + + 
- name: Download Plan + uses: actions/download-artifact@v4 + with: + name: tfplan + path: ${{ env.WORKING_DIR }} + + - name: Terraform Init + working-directory: ${{ env.WORKING_DIR }} + run: terraform init + + - name: Terraform Apply + working-directory: ${{ env.WORKING_DIR }} + run: terraform apply -auto-approve tfplan +``` + +## Azure DevOps Pipelines + +### Bicep Pipeline +```yaml +trigger: + branches: + include: + - main + paths: + include: + - infra/** + +pool: + vmImage: 'ubuntu-latest' + +variables: + azureServiceConnection: 'azure-prod' + resourceGroup: 'rg-myapp-prod' + +stages: + - stage: Validate + jobs: + - job: ValidateBicep + steps: + - task: AzureCLI@2 + displayName: 'Validate Bicep' + inputs: + azureSubscription: $(azureServiceConnection) + scriptType: 'bash' + scriptLocation: 'inlineScript' + inlineScript: | + az bicep build --file infra/main.bicep + + - task: AzureCLI@2 + displayName: 'What-If' + inputs: + azureSubscription: $(azureServiceConnection) + scriptType: 'bash' + scriptLocation: 'inlineScript' + inlineScript: | + az deployment group what-if \ + --resource-group $(resourceGroup) \ + --template-file infra/main.bicep + + - stage: Deploy + dependsOn: Validate + condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main')) + jobs: + - deployment: DeployInfra + environment: 'production' + strategy: + runOnce: + deploy: + steps: + - checkout: self + + - task: AzureCLI@2 + displayName: 'Deploy Bicep' + inputs: + azureSubscription: $(azureServiceConnection) + scriptType: 'bash' + scriptLocation: 'inlineScript' + inlineScript: | + az deployment group create \ + --resource-group $(resourceGroup) \ + --template-file infra/main.bicep +``` + +### Terraform Pipeline +```yaml +trigger: + branches: + include: + - main + paths: + include: + - terraform/** + +pool: + vmImage: 'ubuntu-latest' + +variables: + terraformVersion: '1.6.0' + azureServiceConnection: 'azure-prod' + workingDirectory: 
'$(System.DefaultWorkingDirectory)/terraform' + +stages: + - stage: Plan + jobs: + - job: TerraformPlan + steps: + - task: TerraformInstaller@1 + inputs: + terraformVersion: $(terraformVersion) + + - task: TerraformTaskV4@4 + displayName: 'Terraform Init' + inputs: + provider: 'azurerm' + command: 'init' + workingDirectory: $(workingDirectory) + backendServiceArm: $(azureServiceConnection) + + - task: TerraformTaskV4@4 + displayName: 'Terraform Plan' + inputs: + provider: 'azurerm' + command: 'plan' + workingDirectory: $(workingDirectory) + environmentServiceNameAzureRM: $(azureServiceConnection) + commandOptions: '-out=tfplan' + + - publish: $(workingDirectory)/tfplan + artifact: tfplan + + - stage: Apply + dependsOn: Plan + condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main')) + jobs: + - deployment: TerraformApply + environment: 'production' + strategy: + runOnce: + deploy: + steps: + - checkout: self + + - download: current + artifact: tfplan + + - task: TerraformInstaller@1 + inputs: + terraformVersion: $(terraformVersion) + + - task: TerraformTaskV4@4 + displayName: 'Terraform Init' + inputs: + provider: 'azurerm' + command: 'init' + workingDirectory: $(workingDirectory) + backendServiceArm: $(azureServiceConnection) + + - task: TerraformTaskV4@4 + displayName: 'Terraform Apply' + inputs: + provider: 'azurerm' + command: 'apply' + workingDirectory: $(workingDirectory) + environmentServiceNameAzureRM: $(azureServiceConnection) + commandOptions: '$(Pipeline.Workspace)/tfplan/tfplan' +``` + +## Authentication Patterns + +### GitHub Actions - Federated Credentials (OIDC) +```yaml +# Recommended - no secrets to manage +permissions: + id-token: write + contents: read + +- name: Azure Login + uses: azure/login@v2 + with: + client-id: ${{ vars.AZURE_CLIENT_ID }} + tenant-id: ${{ vars.AZURE_TENANT_ID }} + subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }} +``` + +### Service Principal Setup for OIDC +```bash +# Create service principal 
+az ad sp create-for-rbac --name "github-actions-sp" --role contributor \ + --scopes /subscriptions/$SUBSCRIPTION_ID + +# Add federated credential +az ad app federated-credential create \ + --id $APP_ID \ + --parameters '{ + "name": "github-main", + "issuer": "https://token.actions.githubusercontent.com", + "subject": "repo:org/repo:ref:refs/heads/main", + "audiences": ["api://AzureADTokenExchange"] + }' +``` + +## Deployment Strategies + +### Multi-Environment Pipeline +```yaml +jobs: + deploy-dev: + uses: ./.github/workflows/deploy.yml + with: + environment: dev + resource-group: rg-myapp-dev + secrets: inherit + + deploy-staging: + needs: deploy-dev + uses: ./.github/workflows/deploy.yml + with: + environment: staging + resource-group: rg-myapp-staging + secrets: inherit + + deploy-prod: + needs: deploy-staging + uses: ./.github/workflows/deploy.yml + with: + environment: prod + resource-group: rg-myapp-prod + secrets: inherit +``` + +### Blue-Green with Slots +```yaml +- name: Deploy to Staging Slot + run: | + az webapp deployment source config-zip \ + --name app-myapp \ + --resource-group rg-myapp \ + --slot staging \ + --src app.zip + +- name: Verify Staging + run: | + curl -f https://app-myapp-staging.azurewebsites.net/health + +- name: Swap Slots + run: | + az webapp deployment slot swap \ + --name app-myapp \ + --resource-group rg-myapp \ + --slot staging \ + --target-slot production +``` diff --git a/skills/azure-infra-patterns/references/naming.md b/skills/azure-infra-patterns/references/naming.md new file mode 100644 index 00000000..d4c8803d --- /dev/null +++ b/skills/azure-infra-patterns/references/naming.md @@ -0,0 +1,131 @@ +# Azure Resource Naming Conventions + +## Standard Format +``` +{resource-type}-{workload}-{environment}-{region}-{instance} +``` + +## Resource-Specific Prefixes + +| Resource Type | Prefix | Example | +|--------------|--------|---------| +| Resource Group | rg | rg-webapp-prod-eastus | +| Storage Account | st | 
stwebappprodeus001 | +| Key Vault | kv | kv-webapp-prod-eus | +| App Service | app | app-webapp-prod-eastus | +| App Service Plan | asp | asp-webapp-prod-eastus | +| Function App | func | func-processor-prod-eastus | +| Container Registry | cr | crwebappprod | +| Container App | ca | ca-api-prod-eastus | +| Container App Environment | cae | cae-platform-prod-eastus | +| Virtual Network | vnet | vnet-hub-prod-eastus | +| Subnet | snet | snet-web-prod-eastus | +| Network Security Group | nsg | nsg-web-prod-eastus | +| Application Gateway | agw | agw-webapp-prod-eastus | +| Load Balancer | lb | lb-webapp-prod-eastus | +| Public IP | pip | pip-agw-prod-eastus | +| Private Endpoint | pep | pep-storage-prod-eastus | +| Virtual Machine | vm | vm-jumpbox-prod-eastus | +| Virtual Machine Scale Set | vmss | vmss-web-prod-eastus | +| Cosmos DB | cosmos | cosmos-webapp-prod-eastus | +| SQL Server | sql | sql-webapp-prod-eastus | +| SQL Database | sqldb | sqldb-users-prod | +| Log Analytics Workspace | log | log-platform-prod-eastus | +| Application Insights | appi | appi-webapp-prod-eastus | +| Managed Identity | id | id-webapp-prod-eastus | +| Azure Kubernetes Service | aks | aks-platform-prod-eastus | + +## Naming Constraints + +### Storage Account +- 3-24 characters +- Lowercase letters and numbers only +- Globally unique +- Pattern: `st{workload}{env}{region}{instance}` + +### Key Vault +- 3-24 characters +- Alphanumeric and hyphens +- Start with letter +- Globally unique + +### Container Registry +- 5-50 characters +- Alphanumeric only +- Globally unique + +### Resource Group +- 1-90 characters +- Alphanumeric, underscores, hyphens, periods, parentheses +- Cannot end with period + +## Bicep Naming Variables +```bicep +var nameSuffix = '${workload}-${environment}-${location}' + +var resourceNames = { + resourceGroup: 'rg-${nameSuffix}' + storageAccount: 'st${replace(nameSuffix, '-', '')}${uniqueString(resourceGroup().id)}' + keyVault: 'kv-${take(nameSuffix, 
17)}-${uniqueString(resourceGroup().id)}' + appService: 'app-${nameSuffix}' + appServicePlan: 'asp-${nameSuffix}' +} +``` + +## Terraform Naming Variables +```hcl +locals { + name_suffix = "${var.workload}-${var.environment}-${var.location}" + + resource_names = { + resource_group = "rg-${local.name_suffix}" + storage_account = "st${replace(local.name_suffix, "-", "")}${random_string.suffix.result}" + key_vault = "kv-${substr(local.name_suffix, 0, 17)}-${random_string.suffix.result}" + app_service = "app-${local.name_suffix}" + app_service_plan = "asp-${local.name_suffix}" + } +} + +resource "random_string" "suffix" { + length = 4 + special = false + upper = false +} +``` + +## Region Abbreviations + +| Region | Abbreviation | +|--------|-------------| +| eastus | eus | +| eastus2 | eus2 | +| westus | wus | +| westus2 | wus2 | +| westus3 | wus3 | +| centralus | cus | +| northcentralus | ncus | +| southcentralus | scus | +| westeurope | weu | +| northeurope | neu | +| uksouth | uks | +| ukwest | ukw | +| southeastasia | sea | +| eastasia | ea | +| australiaeast | aue | +| australiasoutheast | ause | +| japaneast | jpe | +| japanwest | jpw | +| canadacentral | cac | +| canadaeast | cae | +| brazilsouth | brs | + +## Environment Abbreviations + +| Environment | Abbreviation | +|-------------|-------------| +| Development | dev | +| Testing | test | +| Staging | staging | +| Production | prod | +| Sandbox | sbx | +| Disaster Recovery | dr | diff --git a/skills/azure-infra-patterns/references/terraform.md b/skills/azure-infra-patterns/references/terraform.md new file mode 100644 index 00000000..70cbc64d --- /dev/null +++ b/skills/azure-infra-patterns/references/terraform.md @@ -0,0 +1,358 @@ +# Terraform Implementation Patterns + +## Table of Contents +1. [Provider Configuration](#provider-configuration) +2. [State Management](#state-management) +3. [Module Patterns](#module-patterns) +4. [Variable Patterns](#variable-patterns) +5. 
[Common Resource Patterns](#common-resource-patterns) +6. [Data Sources](#data-sources) + +## Provider Configuration + +### AzureRM Provider +```hcl +# versions.tf +terraform { + required_version = ">= 1.5.0" + + required_providers { + azurerm = { + source = "hashicorp/azurerm" + version = "~> 3.85" + } + azapi = { + source = "azure/azapi" + version = "~> 1.10" + } + } +} + +# main.tf +provider "azurerm" { + features { + key_vault { + purge_soft_delete_on_destroy = false + recover_soft_deleted_key_vaults = true + } + resource_group { + prevent_deletion_if_contains_resources = true + } + } +} +``` + +### Multiple Subscriptions +```hcl +provider "azurerm" { + alias = "connectivity" + subscription_id = var.connectivity_subscription_id + features {} +} + +provider "azurerm" { + alias = "identity" + subscription_id = var.identity_subscription_id + features {} +} + +resource "azurerm_virtual_network" "hub" { + provider = azurerm.connectivity + # ... +} +``` + +## State Management + +### Azure Storage Backend +```hcl +# backend.tf +terraform { + backend "azurerm" { + resource_group_name = "rg-terraform-state" + storage_account_name = "stterraformstate" + container_name = "tfstate" + key = "prod.terraform.tfstate" + use_azuread_auth = true + } +} +``` + +### State Initialization +```bash +# Initialize with backend config +terraform init \ + -backend-config="storage_account_name=stterraformstate" \ + -backend-config="container_name=tfstate" \ + -backend-config="key=prod.tfstate" +``` + +## Module Patterns + +### Module Structure +``` +modules/ +└── storage/ + ├── main.tf + ├── variables.tf + ├── outputs.tf + └── versions.tf +``` + +### Module Definition +```hcl +# modules/storage/main.tf +resource "azurerm_storage_account" "this" { + name = var.name + resource_group_name = var.resource_group_name + location = var.location + account_tier = var.account_tier + account_replication_type = var.replication_type + + min_tls_version = "TLS1_2" + https_traffic_only_enabled = true + 
allow_nested_items_to_be_public = false + shared_access_key_enabled = false + + network_rules { + default_action = "Deny" + bypass = ["AzureServices"] + } + + tags = var.tags +} + +# modules/storage/variables.tf +variable "name" { + type = string + description = "Storage account name" +} + +variable "resource_group_name" { + type = string + description = "Resource group name" +} + +variable "location" { + type = string + description = "Azure region" +} + +variable "account_tier" { + type = string + default = "Standard" +} + +variable "replication_type" { + type = string + default = "LRS" +} + +variable "tags" { + type = map(string) + default = {} +} + +# modules/storage/outputs.tf +output "id" { + value = azurerm_storage_account.this.id +} + +output "primary_blob_endpoint" { + value = azurerm_storage_account.this.primary_blob_endpoint +} +``` + +### Module Usage +```hcl +module "storage" { + source = "./modules/storage" + + name = "st${var.workload}${var.environment}" + resource_group_name = azurerm_resource_group.main.name + location = var.location + replication_type = "ZRS" + tags = local.tags +} +``` + +### Azure Verified Modules +```hcl +module "storage_account" { + source = "Azure/avm-res-storage-storageaccount/azurerm" + version = "0.1.0" + + name = var.storage_account_name + resource_group_name = azurerm_resource_group.main.name + location = var.location +} +``` + +## Variable Patterns + +### Variable Definitions +```hcl +# variables.tf +variable "environment" { + type = string + description = "Environment name" + + validation { + condition = contains(["dev", "staging", "prod"], var.environment) + error_message = "Environment must be dev, staging, or prod." 
+ } +} + +variable "location" { + type = string + description = "Azure region" + default = "eastus" +} + +variable "tags" { + type = map(string) + description = "Tags to apply to resources" + default = {} +} + +variable "db_password" { + type = string + description = "Database password" + sensitive = true +} +``` + +### Local Values +```hcl +locals { + name_prefix = "${var.workload}-${var.environment}-${var.location}" + + resource_names = { + resource_group = "rg-${local.name_prefix}" + storage_account = "st${replace(local.name_prefix, "-", "")}" + key_vault = "kv-${substr(local.name_prefix, 0, 17)}" + } + + tags = merge(var.tags, { + Environment = var.environment + ManagedBy = "Terraform" + }) +} +``` + +### Variable Files +```hcl +# environments/prod.tfvars +environment = "prod" +location = "eastus" + +tags = { + CostCenter = "CC-12345" + Owner = "platform@company.com" +} +``` + +## Common Resource Patterns + +### Resource Group +```hcl +resource "azurerm_resource_group" "main" { + name = "rg-${var.workload}-${var.environment}" + location = var.location + tags = local.tags +} +``` + +### Key Vault +```hcl +resource "azurerm_key_vault" "main" { + name = "kv-${var.workload}-${var.environment}" + location = azurerm_resource_group.main.location + resource_group_name = azurerm_resource_group.main.name + tenant_id = data.azurerm_client_config.current.tenant_id + sku_name = "standard" + + enable_rbac_authorization = true + purge_protection_enabled = true # NEVER set to false + soft_delete_retention_days = 90 + public_network_access_enabled = false + + network_acls { + default_action = "Deny" + bypass = "AzureServices" + } + + tags = local.tags +} +``` + +### Role Assignment +```hcl +resource "azurerm_role_assignment" "kv_secrets_user" { + scope = azurerm_key_vault.main.id + role_definition_name = "Key Vault Secrets User" + principal_id = azurerm_linux_web_app.main.identity[0].principal_id +} +``` + +### Private Endpoint +```hcl +resource "azurerm_private_endpoint" 
"storage" { + name = "pep-${azurerm_storage_account.main.name}" + location = azurerm_resource_group.main.location + resource_group_name = azurerm_resource_group.main.name + subnet_id = azurerm_subnet.private_endpoints.id + + private_service_connection { + name = "psc-storage" + private_connection_resource_id = azurerm_storage_account.main.id + subresource_names = ["blob"] + is_manual_connection = false + } + + private_dns_zone_group { + name = "dns-zone-group" + private_dns_zone_ids = [azurerm_private_dns_zone.blob.id] + } + + tags = local.tags +} +``` + +## Data Sources + +### Current Context +```hcl +data "azurerm_client_config" "current" {} + +data "azurerm_subscription" "current" {} +``` + +### Existing Resources +```hcl +data "azurerm_resource_group" "existing" { + name = "rg-shared-services" +} + +data "azurerm_key_vault" "shared" { + name = "kv-shared-secrets" + resource_group_name = data.azurerm_resource_group.existing.name +} + +data "azurerm_key_vault_secret" "db_password" { + name = "db-password" + key_vault_id = data.azurerm_key_vault.shared.id +} +``` + +### Built-in Role IDs +```hcl +locals { + role_ids = { + contributor = "b24988ac-6180-42a0-ab88-20f7382dd24c" + reader = "acdd72a7-3385-48ef-bd42-f606fba81ae7" + storage_blob_data_contrib = "ba92f5b4-2d11-453d-a403-e96b0029c9fe" + key_vault_secrets_user = "4633458b-17de-408a-b874-0445c86b69e6" + } +} +``` diff --git a/skills/azure-landing-zone-architect/SKILL.md b/skills/azure-landing-zone-architect/SKILL.md new file mode 100644 index 00000000..0c71344a --- /dev/null +++ b/skills/azure-landing-zone-architect/SKILL.md @@ -0,0 +1,109 @@ +--- +name: azure-landing-zone-architect +description: | + Design and evolve Azure Landing Zones following Microsoft's Cloud Adoption Framework. 
+ Use when: + (1) Designing a new Azure platform foundation or landing zone + (2) Evaluating or evolving an existing landing zone architecture + (3) Planning identity, networking, governance, or security design areas + (4) Implementing hub-spoke or Virtual WAN topologies + (5) Setting up management groups, policies, and subscription organization + (6) Designing platform vs application landing zones + Triggers: landing zone, ALZ, Cloud Adoption Framework, CAF, platform design, + management groups, hub-spoke, Virtual WAN, subscription vending, governance +--- + +# Azure Landing Zone Architect + +Design Azure platforms that are secure, scalable, and governed from day one. + +## Landing Zone Conceptual Architecture + +``` +Tenant Root Group +├── Platform +│ ├── Management # Logging, monitoring, automation +│ ├── Identity # Azure AD, domain controllers +│ └── Connectivity # Hub networking, DNS, firewall +└── Landing Zones + ├── Corp # Internal workloads (private connectivity) + ├── Online # Internet-facing workloads + └── Sandbox # Development/experimentation +``` + +## Design Areas + +### 1. Identity and Access Management +- Azure AD tenant design +- Privileged Identity Management (PIM) +- Conditional Access policies +- Hybrid identity with AD DS + +### 2. Network Topology and Connectivity +- Hub-spoke vs Virtual WAN +- Private DNS zones +- ExpressRoute/VPN connectivity +- Azure Firewall or NVA placement + +### 3. Resource Organization +- Management group hierarchy +- Subscription design patterns +- Naming and tagging standards +- Resource group strategies + +### 4. Governance and Compliance +- Azure Policy assignments +- Regulatory compliance (SOC2, ISO, HIPAA) +- Cost management boundaries +- Blueprints and guardrails + +### 5. Security +- Microsoft Defender for Cloud +- Network segmentation +- Encryption standards +- Security baseline policies + +### 6. 
Management and Monitoring +- Log Analytics workspace topology +- Azure Monitor configuration +- Automation accounts +- Update management + +### 7. Platform Automation +- Infrastructure as Code strategy +- CI/CD for platform +- Subscription vending automation +- GitOps for policy + +### 8. Business Continuity +- Backup policies +- Disaster recovery regions +- RPO/RTO requirements +- Cross-region replication + +## Quick Decision Framework + +| Question | If Yes → | If No → | +|----------|----------|---------| +| Multi-region with >50 VNets? | Virtual WAN | Hub-spoke | +| Need SD-WAN integration? | Virtual WAN | Hub-spoke | +| Complex routing requirements? | Azure Firewall | NSGs + UDRs | +| Regulatory compliance? | Dedicated subscriptions | Shared with policies | +| >10 application teams? | Subscription vending | Manual provisioning | + +## Platform Landing Zone Checklist + +- [ ] Management group hierarchy defined +- [ ] Subscription naming convention established +- [ ] Connectivity model selected (hub-spoke/vWAN) +- [ ] Identity integration planned +- [ ] Policy baseline selected +- [ ] Logging architecture designed +- [ ] BCDR strategy documented + +## References + +- **Identity design**: See [references/identity.md](references/identity.md) +- **Network topology**: See [references/networking.md](references/networking.md) +- **Governance patterns**: See [references/governance.md](references/governance.md) +- **Security baseline**: See [references/security.md](references/security.md) diff --git a/skills/azure-landing-zone-architect/references/governance.md b/skills/azure-landing-zone-architect/references/governance.md new file mode 100644 index 00000000..7e5c7268 --- /dev/null +++ b/skills/azure-landing-zone-architect/references/governance.md @@ -0,0 +1,229 @@ +# Governance and Compliance Design + +## Table of Contents +1. [Management Group Hierarchy](#management-group-hierarchy) +2. [Subscription Organization](#subscription-organization) +3. 
[Azure Policy Strategy](#azure-policy-strategy) +4. [Naming and Tagging](#naming-and-tagging) +5. [Cost Management](#cost-management) + +## Management Group Hierarchy + +### Recommended Structure +``` +Tenant Root Group +│ +├── Platform +│ ├── Management +│ │ └── sub-management-001 +│ ├── Identity +│ │ └── sub-identity-001 +│ └── Connectivity +│ └── sub-connectivity-001 +│ +├── Landing Zones +│ ├── Corp +│ │ ├── sub-app-finance-prod +│ │ ├── sub-app-finance-dev +│ │ └── sub-app-hr-prod +│ ├── Online +│ │ ├── sub-web-marketing-prod +│ │ └── sub-web-ecommerce-prod +│ └── Confidential +│ └── sub-app-pii-prod +│ +├── Sandbox +│ └── sub-sandbox-team-a +│ +└── Decommissioned + └── (subscriptions pending deletion) +``` + +### Management Group Purpose + +| Management Group | Purpose | Policy Focus | +|------------------|---------|--------------| +| Platform | Core infrastructure | Strict security, no workloads | +| Management | Monitoring, automation | Log collection, diagnostics | +| Identity | Identity services | AD DS, Azure AD Connect | +| Connectivity | Networking | Hub VNets, gateways, firewall | +| Landing Zones | Application workloads | Balanced security/agility | +| Corp | Internal apps | Private connectivity required | +| Online | Internet-facing | WAF, DDoS protection | +| Confidential | Sensitive data | Encryption, data residency | +| Sandbox | Experimentation | Relaxed policies, no prod data | +| Decommissioned | Cleanup staging | Deny all deployments | + +## Subscription Organization + +### Subscription Design Patterns + +| Pattern | When to Use | Example | +|---------|-------------|---------| +| Environment-based | Simple workloads | sub-app-prod, sub-app-dev | +| Workload-based | Complex apps | sub-frontend, sub-backend | +| Team-based | Autonomous teams | sub-team-platform, sub-team-data | +| Hybrid | Large organizations | Combine patterns per need | + +### Subscription Limits to Consider +- 980 resource groups per subscription +- 800 deployments per 
resource group +- Other limits vary by resource type (check the quotas for each service) + +### Subscription Vending +Automate subscription creation with: +1. Request via ServiceNow/form +2. Approval workflow +3. Subscription creation (Bicep/Terraform) +4. Baseline policy assignment +5. Initial RBAC setup +6. Network peering to hub +7. Handoff to application team + +## Azure Policy Strategy + +### Policy Assignment Hierarchy +``` +Tenant Root ──► Security baseline (audit mode) + │ + ├── Platform ──► Strict security (deny mode) + │ + ├── Landing Zones ──► Workload policies + │ ├── Corp ──► Private endpoint required + │ └── Online ──► WAF required + │ + └── Sandbox ──► Minimal restrictions +``` + +### Essential Policy Initiatives + +#### Security Baseline +- Require HTTPS on storage accounts +- Require TLS 1.2 minimum +- Deny public blob access +- Require managed disk encryption +- Enable Microsoft Defender for Cloud + +#### Network Governance +- Allowed virtual network locations +- Require NSG on subnets +- Deny public IPs (except approved) +- Require private endpoints for PaaS + +#### Tag Governance +- Require cost center tag +- Require environment tag +- Require owner tag +- Inherit tags from resource group + +#### Cost Control +- Allowed VM SKUs +- Allowed storage SKUs +- Deny expensive services in sandbox + +### Policy Definition Example +```json +{ + "mode": "All", + "policyRule": { + "if": { + "allOf": [ + { + "field": "type", + "equals": "Microsoft.Storage/storageAccounts" + }, + { + "field": "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly", + "notEquals": true + } + ] + }, + "then": { + "effect": "deny" + } + } +} +``` + +### Policy Rollout Strategy +1. **Deploy in Audit mode** - Assess impact +2. **Review compliance** - Identify non-compliant resources +3. **Remediate existing** - Fix or exempt resources +4. **Switch to Deny mode** - Enforce going forward +5. 
**Monitor continuously** - Review compliance dashboard + +## Naming and Tagging + +### Naming Convention +``` +{resource-type}-{workload}-{environment}-{region}-{instance} +``` + +| Component | Values | Example | +|-----------|--------|---------| +| resource-type | rg, vnet, st, kv, app | rg | +| workload | app name or function | payments | +| environment | prod, dev, staging, test | prod | +| region | eastus, westus2, etc. | eastus | +| instance | 001, 002 (if multiple) | 001 | + +**Example**: `rg-payments-prod-eastus-001` + +### Required Tags + +| Tag | Purpose | Example Values | +|-----|---------|----------------| +| Environment | Deployment stage | prod, dev, staging | +| CostCenter | Billing allocation | CC-12345 | +| Owner | Responsible team/person | team-payments@company.com | +| Application | Application name | PaymentGateway | +| DataClassification | Data sensitivity | public, internal, confidential | +| CreatedDate | Resource creation | 2026-01-22 | +| Automation | IaC managed | terraform, bicep, manual | + +### Tag Inheritance +```bicep +// Resource group tags flow to resources +resource rg 'Microsoft.Resources/resourceGroups@2023-07-01' = { + name: 'rg-payments-prod' + location: location + tags: { + Environment: 'prod' + CostCenter: 'CC-12345' + Owner: 'team-payments@company.com' + } +} +``` + +## Cost Management + +### Budget Alerts +``` +Subscription Budget: +├── 50% threshold → Email notification +├── 75% threshold → Email + Teams notification +├── 90% threshold → Email + Teams + trigger automation +└── 100% threshold → Email + Teams + potential scale-down +``` + +### Cost Allocation +- Tag resources with CostCenter +- Use cost allocation rules for shared services +- Create cost views per application/team +- Schedule weekly/monthly reports + +### Reserved Instances Strategy +| Resource | Commitment | Savings | +|----------|------------|---------| +| VMs (steady state) | 3-year | ~60% | +| SQL Database | 1-year | ~30% | +| Cosmos DB | 1-year | ~20% | 
+| App Service | 1-year | ~35% | + +### Cost Optimization Checklist +- [ ] Right-size VMs based on metrics +- [ ] Use auto-shutdown for dev/test +- [ ] Delete orphaned disks and NICs +- [ ] Use spot VMs for batch workloads +- [ ] Enable storage lifecycle policies +- [ ] Review advisor recommendations weekly diff --git a/skills/azure-landing-zone-architect/references/identity.md b/skills/azure-landing-zone-architect/references/identity.md new file mode 100644 index 00000000..14de40dc --- /dev/null +++ b/skills/azure-landing-zone-architect/references/identity.md @@ -0,0 +1,171 @@ +# Identity and Access Management Design + +## Table of Contents +1. [Azure AD Tenant Design](#azure-ad-tenant-design) +2. [Management Group RBAC](#management-group-rbac) +3. [Privileged Identity Management](#privileged-identity-management) +4. [Hybrid Identity](#hybrid-identity) +5. [Conditional Access](#conditional-access) + +## Azure AD Tenant Design + +### Single vs Multiple Tenants + +| Factor | Single Tenant | Multiple Tenants | +|--------|---------------|------------------| +| Collaboration | Seamless | Requires B2B | +| Administration | Centralized | Distributed | +| Compliance isolation | Via subscriptions | Complete separation | +| Recommended | Most organizations | Strict regulatory needs | + +### Tenant Configuration Baseline +- Enable Security Defaults or Conditional Access +- Configure Azure AD Identity Protection +- Enable Azure AD audit logs to Log Analytics +- Implement break-glass accounts (2 minimum) +- Disable legacy authentication + +## Management Group RBAC + +### Built-in Roles by Scope + +| Role | Scope | Use Case | +|------|-------|----------| +| Owner | Tenant Root | Platform team only | +| Contributor | Landing Zone MG | Application teams | +| Reader | Platform MG | Security/compliance teams | +| Network Contributor | Connectivity sub | NetOps team | +| Security Admin | Tenant | SecOps team | + +### Custom Role: Landing Zone Owner +```json +{ + "Name": "Landing 
Zone Owner", + "Description": "Full control within landing zone, no policy changes", + "Actions": [ + "Microsoft.Resources/*", + "Microsoft.Compute/*", + "Microsoft.Storage/*", + "Microsoft.Network/*", + "Microsoft.Web/*", + "Microsoft.Sql/*" + ], + "NotActions": [ + "Microsoft.Authorization/policyAssignments/*", + "Microsoft.Authorization/policyDefinitions/*", + "Microsoft.Network/virtualNetworks/virtualNetworkPeerings/*" + ], + "AssignableScopes": ["/providers/Microsoft.Management/managementGroups/landing-zones"] +} +``` + +### Custom Role: Subscription Vending +```json +{ + "Name": "Subscription Creator", + "Description": "Create and manage subscriptions", + "Actions": [ + "Microsoft.Subscription/aliases/*", + "Microsoft.Management/managementGroups/subscriptions/*", + "Microsoft.Resources/subscriptions/read" + ], + "AssignableScopes": ["/providers/Microsoft.Management/managementGroups/root"] +} +``` + +## Privileged Identity Management + +### PIM Configuration for Platform Roles + +| Role | Activation Duration | Require Approval | MFA Required | +|------|---------------------|------------------|--------------| +| Global Admin | 2 hours | Yes | Yes | +| Owner (Tenant Root) | 4 hours | Yes | Yes | +| Contributor (Platform) | 8 hours | No | Yes | +| Reader (any) | Permanent eligible | No | Yes | + +### Break-Glass Account Configuration +1. Create 2 cloud-only accounts (no hybrid sync) +2. Exclude from all Conditional Access policies +3. Assign Global Administrator permanently +4. Store credentials in physical safe +5. Monitor sign-ins with alerts +6. 
Test quarterly + +## Hybrid Identity + +### Identity Models + +| Model | Description | Recommended When | +|-------|-------------|------------------| +| Cloud-only | All identities in Azure AD | New organizations, no on-prem | +| Synchronized | AD DS synced to Azure AD | Existing AD DS investment | +| Federated | ADFS or PingFederate | Strict on-prem auth requirements | + +### Azure AD Connect Deployment +``` +On-premises Azure +┌─────────────────┐ ┌─────────────────┐ +│ AD DS │ │ Azure AD │ +│ Domain │◄───────────────►│ Tenant │ +│ Controllers │ Azure AD │ │ +└─────────────────┘ Connect └─────────────────┘ + │ Sync │ + │ │ + ▼ ▼ +┌─────────────────┐ ┌─────────────────┐ +│ Users │ │ Cloud Apps │ +│ Groups │ │ Azure RBAC │ +│ Computers │ │ Conditional │ +└─────────────────┘ └─────────────────┘ +``` + +### Recommended Sync Settings +- Password Hash Sync: **Enabled** (required for Identity Protection) +- Password Writeback: **Enabled** (for SSPR) +- Device Writeback: **Optional** (for Hybrid Azure AD Join) +- Filtering: By OU or group (not attribute-based) + +## Conditional Access + +### Baseline Policies (Minimum Required) + +1. **Require MFA for admins** + - Users: Directory roles (Global Admin, etc.) + - Cloud apps: All + - Grant: Require MFA + +2. **Require MFA for Azure Management** + - Users: All + - Cloud apps: Microsoft Azure Management + - Grant: Require MFA + +3. **Block legacy authentication** + - Users: All + - Cloud apps: All + - Conditions: Client apps = Other clients + - Grant: Block + +4. 
**Require compliant devices for sensitive apps** + - Users: All + - Cloud apps: Select sensitive apps + - Grant: Require compliant device + +### Named Locations +``` +Corporate Networks: +- 203.0.113.0/24 (HQ) +- 198.51.100.0/24 (Branch) + +Trusted Countries: +- United States +- United Kingdom +- (Add as needed) +``` + +### Emergency Access Exclusions +Always exclude break-glass accounts from: +- MFA requirements +- Device compliance +- Location-based policies +- Risk-based policies diff --git a/skills/azure-landing-zone-architect/references/networking.md b/skills/azure-landing-zone-architect/references/networking.md new file mode 100644 index 00000000..ba951da3 --- /dev/null +++ b/skills/azure-landing-zone-architect/references/networking.md @@ -0,0 +1,211 @@ +# Network Topology and Connectivity Design + +## Table of Contents +1. [Topology Selection](#topology-selection) +2. [Hub-Spoke Architecture](#hub-spoke-architecture) +3. [Virtual WAN Architecture](#virtual-wan-architecture) +4. [Connectivity Patterns](#connectivity-patterns) +5. [DNS Design](#dns-design) +6. 
[IP Address Management](#ip-address-management) + +## Topology Selection + +### Decision Matrix + +| Requirement | Hub-Spoke | Virtual WAN | +|-------------|-----------|-------------| +| < 30 VNets | ✅ Recommended | ⚠️ Overkill | +| 30-200 VNets | ⚠️ Complex | ✅ Recommended | +| > 200 VNets | ❌ Not scalable | ✅ Required | +| SD-WAN integration | ❌ Manual | ✅ Native | +| Custom routing | ✅ Full control | ⚠️ Limited | +| Cost sensitivity | ✅ Lower | ⚠️ Higher base cost | +| Multi-region | ⚠️ Complex | ✅ Simplified | + +### Hybrid Approach +Start with hub-spoke, migrate to Virtual WAN when: +- VNet count exceeds 50 +- Multi-region connectivity becomes complex +- SD-WAN integration is required + +## Hub-Spoke Architecture + +### Reference Design +``` + ┌─────────────────────────────────────┐ + │ On-premises Network │ + └───────────────┬─────────────────────┘ + │ ExpressRoute/VPN + ┌───────────────▼─────────────────────┐ + │ Hub VNet │ + │ ┌─────────┐ ┌─────────────────┐ │ + │ │ Gateway │ │ Azure Firewall │ │ + │ │ Subnet │ │ Subnet │ │ + │ └─────────┘ └─────────────────┘ │ + │ ┌─────────────────────────────┐ │ + │ │ Shared Services │ │ + │ │ (DNS, AD DS, Jump boxes) │ │ + │ └─────────────────────────────┘ │ + └──────┬─────────────┬───────────────┘ + │ Peering │ Peering + ┌────────────▼──┐ ┌───▼────────────┐ + │ Spoke VNet │ │ Spoke VNet │ + │ (Workload A) │ │ (Workload B) │ + └───────────────┘ └────────────────┘ +``` + +### Hub VNet Subnets + +| Subnet | Size | Purpose | +|--------|------|---------| +| GatewaySubnet | /27 minimum | VPN/ExpressRoute gateways | +| AzureFirewallSubnet | /26 required | Azure Firewall | +| AzureBastionSubnet | /26 minimum | Azure Bastion | +| SharedServices | /24 | DNS forwarders, DCs, jump boxes | +| Management | /24 | Automation, monitoring agents | + +### Spoke VNet Subnets + +| Subnet | Size | Purpose | +|--------|------|---------| +| Application | /24 | App tier | +| Data | /24 | Database tier | +| PrivateEndpoints | /24 | Private 
endpoints | +| AppGateway | /24 | Application Gateway (if needed) | + +### Peering Configuration +```bicep +resource hubToSpokePeering 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-05-01' = { + name: 'hub-to-spoke-${spokeName}' + parent: hubVnet + properties: { + remoteVirtualNetwork: { id: spokeVnet.id } + allowVirtualNetworkAccess: true + allowForwardedTraffic: true + allowGatewayTransit: true // Hub has gateway + useRemoteGateways: false + } +} + +resource spokeToHubPeering 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-05-01' = { + name: 'spoke-to-hub' + parent: spokeVnet + properties: { + remoteVirtualNetwork: { id: hubVnet.id } + allowVirtualNetworkAccess: true + allowForwardedTraffic: true + allowGatewayTransit: false + useRemoteGateways: true // Use hub's gateway + } +} +``` + +## Virtual WAN Architecture + +### Reference Design +``` +┌─────────────────────────────────────────────────────────────┐ +│ Virtual WAN │ +│ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ Hub (East) │◄────────────►│ Hub (West) │ │ +│ │ │ Global │ │ │ +│ │ - VPN Gateway │ Transit │ - VPN Gateway │ │ +│ │ - ER Gateway │ │ - ER Gateway │ │ +│ │ - Firewall │ │ - Firewall │ │ +│ └───────┬─────────┘ └───────┬─────────┘ │ +└──────────┼────────────────────────────────┼────────────────┘ + │ │ + ┌──────▼──────┐ ┌──────▼──────┐ + │ Spoke VNets │ │ Spoke VNets │ + │ (East US) │ │ (West US) │ + └─────────────┘ └─────────────┘ +``` + +### Virtual Hub Configuration + +| Setting | Recommended Value | +|---------|-------------------| +| Hub address space | /23 (512 addresses) | +| VPN Gateway scale units | Start with 1, scale as needed | +| ExpressRoute scale units | Match circuit bandwidth | +| Firewall tier | Premium for IDPS | +| Routing preference | ExpressRoute > VPN | + +## Connectivity Patterns + +### ExpressRoute Configuration +- Use ExpressRoute Global Reach for site-to-site +- Deploy redundant circuits (2 circuits, 2 peering locations) +- Enable FastPath 
for performance-sensitive workloads +- Configure BFD for faster failover + +### VPN Configuration +- Use IKEv2 for stability +- Configure BGP for dynamic routing +- Deploy active-active gateways +- Use custom IPsec policies (AES256, SHA256) + +### Private Link Strategy +``` +Public Service Private Endpoint Consumer VNet +┌─────────────┐ ┌─────────────────┐ ┌─────────────┐ +│ Storage │◄──────│ privatelink. │◄─────│ Application │ +│ Account │ │ blob.core. │ │ │ +└─────────────┘ │ windows.net │ └─────────────┘ + └─────────────────┘ +``` + +## DNS Design + +### Private DNS Zones (Centralized in Hub) + +| Service | Zone Name | +|---------|-----------| +| Blob Storage | privatelink.blob.core.windows.net | +| Key Vault | privatelink.vaultcore.azure.net | +| SQL Database | privatelink.database.windows.net | +| Cosmos DB | privatelink.documents.azure.com | +| ACR | privatelink.azurecr.io | +| App Service | privatelink.azurewebsites.net | + +### DNS Resolution Flow +``` +1. VM queries DNS (168.63.129.16) +2. Azure DNS forwards to Private DNS Zone +3. Private DNS returns private endpoint IP +4. 
VM connects via private IP +``` + +### Hybrid DNS +``` +On-premises DNS ──► Azure DNS Private Resolver ──► Private DNS Zones + │ + ▼ + Azure-provided DNS + (168.63.129.16) +``` + +## IP Address Management + +### Address Space Planning + +| Environment | Range | Notes | +|-------------|-------|-------| +| Hub VNets | 10.0.0.0/16 | Per region | +| Spoke VNets | 10.1.0.0/16 - 10.255.0.0/16 | By subscription/workload | +| On-premises | 192.168.0.0/16 | Existing | +| Reserved | 172.16.0.0/12 | Future growth | + +### Subnet Calculator +``` +/16 = 65,536 addresses = 256 x /24 subnets +/24 = 256 addresses (251 usable, Azure reserves 5) +/26 = 64 addresses (59 usable) +/27 = 32 addresses (27 usable) +``` + +### IP Addressing Best Practices +- Never overlap with on-premises ranges +- Reserve space for future growth (2x current needs) +- Use contiguous ranges per region for summarization +- Document all allocations in IPAM tool diff --git a/skills/azure-landing-zone-architect/references/security.md b/skills/azure-landing-zone-architect/references/security.md new file mode 100644 index 00000000..86b8bcd9 --- /dev/null +++ b/skills/azure-landing-zone-architect/references/security.md @@ -0,0 +1,234 @@ +# Security Baseline Design + +## Table of Contents +1. [Microsoft Defender for Cloud](#microsoft-defender-for-cloud) +2. [Network Security](#network-security) +3. [Identity Security](#identity-security) +4. [Data Protection](#data-protection) +5. 
[Security Operations](#security-operations) + +## Microsoft Defender for Cloud + +### Enable Defender Plans + +| Plan | Enable For | Priority | +|------|------------|----------| +| Defender for Servers | All VMs | High | +| Defender for Storage | All storage accounts | High | +| Defender for SQL | All SQL databases | High | +| Defender for Containers | AKS clusters | High | +| Defender for Key Vault | All Key Vaults | High | +| Defender for App Service | All web apps | Medium | +| Defender for ARM | Subscription level | Medium | +| Defender for DNS | Subscription level | Medium | + +### Secure Score Targets + +| Level | Score | Timeline | +|-------|-------|----------| +| Minimum | 60% | Immediate | +| Target | 80% | 3 months | +| Optimal | 90%+ | 6 months | + +### Security Baseline Policy +Assign at Platform management group: +```json +{ + "displayName": "Enable Defender for Cloud", + "policyDefinitionId": "/providers/Microsoft.Authorization/policySetDefinitions/1f3afdf9-d0c9-4c3d-847f-89da613e70a8" +} +``` + +## Network Security + +### Defense in Depth +``` +Internet + │ + ▼ +┌─────────────────┐ +│ Azure DDoS │ Layer 3/4 protection +│ Protection │ +└────────┬────────┘ + │ + ┌────▼────┐ + │ WAF │ Layer 7 protection + │ (AGW) │ OWASP rules + └────┬────┘ + │ +┌────────▼────────┐ +│ Azure Firewall │ Network filtering +│ (Hub VNet) │ FQDN rules, threat intel +└────────┬────────┘ + │ + ┌────▼────┐ + │ NSG │ Subnet/NIC filtering + └────┬────┘ + │ + ┌────▼────┐ + │ App │ Application controls + └─────────┘ +``` + +### NSG Rules Template +```bicep +resource nsg 'Microsoft.Network/networkSecurityGroups@2023-05-01' = { + name: 'nsg-${workload}-${environment}' + location: location + properties: { + securityRules: [ + { + name: 'AllowHTTPS' + properties: { + priority: 100 + direction: 'Inbound' + access: 'Allow' + protocol: 'Tcp' + sourceAddressPrefix: 'VirtualNetwork' + destinationPortRange: '443' + destinationAddressPrefix: '*' + sourcePortRange: '*' + } + } + { + name: 
'DenyAllInbound' + properties: { + priority: 4096 + direction: 'Inbound' + access: 'Deny' + protocol: '*' + sourceAddressPrefix: '*' + destinationPortRange: '*' + destinationAddressPrefix: '*' + sourcePortRange: '*' + } + } + ] + } +} +``` + +### Azure Firewall Rules + +| Rule Collection | Priority | Rules | +|-----------------|----------|-------| +| Platform-Allow | 100 | Azure services, Windows Update | +| App-Allow | 200 | Application-specific FQDNs | +| Default-Deny | 65000 | Deny all (implicit) | + +### Private Endpoint Strategy +All PaaS services must use private endpoints: +- Storage accounts +- Key Vault +- SQL Database +- Cosmos DB +- Container Registry +- App Services (when applicable) + +## Identity Security + +### Privileged Access Workstations (PAW) +``` +Tier 0: Domain Controllers, Azure AD + └── PAW required for all access + +Tier 1: Member Servers, Azure resources + └── PAW or secure jumpbox + +Tier 2: End user devices + └── Standard workstations +``` + +### Just-In-Time Access +Enable JIT for all management VMs: +- Maximum 3-hour access window +- Require justification +- Automatic NSG rule removal +- Audit all access requests + +### Service Principal Security +- Use managed identities over service principals +- If SP required: certificate auth, not secrets +- Rotate credentials every 90 days +- Minimum required permissions only +- Monitor sign-in logs + +## Data Protection + +### Encryption Requirements + +| Data State | Encryption | Key Management | +|------------|------------|----------------| +| At rest | AES-256 | Platform or CMK | +| In transit | TLS 1.2+ | Azure managed | +| In use | Confidential (when required) | Customer managed | + +### Key Vault Configuration +```bicep +resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = { + name: keyVaultName + properties: { + tenantId: subscription().tenantId + sku: { family: 'A', name: 'premium' } // HSM-backed + enableRbacAuthorization: true + enablePurgeProtection: true // NEVER disable + 
enableSoftDelete: true + softDeleteRetentionInDays: 90 + publicNetworkAccess: 'Disabled' + networkAcls: { + defaultAction: 'Deny' + bypass: 'AzureServices' + } + } +} +``` + +### Data Classification Policies + +| Classification | Controls | +|----------------|----------| +| Public | Standard encryption | +| Internal | Encryption + private network | +| Confidential | CMK + private endpoint + audit | +| Restricted | CMK + private endpoint + DLP + CASB | + +## Security Operations + +### Log Collection +Collect these logs in central Log Analytics: + +| Log Type | Source | Retention | +|----------|--------|-----------| +| Activity Log | All subscriptions | 90 days | +| Azure AD Sign-ins | Tenant | 90 days | +| Azure AD Audit | Tenant | 90 days | +| NSG Flow Logs | All NSGs | 30 days | +| Diagnostic Logs | All resources | 30 days | +| Azure Firewall Logs | Hub firewall | 90 days | + +### Security Alerts +Configure alerts for: +- Failed login attempts (>5 in 10 min) +- Privileged role activations +- Policy violations +- Defender for Cloud high severity +- Unusual network traffic patterns + +### Incident Response +``` +Detection → Triage → Containment → Eradication → Recovery → Lessons Learned + │ │ │ │ │ │ + ▼ ▼ ▼ ▼ ▼ ▼ + Sentinel Severity Isolate Remove Restore Document + Alerts Assessment Resource Threat Services & Improve +``` + +### Security Review Cadence + +| Review | Frequency | Participants | +|--------|-----------|--------------| +| Secure Score | Weekly | Security team | +| Access Reviews | Monthly | Resource owners | +| Policy Compliance | Monthly | Governance team | +| Penetration Test | Annually | Third party | +| Architecture Review | Quarterly | Security + Platform | diff --git a/skills/azure-waf-assessment/SKILL.md b/skills/azure-waf-assessment/SKILL.md new file mode 100644 index 00000000..2f311182 --- /dev/null +++ b/skills/azure-waf-assessment/SKILL.md @@ -0,0 +1,160 @@ +--- +name: azure-waf-assessment +description: | + Guided Q&A workflow for conducting 
Azure Well-Architected Framework assessments. + Use this skill when: + - Running interactive architecture assessments with stakeholders + - Gathering structured input about Azure workloads + - Conducting WAF pillar-specific deep dives + - Documenting assessment findings systematically + This skill provides the conversation structure, question frameworks, and output templates + for assessments. For interpretation of findings, use azure-waf-review. +--- + +# Azure Architecture Assessment Runner + +Conduct structured Well-Architected Framework assessments through guided Q&A sessions. + +## Assessment Types + +1. **Full WAF Assessment** - All 5 pillars, 60-90 minutes +2. **Pillar Deep Dive** - Single pillar focus, 30-45 minutes +3. **Quick Health Check** - Key questions only, 15-20 minutes +4. **Pre-Production Review** - Launch readiness, 45-60 minutes + +## Starting an Assessment + +Before beginning, gather: +- Workload name and business purpose +- Architecture diagram (if available) +- Current Azure services in use +- Team roles participating + +## Assessment Flow + +``` +1. Context Gathering (5 min) + └── Workload overview, criticality, current state + +2. Pillar Assessment (varies) + ├── Reliability Questions + ├── Security Questions + ├── Cost Optimization Questions + ├── Operational Excellence Questions + └── Performance Efficiency Questions + +3. Finding Synthesis (10 min) + └── Summarize gaps, prioritize recommendations + +4. Action Planning (10 min) + └── Assign owners, set timelines +``` + +## Question Framework + +For each question: +1. Ask the question clearly +2. Listen to the response +3. Probe for details if the answer is vague +4. Note the maturity level (1-5) +5.
Record specific gaps or risks + +### Maturity Levels +- **1 - Ad Hoc**: No process, reactive only +- **2 - Developing**: Some awareness, inconsistent +- **3 - Defined**: Documented process exists +- **4 - Managed**: Metrics tracked, continuous improvement +- **5 - Optimized**: Industry-leading practices + +## Core Questions by Pillar + +See [references/questions.md](references/questions.md) for the complete question bank. + +### Quick Health Check Questions (1-2 per pillar) + +**Reliability** +> "If your primary region became unavailable right now, what would happen to this workload?" + +**Security** +> "How do you manage access credentials and secrets for this application?" + +**Cost** +> "Do you have visibility into what this workload costs monthly, and who reviews it?" + +**Operations** +> "How do you deploy changes to production, and how would you roll back a bad deployment?" + +**Performance** +> "How do you know when this application is performing well vs. struggling?" + +## Documenting Findings + +### Finding Template +```markdown +## Finding: [Short Title] + +**Pillar**: [Reliability|Security|Cost|Operations|Performance] +**Severity**: [Critical|High|Medium|Low] +**Current State**: [What exists today] +**Gap**: [What's missing or inadequate] +**Risk**: [What could go wrong] +**Recommendation**: [What to do] +**Effort**: [S/M/L] +**Priority**: [P1/P2/P3] +``` + +### Assessment Report Structure + +See [references/report-template.md](references/report-template.md) for the full template. + +```markdown +# WAF Assessment Report: [Workload Name] + +## Executive Summary +- Assessment date and participants +- Overall maturity score +- Top 3-5 priority recommendations + +## Pillar Scores +| Pillar | Score (1-5) | Key Gap | +|--------|-------------|---------| +| Reliability | X | ... | +| Security | X | ... | +| Cost | X | ... | +| Operations | X | ... | +| Performance | X | ... 
| + +## Detailed Findings +[Organized by pillar] + +## Recommendations Roadmap +[Prioritized action items with owners] + +## Appendix +- Full question responses +- Architecture diagrams +- Reference links +``` + +## Probing Techniques + +When answers are vague: +- "Can you walk me through what happens when...?" +- "Who is responsible for that, and how do they know it's working?" +- "When was the last time you tested that?" +- "What would you do if that failed at 2 AM?" +- "How would a new team member learn about this?" + +## Session Management + +### Opening Script +> "Today we're going to review [workload] against the Azure Well-Architected Framework. This isn't an audit—it's a collaborative discussion to identify improvement opportunities. I'll ask questions across five areas: reliability, security, cost, operations, and performance. There are no wrong answers; I'm trying to understand your current state. Let's start with an overview of the workload." + +### Closing Script +> "Thank you for your time. Based on our discussion, I've identified [N] findings across the five pillars. The top priorities appear to be [1, 2, 3]. I'll compile a detailed report with recommendations and share it by [date]. Any questions before we wrap up?" + +## References + +- [questions.md](references/questions.md) - Complete question bank by pillar +- [report-template.md](references/report-template.md) - Full assessment report template +- [scoring-guide.md](references/scoring-guide.md) - Maturity scoring criteria diff --git a/skills/azure-waf-assessment/references/questions.md b/skills/azure-waf-assessment/references/questions.md new file mode 100644 index 00000000..945ded34 --- /dev/null +++ b/skills/azure-waf-assessment/references/questions.md @@ -0,0 +1,151 @@ +# WAF Assessment Question Bank + +Complete question set for Azure Well-Architected Framework assessments, organized by pillar. + +## Reliability Questions + +### Availability Design +1. 
What is the target SLA/SLO for this workload? +2. What availability zones or regions does this workload span? +3. How do critical components handle zone or region failures? +4. What single points of failure exist in the architecture? +5. How is state managed if a component fails mid-transaction? + +### Disaster Recovery +6. What is the defined RPO (Recovery Point Objective)? +7. What is the defined RTO (Recovery Time Objective)? +8. Where are backups stored, and how often are they tested? +9. Is there a documented DR runbook? +10. When was the last DR test, and what was the outcome? + +### Resilience +11. How does the application handle downstream service failures? +12. Are retry policies and circuit breakers implemented? +13. How do you handle transient failures (network blips, throttling)? +14. What happens if the database becomes temporarily unavailable? +15. How does the system degrade gracefully under partial failure? + +### Testing +16. Do you perform chaos engineering or fault injection testing? +17. How do you validate that failover mechanisms work? +18. Are load tests run to verify behavior under stress? + +## Security Questions + +### Identity & Access +1. How do users authenticate to this application? +2. How do services authenticate to each other and to Azure resources? +3. Are managed identities used where possible? +4. What is the process for granting access to production? +5. How often are access permissions reviewed? +6. Is privileged access time-bound (PIM/JIT)? + +### Data Protection +7. How is data classified (public, internal, confidential, restricted)? +8. Is data encrypted at rest? What mechanisms? +9. Is data encrypted in transit? TLS version? +10. How are encryption keys managed and rotated? +11. Where are secrets stored, and how are they accessed? + +### Network Security +12. Is the workload deployed in a VNet? +13. Are private endpoints used for PaaS services? +14. What network segmentation exists (NSGs, subnets)? +15. 
Is there a WAF in front of public endpoints? +16. How is egress traffic controlled and monitored? + +### Threat Detection +17. Is Microsoft Defender for Cloud enabled? +18. How are security alerts monitored and responded to? +19. Is there a defined incident response process? +20. How do you track vulnerabilities in dependencies? + +## Cost Optimization Questions + +### Visibility +1. Do you know what this workload costs monthly? +2. How are costs tracked and allocated (tags, subscriptions)? +3. Who reviews cost reports, and how often? +4. Are cost anomaly alerts configured? +5. Are there defined budgets with alerts? + +### Right-Sizing +6. When were compute resources last right-sized? +7. Are auto-scaling policies in place? +8. How do you identify underutilized resources? +9. Are dev/test environments scaled down after hours? + +### Commitment Discounts +10. Are Reserved Instances used for predictable workloads? +11. Are Savings Plans used for compute? +12. Are Spot VMs used where appropriate? + +### Architecture Efficiency +13. Are PaaS services used where possible vs. IaaS? +14. Is there unnecessary data duplication or redundancy? +15. Are there zombie resources (unused disks, IPs, etc.)? +16. Could serverless options reduce cost for variable workloads? + +## Operational Excellence Questions + +### DevOps Practices +1. Is all infrastructure defined as code (Bicep/Terraform)? +2. What percentage of infrastructure is managed via IaC? +3. How are changes deployed to production? +4. How long does a typical deployment take? +5. How do you roll back a failed deployment? + +### Source Control +6. Is all code in source control? +7. Are pull requests required for changes? +8. Is there a branch protection strategy? + +### Monitoring +9. What metrics are collected for this workload? +10. What is alerted on, and who receives alerts? +11. How do you know when the system is healthy vs. degraded? +12. Is there a dashboard for operational visibility? +13. Are logs centralized? 
Where? + +### Incident Management +14. Is there an on-call rotation? +15. Is there a defined incident response process? +16. How are incidents documented and reviewed (post-mortems)? +17. What was the last major incident, and what did you learn? + +### Documentation +18. Is there a runbook for common operational tasks? +19. How does a new team member get onboarded? +20. Is the architecture documented and kept current? + +## Performance Efficiency Questions + +### Baseline +1. What are the key performance metrics for this workload? +2. What are the defined performance targets (latency, throughput)? +3. How do you know if performance is acceptable? +4. When did you last performance test this workload? + +### Compute Tier +5. How was the current compute tier selected? +6. Is auto-scaling enabled? What triggers scaling? +7. What is the typical CPU/memory utilization? +8. Are there known compute bottlenecks? + +### Data Tier +9. What database technology is used, and why? +10. Is the database tier appropriately sized? +11. Are queries optimized? When were they last reviewed? +12. Is caching used to reduce database load? +13. Are indexes optimized for query patterns? + +### Network +14. What is the typical latency for key operations? +15. Is CDN used for static content? +16. Is traffic routed to the nearest region? +17. Are there known network bottlenecks? + +### Optimization +18. Is asynchronous processing used for long-running tasks? +19. Are batch operations used where appropriate? +20. Is content compressed for transfer? diff --git a/skills/azure-waf-assessment/references/report-template.md b/skills/azure-waf-assessment/references/report-template.md new file mode 100644 index 00000000..7871c85b --- /dev/null +++ b/skills/azure-waf-assessment/references/report-template.md @@ -0,0 +1,224 @@ +# WAF Assessment Report Template + +Copy and customize this template for each assessment. 
+ +--- + +# Well-Architected Framework Assessment Report + +**Workload**: [Name] +**Assessment Date**: [Date] +**Assessor**: [Name/Team] +**Participants**: [Names and roles] +**Report Version**: 1.0 + +--- + +## Executive Summary + +### Overview +[2-3 sentences describing the workload and assessment scope] + +### Assessment Scope +- [ ] Full WAF Assessment (all 5 pillars) +- [ ] Pillar Deep Dive: [specify] +- [ ] Quick Health Check +- [ ] Pre-Production Review + +### Overall Maturity + +| Pillar | Score (1-5) | Trend | +|--------|-------------|-------| +| Reliability | X.X | ↑↓→ | +| Security | X.X | ↑↓→ | +| Cost Optimization | X.X | ↑↓→ | +| Operational Excellence | X.X | ↑↓→ | +| Performance Efficiency | X.X | ↑↓→ | +| **Overall** | **X.X** | | + +### Top Recommendations + +| # | Recommendation | Pillar | Priority | Effort | +|---|---------------|--------|----------|--------| +| 1 | [Action item] | [Pillar] | P1 | [S/M/L] | +| 2 | [Action item] | [Pillar] | P1 | [S/M/L] | +| 3 | [Action item] | [Pillar] | P2 | [S/M/L] | + +--- + +## Workload Context + +### Business Purpose +[What does this workload do? Who uses it?] 
+ +### Criticality +- [ ] Mission Critical +- [ ] Business Critical +- [ ] Business Unit Critical +- [ ] Development/Test + +### Architecture Overview +[Insert or link to architecture diagram] + +**Key Components**: +- [Component 1]: [Description] +- [Component 2]: [Description] +- [Component 3]: [Description] + +**Azure Services Used**: +- [Service 1] +- [Service 2] +- [Service 3] + +--- + +## Pillar Assessments + +### Reliability + +**Score**: X.X / 5.0 + +#### Strengths +- [Strength 1] +- [Strength 2] + +#### Findings + +##### REL-001: [Finding Title] +- **Severity**: [Critical/High/Medium/Low] +- **Current State**: [Description] +- **Gap**: [What's missing] +- **Risk**: [Potential impact] +- **Recommendation**: [Action to take] +- **Effort**: [S/M/L] +- **Azure Service/Feature**: [Relevant service] + +##### REL-002: [Finding Title] +[Repeat structure] + +--- + +### Security + +**Score**: X.X / 5.0 + +#### Strengths +- [Strength 1] +- [Strength 2] + +#### Findings + +##### SEC-001: [Finding Title] +- **Severity**: [Critical/High/Medium/Low] +- **Current State**: [Description] +- **Gap**: [What's missing] +- **Risk**: [Potential impact] +- **Recommendation**: [Action to take] +- **Effort**: [S/M/L] +- **Azure Service/Feature**: [Relevant service] + +--- + +### Cost Optimization + +**Score**: X.X / 5.0 + +#### Strengths +- [Strength 1] +- [Strength 2] + +#### Findings + +##### COST-001: [Finding Title] +- **Severity**: [Critical/High/Medium/Low] +- **Current State**: [Description] +- **Gap**: [What's missing] +- **Risk**: [Potential impact] +- **Recommendation**: [Action to take] +- **Estimated Savings**: [$/month if applicable] +- **Effort**: [S/M/L] + +--- + +### Operational Excellence + +**Score**: X.X / 5.0 + +#### Strengths +- [Strength 1] +- [Strength 2] + +#### Findings + +##### OPS-001: [Finding Title] +- **Severity**: [Critical/High/Medium/Low] +- **Current State**: [Description] +- **Gap**: [What's missing] +- **Risk**: [Potential impact] +- 
**Recommendation**: [Action to take] +- **Effort**: [S/M/L] + +--- + +### Performance Efficiency + +**Score**: X.X / 5.0 + +#### Strengths +- [Strength 1] +- [Strength 2] + +#### Findings + +##### PERF-001: [Finding Title] +- **Severity**: [Critical/High/Medium/Low] +- **Current State**: [Description] +- **Gap**: [What's missing] +- **Risk**: [Potential impact] +- **Recommendation**: [Action to take] +- **Effort**: [S/M/L] + +--- + +## Recommendations Roadmap + +### Immediate (0-30 days) +| ID | Recommendation | Owner | Target Date | Status | +|----|---------------|-------|-------------|--------| +| SEC-001 | [Action] | [Name] | [Date] | Not Started | +| REL-001 | [Action] | [Name] | [Date] | Not Started | + +### Short-term (30-90 days) +| ID | Recommendation | Owner | Target Date | Status | +|----|---------------|-------|-------------|--------| +| COST-001 | [Action] | [Name] | [Date] | Not Started | +| OPS-001 | [Action] | [Name] | [Date] | Not Started | + +### Medium-term (90-180 days) +| ID | Recommendation | Owner | Target Date | Status | +|----|---------------|-------|-------------|--------| +| PERF-001 | [Action] | [Name] | [Date] | Not Started | + +--- + +## Appendix + +### A. Assessment Responses +[Full Q&A transcript or summary] + +### B. Architecture Diagrams +[Additional diagrams] + +### C. Reference Links +- [Azure Well-Architected Framework](https://learn.microsoft.com/azure/well-architected/) +- [WAF Assessment Tool](https://learn.microsoft.com/assessments/) +- [Azure Advisor](https://portal.azure.com/#blade/Microsoft_Azure_Expert/AdvisorMenuBlade) + +### D. 
Version History +| Version | Date | Author | Changes | +|---------|------|--------|---------| +| 1.0 | [Date] | [Name] | Initial assessment | + +--- + +**Next Assessment**: [Recommended date - typically 6-12 months] diff --git a/skills/azure-waf-assessment/references/scoring-guide.md b/skills/azure-waf-assessment/references/scoring-guide.md new file mode 100644 index 00000000..fccefe4b --- /dev/null +++ b/skills/azure-waf-assessment/references/scoring-guide.md @@ -0,0 +1,234 @@ +# Maturity Scoring Guide + +Consistent scoring criteria for Well-Architected Framework assessments. + +## Scoring Scale Overview + +| Level | Name | Description | +|-------|------|-------------| +| 1 | Ad Hoc | No defined process; reactive and inconsistent | +| 2 | Developing | Awareness exists; some practices emerging | +| 3 | Defined | Documented processes; consistently followed | +| 4 | Managed | Metrics tracked; continuous improvement | +| 5 | Optimized | Industry-leading; automated and proactive | + +--- + +## Reliability Scoring + +### Level 1 - Ad Hoc +- No defined SLA/SLO +- Single region deployment without redundancy +- No documented DR plan +- Backups not tested +- No resilience patterns implemented + +### Level 2 - Developing +- SLA defined but not measured +- Some redundancy within single region +- Basic backup in place +- DR plan exists but not tested +- Manual failover procedures + +### Level 3 - Defined +- SLA measured and reported +- Multi-AZ deployment for critical components +- Regular backup verification +- DR plan tested annually +- Retry policies implemented + +### Level 4 - Managed +- SLO tracking with automated alerts +- Active-passive multi-region for critical workloads +- Automated DR testing quarterly +- Circuit breakers and bulkheads in place +- Chaos engineering experiments conducted + +### Level 5 - Optimized +- Error budgets actively managed +- Active-active multi-region with automatic failover +- Continuous DR validation +- Self-healing architectures +- 
Proactive capacity management + +--- + +## Security Scoring + +### Level 1 - Ad Hoc +- Shared credentials or long-lived keys +- No network segmentation +- Secrets in code or config files +- No security monitoring +- Ad-hoc access grants + +### Level 2 - Developing +- Individual accounts but broad permissions +- Basic NSGs in place +- Secrets in Key Vault (some) +- Defender for Cloud enabled but not monitored +- Manual access reviews + +### Level 3 - Defined +- RBAC with principle of least privilege +- Network segmentation with NSGs and subnets +- All secrets in Key Vault +- Security alerts monitored during business hours +- Regular access reviews conducted + +### Level 4 - Managed +- Managed identities for all service-to-service auth +- Private endpoints for all PaaS services +- Automated secret rotation +- 24/7 security monitoring with defined SLAs +- Privileged Identity Management (PIM) enabled + +### Level 5 - Optimized +- Zero-trust architecture fully implemented +- Micro-segmentation with application-level controls +- Hardware security modules for critical keys +- Automated threat response +- Continuous compliance validation + +--- + +## Cost Optimization Scoring + +### Level 1 - Ad Hoc +- No visibility into workload costs +- No tagging strategy +- Resources provisioned without size justification +- No budgets defined +- Unused resources accumulate + +### Level 2 - Developing +- Total spend known but not by workload +- Basic tagging in place +- Some right-sizing done reactively +- Budgets defined but not monitored +- Occasional cleanup of unused resources + +### Level 3 - Defined +- Cost allocation by workload via tags +- Regular cost reviews (monthly) +- Right-sizing performed quarterly +- Budget alerts configured +- Reserved instances for stable workloads + +### Level 4 - Managed +- Real-time cost dashboards +- Automated anomaly detection +- Continuous right-sizing recommendations acted upon +- Showback/chargeback to business units +- Savings plans and RIs 
optimized + +### Level 5 - Optimized +- Cost integrated into architecture decisions +- FinOps practice established +- Automated resource lifecycle management +- Spot instances for appropriate workloads +- Continuous cost optimization culture + +--- + +## Operational Excellence Scoring + +### Level 1 - Ad Hoc +- Manual deployments via portal +- No source control for infrastructure +- No monitoring beyond basic Azure metrics +- No documented runbooks +- Ad-hoc incident response + +### Level 2 - Developing +- Some automation scripts +- Infrastructure partially in source control +- Basic alerting on availability +- Some documentation exists +- Defined escalation path + +### Level 3 - Defined +- IaC for all new deployments +- CI/CD pipelines for applications +- Comprehensive monitoring with dashboards +- Runbooks for common operations +- Incident management process defined + +### Level 4 - Managed +- All infrastructure as code +- Automated testing in CI/CD +- SLI/SLO tracking with automated alerts +- Post-incident reviews conducted +- Documentation kept current + +### Level 5 - Optimized +- GitOps with drift detection +- Continuous delivery with feature flags +- AIOps for proactive issue detection +- Learning culture with blameless post-mortems +- Self-service platforms for developers + +--- + +## Performance Efficiency Scoring + +### Level 1 - Ad Hoc +- No performance requirements defined +- Resources sized by guess +- No performance testing +- No caching strategy +- Synchronous processing everywhere + +### Level 2 - Developing +- Basic performance expectations +- Resources sized based on similar workloads +- Occasional load testing +- Some caching implemented +- Awareness of async patterns + +### Level 3 - Defined +- Performance SLOs defined +- Resources sized based on testing +- Load testing before major releases +- Caching strategy documented +- Async processing for long operations + +### Level 4 - Managed +- Performance monitored continuously +- Auto-scaling based 
on demand +- Regular performance testing in CI/CD +- CDN for static content +- Regular database query optimization + +### Level 5 - Optimized +- Performance engineering culture +- Predictive scaling based on patterns +- Performance budgets in development +- Edge computing where beneficial +- Continuous performance optimization + +--- + +## Calculating Overall Scores + +### Pillar Score +Average the maturity scores of the individual findings within the pillar, applying a severity multiplier to each finding's score: +- Critical findings: 0.5x multiplier +- High findings: 0.75x multiplier +- Medium findings: 1.0x multiplier +- Low findings: 1.0x multiplier + +### Overall Score +Average of all pillar scores, optionally weighted by business priority: +- Mission Critical: Weight reliability and security higher +- Cost Sensitive: Weight cost optimization higher +- High Velocity: Weight operational excellence higher + +### Score Interpretation +| Score Range | Assessment | Action | +|-------------|------------|--------| +| 1.0 - 1.9 | Critical gaps | Immediate remediation needed | +| 2.0 - 2.9 | Significant gaps | Prioritize improvements | +| 3.0 - 3.9 | Moderate maturity | Continuous improvement | +| 4.0 - 4.9 | High maturity | Fine-tuning and optimization | +| 5.0 | Exceptional | Maintain and share best practices | diff --git a/skills/azure-waf-review/SKILL.md b/skills/azure-waf-review/SKILL.md new file mode 100644 index 00000000..7f314586 --- /dev/null +++ b/skills/azure-waf-review/SKILL.md @@ -0,0 +1,156 @@ +--- +name: azure-waf-review +description: | + Review Azure architectures using the Well-Architected Framework (WAF) pillars.
+ Use when: + (1) Conducting architecture reviews for Azure workloads + (2) Identifying reliability, security, cost, or performance gaps + (3) Preparing for Azure Well-Architected Review assessments + (4) Evaluating existing architectures against best practices + (5) Creating remediation plans for architecture improvements + (6) Comparing design options using WAF principles + Triggers: Well-Architected, WAF, architecture review, reliability review, + security review, cost optimization, performance review, operational excellence +--- + +# Azure Architecture WAF Review + +Evaluate Azure architectures against the five pillars of the Well-Architected Framework. + +## The Five Pillars + +| Pillar | Focus | Key Question | +|--------|-------|--------------| +| **Reliability** | Resiliency, availability, recovery | Can the workload recover from failures? | +| **Security** | Protect data, systems, assets | Is the workload protected against threats? | +| **Cost Optimization** | Manage costs, maximize value | Is spending efficient and justified? | +| **Operational Excellence** | Operations, monitoring, DevOps | Can the team operate and improve the workload? | +| **Performance Efficiency** | Scalability, load handling | Does the workload meet performance demands? | + +## Review Process + +### 1. Scope Definition +- Identify workload boundaries +- Document current architecture (diagram) +- List dependencies and integrations +- Define business requirements (SLA, RTO, RPO) + +### 2. Pillar Assessment +For each pillar, evaluate: +- Current state against best practices +- Gaps and risks identified +- Impact (High/Medium/Low) +- Remediation complexity + +### 3. Prioritization +Use impact vs effort matrix: +``` + High Impact + │ + Quick Wins │ Major Projects + │ +Low Effort ───┼─── High Effort + │ + Fill-ins│ Deprioritize + │ + Low Impact +``` + +### 4. 
Recommendations +- Specific, actionable improvements +- Linked to WAF guidance +- Estimated effort and impact +- Implementation sequence + +## Quick Assessment Checklist + +### Reliability +- [ ] Multi-region or availability zone deployment? +- [ ] Defined RTO and RPO targets? +- [ ] Automated failover configured? +- [ ] Health probes and self-healing? +- [ ] Backup strategy tested? +- [ ] Chaos engineering practiced? + +### Security +- [ ] Zero-trust network model? +- [ ] Managed identities used? +- [ ] Secrets in Key Vault? +- [ ] Encryption at rest and in transit? +- [ ] WAF for internet-facing apps? +- [ ] Defender for Cloud enabled? + +### Cost Optimization +- [ ] Right-sized resources? +- [ ] Reserved instances for steady workloads? +- [ ] Auto-scaling configured? +- [ ] Orphaned resources cleaned up? +- [ ] Cost alerts and budgets set? +- [ ] Storage lifecycle policies? + +### Operational Excellence +- [ ] Infrastructure as Code? +- [ ] CI/CD pipelines? +- [ ] Centralized logging? +- [ ] Alerts and dashboards? +- [ ] Runbooks for common issues? +- [ ] Post-incident reviews conducted? + +### Performance Efficiency +- [ ] Appropriate service tiers? +- [ ] Caching strategy implemented? +- [ ] CDN for static content? +- [ ] Database query optimization? +- [ ] Load testing performed? +- [ ] Auto-scaling rules defined? 
+ +## Severity Classification + +| Severity | Criteria | Action | +|----------|----------|--------| +| Critical | Active security risk or outage likely | Immediate remediation | +| High | Significant gap in WAF compliance | Remediate within 30 days | +| Medium | Best practice not followed | Plan for next quarter | +| Low | Optimization opportunity | Backlog for future | + +## Review Output Template + +```markdown +## Architecture Review: [Workload Name] + +**Date**: [Review Date] +**Reviewer**: [Name] +**Stakeholders**: [List] + +### Executive Summary +[2-3 sentence overview of findings] + +### Architecture Diagram +[Link to diagram] + +### Findings by Pillar + +#### Reliability (Score: X/5) +| Finding | Severity | Recommendation | +|---------|----------|----------------| +| [Gap] | [Critical/High/Medium/Low] | [Action] | + +#### Security (Score: X/5) +... + +### Prioritized Recommendations +1. [Highest priority item] +2. [Second priority] +3. [Third priority] + +### Next Steps +- [Action items with owners and dates] +``` + +## References + +- **Reliability pillar**: See [references/reliability.md](references/reliability.md) +- **Security pillar**: See [references/security.md](references/security.md) +- **Cost optimization**: See [references/cost.md](references/cost.md) +- **Operational excellence**: See [references/operations.md](references/operations.md) +- **Performance efficiency**: See [references/performance.md](references/performance.md) diff --git a/skills/azure-waf-review/references/cost.md b/skills/azure-waf-review/references/cost.md new file mode 100644 index 00000000..c77d7201 --- /dev/null +++ b/skills/azure-waf-review/references/cost.md @@ -0,0 +1,233 @@ +# Cost Optimization Pillar - Deep Dive + +## Table of Contents +1. [Cost Visibility](#cost-visibility) +2. [Right-Sizing](#right-sizing) +3. [Reserved Capacity](#reserved-capacity) +4. [Architecture Optimization](#architecture-optimization) +5.
[Operational Efficiency](#operational-efficiency) + +## Cost Visibility + +### Cost Management Setup +1. Enable cost analysis in Azure portal +2. Create budgets with alerts +3. Configure cost allocation tags +4. Set up scheduled reports +5. Use Cost Management APIs for automation + +### Budget Configuration +```bicep +resource budget 'Microsoft.Consumption/budgets@2023-11-01' = { + name: 'budget-${workload}-monthly' + properties: { + category: 'Cost' + amount: 10000 + timeGrain: 'Monthly' + timePeriod: { + startDate: '2026-01-01' + endDate: '2026-12-31' + } + notifications: { + notification50: { + enabled: true + threshold: 50 + operator: 'GreaterThan' + contactEmails: ['finance@company.com'] + } + notification90: { + enabled: true + threshold: 90 + operator: 'GreaterThan' + contactEmails: ['finance@company.com', 'engineering@company.com'] + contactRoles: ['Owner'] + } + } + } +} +``` + +### Required Cost Tags + +| Tag | Purpose | Enforcement | +|-----|---------|-------------| +| CostCenter | Chargeback/showback | Policy: Deny if missing | +| Environment | Dev/Prod cost split | Policy: Deny if missing | +| Owner | Accountability | Policy: Audit | +| Application | Workload attribution | Policy: Deny if missing | +| ExpirationDate | Temporary resources | Automation cleanup | + +## Right-Sizing + +### VM Right-Sizing Process +1. Enable Azure Monitor agent +2. Collect 30+ days of metrics +3. Analyze CPU, memory, disk, network +4. Use Azure Advisor recommendations +5. Test new size in non-production +6. Implement change with minimal downtime + +### Right-Sizing Guidelines + +| Metric | Action if Consistently... 
| +|--------|---------------------------| +| CPU < 20% | Downsize or use burstable | +| CPU > 80% | Upsize or scale out | +| Memory < 30% | Consider smaller SKU | +| Memory > 90% | Upsize | +| Disk IOPS < 20% capacity | Standard SSD or HDD | + +### Burstable VMs +Use B-series for: +- Development environments +- Small databases with spiky loads +- Test/QA workloads +- Low-traffic web servers + +### Azure Advisor Query +```kusto +AdvisorResources +| where type == "microsoft.advisor/recommendations" +| where properties.category == "Cost" +// Cast savings to a number so the sort below is numeric, not lexicographic +| project + resource = tostring(properties.resourceMetadata.resourceId), + savings = todouble(properties.extendedProperties.savingsAmount), + recommendation = tostring(properties.shortDescription.problem) +| order by savings desc +``` + +## Reserved Capacity + +### Reservation Strategy + +| Workload Pattern | Recommendation | +|------------------|----------------| +| Steady 24/7 | 3-year reservation (60% savings) | +| Predictable business hours | 1-year + spot for off-hours | +| Variable/unpredictable | Pay-as-you-go + auto-scaling | +| Dev/Test | Dev/Test pricing + auto-shutdown | + +### Services with Reservations + +| Service | Reservation Type | Max Savings | +|---------|------------------|-------------| +| Virtual Machines | Reserved Instances | 72% | +| SQL Database | Reserved Capacity | 55% | +| Cosmos DB | Reserved Capacity | 65% | +| Azure Synapse | Reserved Capacity | 65% | +| App Service | Reserved Instances | 55% | +| Azure Files | Reserved Capacity | 36% | +| Blob Storage | Reserved Capacity | 38% | + +### Reservation Scope Options +- **Shared**: Automatically applies across subscriptions +- **Single subscription**: Limits to one subscription +- **Single resource group**: Most restrictive +- **Management group**: Recommended for enterprises + +## Architecture Optimization + +### Serverless vs Always-On + +| Workload | Serverless Option | When to Use | +|----------|-------------------|-------------| +| API | Azure
Functions | < 1M requests/month | +| Web | Static Web Apps | Static content + Functions | +| Data processing | Event-driven Functions | Batch/event workloads | +| Containers | Container Apps | Scale to zero needed | + +### Storage Optimization + +| Access Pattern | Storage Tier | Cost Relative | +|----------------|--------------|---------------| +| Frequent (daily) | Hot | 100% | +| Infrequent (monthly) | Cool | 50% | +| Rare (quarterly) | Cold | 30% | +| Archive (yearly) | Archive | 10% | + +### Lifecycle Policy +```bicep +resource storagePolicy 'Microsoft.Storage/storageAccounts/managementPolicies@2023-01-01' = { + name: 'default' + parent: storageAccount + properties: { + policy: { + rules: [ + { + name: 'archiveOldData' + type: 'Lifecycle' + definition: { + filters: { blobTypes: ['blockBlob'] } + actions: { + baseBlob: { + tierToCool: { daysAfterModificationGreaterThan: 30 } + tierToArchive: { daysAfterModificationGreaterThan: 90 } + delete: { daysAfterModificationGreaterThan: 365 } + } + } + } + } + ] + } + } +} +``` + +### Caching Strategy +- Use Azure Cache for Redis for session state +- Implement CDN for static content +- Use application-level caching +- Consider read replicas for databases + +## Operational Efficiency + +### Auto-Shutdown for Dev/Test +```bicep +resource autoShutdown 'Microsoft.DevTestLab/schedules@2018-09-15' = { + name: 'shutdown-computevm-${vmName}' + location: location + properties: { + status: 'Enabled' + taskType: 'ComputeVmShutdownTask' + dailyRecurrence: { time: '1900' } + timeZoneId: 'Pacific Standard Time' + targetResourceId: vm.id + } +} +``` + +### Spot VMs for Batch Workloads +- Up to 90% savings +- Can be evicted with 30-second notice +- Ideal for: batch processing, CI/CD agents, rendering + +```bicep +resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2023-09-01' = { + properties: { + virtualMachineProfile: { + priority: 'Spot' + evictionPolicy: 'Deallocate' + billingProfile: { + maxPrice: -1 // Up to on-demand price + 
} + } + } +} +``` + +### Cleanup Automation +Weekly script to identify: +- Unattached disks +- Unused public IPs +- Empty resource groups +- Stopped VMs (still incurring disk costs) +- Expired snapshots + +### Cost Review Cadence + +| Review | Frequency | Focus | +|--------|-----------|-------| +| Daily alerts | Automated | Budget threshold breaches | +| Weekly review | Team | Advisor recommendations | +| Monthly analysis | Finance | Trend analysis, forecasting | +| Quarterly optimization | Architecture | Major cost reduction initiatives | diff --git a/skills/azure-waf-review/references/operations.md b/skills/azure-waf-review/references/operations.md new file mode 100644 index 00000000..8ff8337e --- /dev/null +++ b/skills/azure-waf-review/references/operations.md @@ -0,0 +1,247 @@ +# Operational Excellence Pillar - Deep Dive + +## Table of Contents +1. [DevOps Practices](#devops-practices) +2. [Infrastructure as Code](#infrastructure-as-code) +3. [Monitoring and Observability](#monitoring-and-observability) +4. [Incident Management](#incident-management) +5. 
[Continuous Improvement](#continuous-improvement) + +## DevOps Practices + +### DevOps Maturity Model + +| Level | Characteristics | Target | +|-------|-----------------|--------| +| Initial | Manual processes, no version control | Move to Basic | +| Basic | Source control, some automation | 3 months | +| Intermediate | CI/CD, IaC, basic monitoring | 6 months | +| Advanced | Full automation, observability | 12 months | +| Optimized | Self-healing, predictive | Ongoing | + +### CI/CD Pipeline Requirements +```yaml +# Minimum viable pipeline +stages: + - Build + - Test (unit, integration) + - Security scan + - Deploy to dev + - Smoke tests + - Deploy to staging + - Integration tests + - Approval gate + - Deploy to production + - Health check +``` + +### Branching Strategy + +| Strategy | Best For | +|----------|----------| +| Trunk-based | Experienced teams, fast releases | +| GitFlow | Release cycles, multiple versions | +| GitHub Flow | Continuous deployment, SaaS | + +### Deployment Strategies + +| Strategy | Rollback Speed | Risk | Complexity | +|----------|----------------|------|------------| +| Blue-Green | Instant | Low | Medium | +| Canary | Fast | Medium | High | +| Rolling | Slow | Medium | Low | +| Feature Flags | Instant | Low | Medium | + +## Infrastructure as Code + +### IaC Requirements +- All infrastructure defined in code +- Version controlled in Git +- Peer-reviewed via pull requests +- Tested before deployment +- Immutable deployments preferred + +### IaC Tool Selection + +| Tool | Best For | Azure Integration | +|------|----------|-------------------| +| Bicep | Azure-native, simplicity | Native | +| Terraform | Multi-cloud, existing skills | Excellent | +| Pulumi | Developer-centric, programming languages | Good | +| ARM | Legacy, complex conditions | Native | + +### Deployment Validation +```bash +# Bicep +az deployment group what-if \ + --resource-group rg-prod \ + --template-file main.bicep + +# Terraform +terraform plan -out=tfplan +``` + 
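The validation commands above can be wrapped in a small gate. This is a minimal sketch, not a definitive pipeline step: `plan_status` and `preview_terraform` are hypothetical helper names, and the mapping relies on the documented `-detailed-exitcode` behavior of `terraform plan` (0 = no changes, 1 = error, 2 = changes present).

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions, not part of any tool.

# Map the exit code of `terraform plan -detailed-exitcode` to a label.
plan_status() {
  case "$1" in
    0) echo "no-changes" ;;       # plan succeeded, empty diff
    2) echo "changes-pending" ;;  # plan succeeded, non-empty diff
    *) echo "error" ;;            # 1 (or anything else) means the plan failed
  esac
}

# Run the plan, send Terraform's own output to stderr, and print only the label.
preview_terraform() {
  local rc=0
  terraform plan -detailed-exitcode -out=tfplan >&2 || rc=$?
  plan_status "$rc"
}
```

In a pipeline, a step could call `preview_terraform` and run `terraform apply tfplan` only when it prints `changes-pending`.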
+### State Management +- Store state remotely (Azure Storage for Terraform) +- Enable state locking +- Encrypt state at rest +- Backup state files +- Audit state changes + +## Monitoring and Observability + +### Three Pillars of Observability + +| Pillar | Azure Service | Purpose | +|--------|---------------|---------| +| Metrics | Azure Monitor Metrics | Quantitative health data | +| Logs | Log Analytics | Detailed diagnostic data | +| Traces | Application Insights | Request flow tracking | + +### Monitoring Hierarchy +``` +Business Metrics + │ + ▼ +Service Health (SLIs/SLOs) + │ + ▼ +Application Performance (APM) + │ + ▼ +Infrastructure Metrics + │ + ▼ +Resource Health +``` + +### Key Metrics by Service + +| Service | Critical Metrics | +|---------|------------------| +| Web App | Response time, error rate, throughput | +| SQL Database | DTU/vCore %, deadlocks, connections | +| Storage | Availability, latency, capacity | +| VM | CPU, memory, disk IOPS, network | +| AKS | Pod health, node status, resource pressure | + +### Alert Strategy +```bicep +resource alert 'Microsoft.Insights/metricAlerts@2018-03-01' = { + name: 'alert-high-cpu-${vmName}' + location: 'global' // metric alerts are deployed to the global location + properties: { + severity: 2 + enabled: true + scopes: [vm.id] + evaluationFrequency: 'PT5M' + windowSize: 'PT15M' + criteria: { + 'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria' + allOf: [ + { + name: 'HighCPU' + criterionType: 'StaticThresholdCriterion' + metricName: 'Percentage CPU' + operator: 'GreaterThan' + threshold: 90 + timeAggregation: 'Average' + } + ] + } + actions: [ + { actionGroupId: actionGroup.id } + ] + } +} +``` + +### Dashboard Requirements +- Real-time service health +- Key business metrics +- Error rates and trends +- Resource utilization +- Cost trends +- Deployment status + +## Incident Management + +### Incident Severity Levels + +| Severity | Impact | Response Time | Example | +|----------|--------|---------------|---------| +| P1 | Complete outage | 15 minutes | Production down | +| P2 | Major degradation | 30 
minutes | Core feature broken | +| P3 | Minor impact | 4 hours | Non-critical issue | +| P4 | Minimal impact | 24 hours | Cosmetic issue | + +### Incident Response Process +``` +Detection → Triage → Communication → Investigation → Resolution → Review + │ │ │ │ │ │ + │ │ │ │ │ │ +Automated Assign Status Root cause Implement Post- +alerts severity updates analysis fix mortem +``` + +### Runbook Template +```markdown +## Runbook: [Issue Name] + +### Symptoms +- [Observable symptoms] + +### Impact +- [Services affected] +- [Users affected] + +### Diagnostic Steps +1. [Step 1] +2. [Step 2] + +### Resolution Steps +1. [Step 1] +2. [Step 2] + +### Escalation +- [Team/person to escalate to] + +### Prevention +- [How to prevent recurrence] +``` + +### Post-Incident Review +- What happened? +- Timeline of events +- What went well? +- What could be improved? +- Action items (with owners and dates) +- Blameless culture + +## Continuous Improvement + +### Improvement Metrics + +| Metric | Target | Measurement | +|--------|--------|-------------| +| Deployment Frequency | Weekly → Daily | Deployments/week | +| Lead Time | Days → Hours | Commit to production | +| MTTR | Hours → Minutes | Incident to resolution | +| Change Failure Rate | < 15% | Failed/total deployments | + +### Technical Debt Management +- Allocate 20% of sprint capacity +- Maintain visible backlog +- Prioritize by impact and risk +- Track debt reduction metrics + +### Capacity Planning +- Review resource utilization monthly +- Forecast 3-6 months ahead +- Plan for peak events +- Test scaling before needed + +### Learning Culture +- Regular training sessions +- Lunch and learns +- Conference attendance +- Certification support +- Internal tech talks diff --git a/skills/azure-waf-review/references/performance.md b/skills/azure-waf-review/references/performance.md new file mode 100644 index 00000000..558cd764 --- /dev/null +++ b/skills/azure-waf-review/references/performance.md @@ -0,0 +1,297 @@ +# Performance 
Efficiency Pillar - Deep Dive + +## Table of Contents +1. [Scaling Strategies](#scaling-strategies) +2. [Compute Optimization](#compute-optimization) +3. [Data Tier Performance](#data-tier-performance) +4. [Network Performance](#network-performance) +5. [Performance Testing](#performance-testing) + +## Scaling Strategies + +### Scaling Patterns + +| Pattern | Description | When to Use | +|---------|-------------|-------------| +| Vertical (Scale Up) | Increase instance size | Quick fix, stateful apps | +| Horizontal (Scale Out) | Add more instances | Stateless apps, high availability | +| Auto-scaling | Dynamic based on metrics | Variable workloads | +| Manual | Scheduled or on-demand | Predictable patterns | + +### Auto-Scaling Configuration +```bicep +resource autoscale 'Microsoft.Insights/autoscalesettings@2022-10-01' = { + name: 'autoscale-${appServicePlan.name}' + location: location + properties: { + targetResourceUri: appServicePlan.id + enabled: true + profiles: [ + { + name: 'default' + capacity: { + minimum: '2' + maximum: '10' + default: '2' + } + rules: [ + { + metricTrigger: { + metricName: 'CpuPercentage' + metricResourceUri: appServicePlan.id + timeGrain: 'PT1M' + statistic: 'Average' + timeWindow: 'PT5M' + timeAggregation: 'Average' + operator: 'GreaterThan' + threshold: 70 + } + scaleAction: { + direction: 'Increase' + type: 'ChangeCount' + value: '1' + cooldown: 'PT5M' + } + } + { + metricTrigger: { + metricName: 'CpuPercentage' + metricResourceUri: appServicePlan.id + timeGrain: 'PT1M' + statistic: 'Average' + timeWindow: 'PT10M' + timeAggregation: 'Average' + operator: 'LessThan' + threshold: 30 + } + scaleAction: { + direction: 'Decrease' + type: 'ChangeCount' + value: '1' + cooldown: 'PT10M' + } + } + ] + } + ] + } +} +``` + +### Scaling Best Practices +- Scale out before scaling up +- Set appropriate cooldown periods +- Use multiple metrics for scaling decisions +- Test scaling under load +- Monitor for flapping (rapid scale up/down) + +## 
Compute Optimization + +### VM Size Selection + +| Series | Optimized For | Use Cases | +|--------|---------------|-----------| +| B | Burstable | Dev/test, small web | +| D | General purpose | Most workloads | +| E | Memory | Databases, caching | +| F | Compute | Batch, gaming | +| L | Storage | Big data, NoSQL | +| N | GPU | ML, rendering | + +### Container Performance +- Use ephemeral OS disks or Premium SSD for AKS node pools (etcd itself is managed by AKS) +- Set resource requests and limits +- Use horizontal pod autoscaler +- Implement pod disruption budgets +- Use node pools for workload isolation + +### Serverless Performance +- Minimize cold starts (Premium plan for Functions) +- Use connection pooling +- Implement async patterns +- Configure appropriate timeout values +- Monitor execution duration trends + +## Data Tier Performance + +### Database Performance Optimization + +| Layer | Optimization | +|-------|--------------| +| Application | Connection pooling, caching, async queries | +| Query | Indexes, query plans, parameterization | +| Database | Proper tier, partitioning, read replicas | +| Storage | Premium storage, proper IOPS allocation | + +### SQL Database Performance +```sql +-- Find missing indexes +SELECT + mig.index_handle, + mid.statement AS TableName, + mid.equality_columns, + mid.inequality_columns, + mid.included_columns, + migs.user_seeks * migs.avg_total_user_cost * (migs.avg_user_impact * 0.01) AS improvement_measure +FROM sys.dm_db_missing_index_group_stats AS migs +INNER JOIN sys.dm_db_missing_index_groups AS mig ON migs.group_handle = mig.index_group_handle +INNER JOIN sys.dm_db_missing_index_details AS mid ON mig.index_handle = mid.index_handle +ORDER BY improvement_measure DESC; +``` + +### Cosmos DB Performance +- Choose appropriate partition key +- Use indexing policies (exclude unused paths) +- Configure RU/s based on workload +- Use autoscale for variable workloads +- Implement retry logic for throttling + +### Caching Strategy +``` +Request → Cache Hit? 
→ Yes → Return cached + │ + No + │ + ▼ + Query Database + │ + ▼ + Update Cache + │ + ▼ + Return Response +``` + +### Cache Options + +| Service | Use Case | Max Latency | +|---------|----------|-------------| +| Azure Cache for Redis | Session, distributed cache | < 1ms | +| CDN | Static content | < 10ms | +| Application cache | Frequently accessed data | < 0.1ms | +| Database cache | Query results | Varies | + +## Network Performance + +### Network Latency Targets + +| Target Latency | Scope | Strategy | +|----------------|-------|----------| +| < 1ms | In-region | Same VNet or peered VNets, proximity placement groups | +| < 10ms | Cross-region | Nearby regions, global VNet peering | +| < 50ms | Global | CDN, edge locations | +| < 100ms | Global with processing | Front Door + backend | + +### Proximity Placement Groups +```bicep +resource ppg 'Microsoft.Compute/proximityPlacementGroups@2023-09-01' = { + name: 'ppg-${workload}' + location: location + properties: { + proximityPlacementGroupType: 'Standard' + } +} + +resource vm 'Microsoft.Compute/virtualMachines@2023-09-01' = { + name: vmName + location: location + properties: { + proximityPlacementGroup: { id: ppg.id } + } +} +``` + +### Accelerated Networking +- Enable for all supported VMs +- Required for low-latency workloads +- Up to 30Gbps network bandwidth +- Reduced jitter and CPU usage + +### CDN Configuration +```bicep +resource cdnProfile 'Microsoft.Cdn/profiles@2023-05-01' = { + name: 'cdn-${workload}' + location: 'global' + sku: { name: 'Standard_Microsoft' } +} + +resource cdnEndpoint 'Microsoft.Cdn/profiles/endpoints@2023-05-01' = { + parent: cdnProfile + name: 'endpoint-${workload}' + location: 'global' + properties: { + origins: [ + { + name: 'origin' + properties: { + // primaryEndpoints.blob is a full URL; strip the scheme and trailing slash to get a host name + hostName: replace(replace(storageAccount.properties.primaryEndpoints.blob, 'https://', ''), '/', '') + } + } + ] + isCompressionEnabled: true + contentTypesToCompress: [ + 'text/plain' + 'text/html' + 'text/css' + 'application/javascript' + 'application/json' + ] + queryStringCachingBehavior: 'IgnoreQueryString' + } +} +``` + +## Performance Testing + +### Load 
Testing Strategy +1. Establish baseline metrics +2. Define performance targets (SLOs) +3. Create realistic test scenarios +4. Execute tests incrementally +5. Analyze results and bottlenecks +6. Optimize and repeat + +### Azure Load Testing +```yaml +# JMeter test configuration +testPlan: 'loadtest.jmx' +engineInstances: 5 +quickStartTest: false +configurationFiles: + - 'user.properties' +failureCriteria: + - avg(response_time_ms) > 500 + - percentage(error) > 10 +``` + +### Performance Targets + +| Metric | Web API | Batch Job | Interactive | +|--------|---------|-----------|-------------| +| Response time (p95) | < 500ms | N/A | < 200ms | +| Throughput | > 1000 RPS | Max capacity | > 100 RPS | +| Error rate | < 0.1% | < 1% | < 0.1% | +| Availability | 99.9% | 99% | 99.95% | + +### Bottleneck Identification + +| Symptom | Likely Bottleneck | Investigation | +|---------|-------------------|---------------| +| High CPU | Application code | Profiler, App Insights | +| High memory | Memory leak, caching | Memory dumps, metrics | +| High latency | Database, network | Query analysis, tracing | +| High disk I/O | Storage tier | Storage metrics, IOPS | +| Connection errors | Pool exhaustion | Connection metrics | + +### Application Insights Performance +```kusto +// Identify slow requests +requests +| where timestamp > ago(1h) +| where success == true +| summarize + p50 = percentile(duration, 50), + p95 = percentile(duration, 95), + p99 = percentile(duration, 99) + by name +| where p95 > 1000 +| order by p95 desc +``` diff --git a/skills/azure-waf-review/references/reliability.md b/skills/azure-waf-review/references/reliability.md new file mode 100644 index 00000000..dccfbd11 --- /dev/null +++ b/skills/azure-waf-review/references/reliability.md @@ -0,0 +1,256 @@ +# Reliability Pillar - Deep Dive + +## Table of Contents +1. [Design Principles](#design-principles) +2. [Availability Patterns](#availability-patterns) +3. [Disaster Recovery](#disaster-recovery) +4. 
[Data Resilience](#data-resilience) +5. [Testing Reliability](#testing-reliability) + +## Design Principles + +### Design for Failure +- Assume components will fail +- Implement redundancy at every layer +- Use health probes and circuit breakers +- Design for graceful degradation + +### Self-Healing +- Automatic restart on failure +- Health endpoint monitoring +- Replace unhealthy instances automatically +- Queue-based load leveling + +### Scale Out +- Prefer horizontal over vertical scaling +- Design stateless applications +- Use managed services where possible +- Implement auto-scaling + +## Availability Patterns + +### Availability Targets + +| SLA | Monthly Downtime | Architecture | +|-----|------------------|--------------| +| 99% | 7.3 hours | Single instance | +| 99.9% | 43.8 minutes | Availability Set | +| 99.95% | 21.9 minutes | Availability Zones | +| 99.99% | 4.3 minutes | Multi-region active | + +### Availability Zones +``` +Azure Region +┌─────────────────────────────────────────────────┐ +│ │ +│ Zone 1 Zone 2 Zone 3 │ +│ ┌───────┐ ┌───────┐ ┌───────┐ │ +│ │ VM │ │ VM │ │ VM │ │ +│ └───────┘ └───────┘ └───────┘ │ +│ │ │ │ │ +│ └──────────────┼──────────────┘ │ +│ │ │ +│ ┌──────▼──────┐ │ +│ │ Load │ │ +│ │ Balancer │ │ +│ └─────────────┘ │ +└─────────────────────────────────────────────────┘ +``` + +### Zone-Redundant Services + +| Service | Zone-Redundant Option | +|---------|----------------------| +| VMs | Zone-redundant VMSS | +| Storage | ZRS or GZRS | +| SQL Database | Zone-redundant deployment | +| AKS | Zone-redundant node pools | +| App Service | Zone-redundant App Service Plan | +| Key Vault | Automatic (Premium) | + +### Multi-Region Active-Active +``` + ┌─────────────────┐ + │ Traffic Manager│ + │ or Front Door │ + └────────┬────────┘ + │ + ┌──────────────┼──────────────┐ + ▼ │ ▼ + ┌──────────┐ │ ┌──────────┐ + │ East US │ │ │ West US │ + │ Region │◄─────────┼──────►│ Region │ + └──────────┘ Data │ └──────────┘ + Sync │ +``` + +## Disaster 
Recovery + +### RTO and RPO Planning + +| Tier | RTO | RPO | Strategy | +|------|-----|-----|----------| +| Mission Critical | < 1 hour | < 1 minute | Active-active multi-region | +| Business Critical | < 4 hours | < 1 hour | Active-passive with hot standby | +| Standard | < 24 hours | < 4 hours | Active-passive with warm standby | +| Low Priority | < 72 hours | < 24 hours | Backup and restore | + +### DR Patterns + +#### Pilot Light +``` +Primary Region (Active) Secondary Region (Pilot) +┌────────────────────┐ ┌────────────────────┐ +│ Full deployment │ │ Minimal resources │ +│ - Web tier │ │ - Scaled down VMs │ +│ - App tier │ │ - Database replica│ +│ - Database │ ──────► │ - Scripts ready │ +└────────────────────┘ Async └────────────────────┘ + Repl +``` + +#### Warm Standby +- Secondary region runs at reduced capacity +- Scale up during failover +- RTO: 15-60 minutes +- Higher cost than pilot light + +#### Hot Standby +- Full deployment in secondary +- Load balanced across regions +- RTO: < 15 minutes +- Highest cost, highest availability + +### Azure Site Recovery +```bicep +resource recoveryVault 'Microsoft.RecoveryServices/vaults@2023-06-01' = { + name: 'rsv-${workload}-${environment}' + location: secondaryLocation + sku: { name: 'RS0', tier: 'Standard' } + properties: {} +} +``` + +## Data Resilience + +### Storage Redundancy Options + +| Option | Protection | Use Case | +|--------|------------|----------| +| LRS | 3 copies in datacenter | Dev/test, easily recreated data | +| ZRS | 3 copies across zones | High availability in single region | +| GRS | 6 copies across regions | DR with manual failover | +| GZRS | ZRS + GRS | Maximum durability | +| RA-GRS | GRS + read access | Read-heavy workloads with DR | +| RA-GZRS | GZRS + read access | Maximum availability and durability | + +### Database High Availability + +#### SQL Database +```bicep +resource sqlDb 'Microsoft.Sql/servers/databases@2023-05-01-preview' = { + name: 'sqldb-${workload}' + properties: { + 
zoneRedundant: true + readScale: 'Enabled' + highAvailabilityReplicaCount: 2 + } +} +``` + +#### Cosmos DB +```bicep +resource cosmosDb 'Microsoft.DocumentDB/databaseAccounts@2023-11-15' = { + properties: { + enableMultipleWriteLocations: true + locations: [ + { locationName: 'eastus', failoverPriority: 0 } + { locationName: 'westus', failoverPriority: 1 } + ] + backupPolicy: { + type: 'Continuous' + continuousModeProperties: { tier: 'Continuous7Days' } + } + } +} +``` + +### Backup Strategy + +| Resource | Backup Method | Retention | +|----------|---------------|-----------| +| VMs | Azure Backup | 30 days daily, 12 months monthly | +| SQL Database | Automated backups | 7-35 days PITR | +| Storage | Soft delete + versioning | 30 days | +| Cosmos DB | Continuous backup | 7-30 days | +| AKS | Velero or Azure Backup | 7 days | + +## Testing Reliability + +### Chaos Engineering +Test failure scenarios: +- VM/instance failures +- Network latency injection +- Zone failures +- Dependency failures +- DNS failures + +### Azure Chaos Studio +```bicep +resource chaosExperiment 'Microsoft.Chaos/experiments@2023-11-01' = { + name: 'exp-vm-shutdown' + location: location + identity: { type: 'SystemAssigned' } + properties: { + selectors: [ + { + type: 'List' + id: 'selector1' + targets: [{ type: 'ChaosTarget', id: vmChaosTarget.id }] + } + ] + steps: [ + { + name: 'Step 1' + branches: [ + { + name: 'Branch 1' + actions: [ + { + type: 'continuous' + name: 'urn:csci:microsoft:virtualMachine:shutdown/1.0' + duration: 'PT5M' + selectorId: 'selector1' + } + ] + } + ] + } + ] + } +} +``` + +### DR Testing Schedule + +| Test Type | Frequency | Scope | +|-----------|-----------|-------| +| Backup restore | Monthly | Sample data | +| Failover drill | Quarterly | Non-production | +| Full DR test | Annually | Production (planned) | +| Chaos experiments | Continuous | Controlled | + +### Health Monitoring +```bicep +resource healthCheck 
'Microsoft.Network/applicationGateways/probes@2023-05-01' = { + name: 'health-probe' + properties: { + protocol: 'Https' + path: '/health' + interval: 30 + timeout: 30 + unhealthyThreshold: 3 + pickHostNameFromBackendHttpSettings: true + } +} +``` diff --git a/skills/azure-waf-review/references/security.md b/skills/azure-waf-review/references/security.md new file mode 100644 index 00000000..f93773ed --- /dev/null +++ b/skills/azure-waf-review/references/security.md @@ -0,0 +1,223 @@ +# Security Pillar - Deep Dive + +## Table of Contents +1. [Zero Trust Model](#zero-trust-model) +2. [Identity Security](#identity-security) +3. [Network Security](#network-security) +4. [Data Protection](#data-protection) +5. [Application Security](#application-security) + +## Zero Trust Model + +### Principles +1. **Verify explicitly** - Always authenticate and authorize +2. **Use least privilege** - Limit access with JIT/JEA +3. **Assume breach** - Minimize blast radius, segment access + +### Implementation Layers +``` +┌─────────────────────────────────────────────────────────┐ +│ Identity Layer │ +│ Azure AD, Conditional Access, MFA, PIM │ +├─────────────────────────────────────────────────────────┤ +│ Device Layer │ +│ Intune, Compliant Devices, Health Attestation │ +├─────────────────────────────────────────────────────────┤ +│ Network Layer │ +│ NSGs, Azure Firewall, Private Link, DDoS │ +├─────────────────────────────────────────────────────────┤ +│ Application Layer │ +│ WAF, API Management, App Service Auth │ +├─────────────────────────────────────────────────────────┤ +│ Data Layer │ +│ Encryption, Classification, DLP, Rights Mgmt │ +└─────────────────────────────────────────────────────────┘ +``` + +## Identity Security + +### Authentication Best Practices + +| Method | Security Level | Use Case | +|--------|----------------|----------| +| Password only | ❌ Low | Never recommended | +| Password + MFA | ⚠️ Medium | Minimum for users | +| Passwordless | ✅ High | Preferred for 
users | +| Managed Identity | ✅ High | Azure workloads | +| Certificate | ✅ High | Service principals | + +### Managed Identity Configuration +```bicep +// System-assigned identity +resource appService 'Microsoft.Web/sites@2023-01-01' = { + name: appServiceName + identity: { + type: 'SystemAssigned' + } +} + +// Grant access to Key Vault +resource keyVaultAccess 'Microsoft.Authorization/roleAssignments@2022-04-01' = { + // Deterministic name derived from scope, principal, and role definition ID + name: guid(keyVault.id, appService.id, '4633458b-17de-408a-b874-0445c86b69e6') + scope: keyVault + properties: { + roleDefinitionId: subscriptionResourceId( + 'Microsoft.Authorization/roleDefinitions', + '4633458b-17de-408a-b874-0445c86b69e6' // Key Vault Secrets User + ) + principalId: appService.identity.principalId + principalType: 'ServicePrincipal' + } +} +``` + +### Service Principal Security +- ❌ Avoid client secrets (rotate every 90 days if used) +- ✅ Use certificate credentials +- ✅ Use federated credentials for CI/CD +- ✅ Limit permissions to minimum required +- ✅ Monitor sign-in logs for anomalies + +## Network Security + +### Network Segmentation +``` +Internet + │ + ▼ +┌─────────────────────────────────────────────┐ +│ DMZ / Perimeter │ +│ - Application Gateway with WAF │ +│ - Azure Front Door │ +│ - DDoS Protection │ +└────────────────┬────────────────────────────┘ + │ +┌────────────────▼────────────────────────────┐ +│ Application Tier (Spoke VNet) │ +│ - NSG: Allow 443 from AGW only │ +│ - Private endpoints for PaaS │ +└────────────────┬────────────────────────────┘ + │ +┌────────────────▼────────────────────────────┐ +│ Data Tier (Spoke VNet) │ +│ - NSG: Allow from App tier only │ +│ - No public endpoints │ +│ - Private endpoints for databases │ +└─────────────────────────────────────────────┘ +``` + +### NSG Rule Priorities + +| Priority | Rule Type | Example | +|----------|-----------|---------| +| 100-199 | Essential allows | Allow health probes | +| 200-299 | Application allows | Allow HTTPS from known sources | +| 300-399 | Management allows | 
Allow Bastion, Azure management | +| 4000-4095 | Explicit denies | Deny specific threats | +| 4096 | Default deny all | Block everything else | + +### Private Endpoint Checklist +- [ ] Storage accounts - blob, file, table, queue +- [ ] Key Vault +- [ ] SQL Database +- [ ] Cosmos DB +- [ ] Container Registry +- [ ] App Configuration +- [ ] Service Bus / Event Hubs +- [ ] Azure Cache for Redis + +## Data Protection + +### Encryption Requirements + +| Requirement | Implementation | +|-------------|----------------| +| Encryption at rest | Azure Storage Service Encryption (default) | +| Customer-managed keys | Key Vault + CMK configuration | +| Encryption in transit | TLS 1.2 minimum enforced | +| Database encryption | TDE enabled (default for SQL) | +| Disk encryption | Azure Disk Encryption or server-side encryption | + +### Key Vault Best Practices +```bicep +resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' = { + name: keyVaultName + properties: { + tenantId: subscription().tenantId + sku: { family: 'A', name: 'premium' } // HSM-backed keys + + // Security settings + enableRbacAuthorization: true // Use RBAC, not access policies + enablePurgeProtection: true // NEVER set to false + enableSoftDelete: true + softDeleteRetentionInDays: 90 + + // Network isolation + publicNetworkAccess: 'Disabled' + networkAcls: { + defaultAction: 'Deny' + bypass: 'AzureServices' + } + } +} +``` + +### Secret Management Rules +1. **Never** store secrets in code or config files +2. **Never** log secrets (mask in all outputs) +3. **Always** use Key Vault references +4. **Rotate** secrets on schedule (90 days max) +5. 
**Audit** all secret access + +## Application Security + +### OWASP Top 10 Mitigations + +| Risk | Azure Mitigation | +|------|------------------| +| Injection | Parameterized queries, WAF SQL rules | +| Broken Auth | Microsoft Entra ID (formerly Azure AD), MFA, token validation | +| Sensitive Data | TLS, encryption, Key Vault | +| XXE | WAF, input validation | +| Broken Access | Azure RBAC, attribute-based access | +| Misconfiguration | Azure Policy, Microsoft Defender for Cloud (formerly Security Center) | +| XSS | WAF, Content-Security-Policy | +| Insecure Deserialization | Input validation, WAF | +| Vulnerable Components | Defender for Containers, Dependabot | +| Logging Failures | Azure Monitor, Log Analytics | + +### Web Application Firewall +```bicep +resource wafPolicy 'Microsoft.Network/ApplicationGatewayWebApplicationFirewallPolicies@2023-05-01' = { + name: 'waf-${workload}' + location: location + properties: { + policySettings: { + mode: 'Prevention' + state: 'Enabled' + requestBodyCheck: true + maxRequestBodySizeInKb: 128 + fileUploadLimitInMb: 100 + } + managedRules: { + managedRuleSets: [ + { + ruleSetType: 'OWASP' + ruleSetVersion: '3.2' + } + { + ruleSetType: 'Microsoft_BotManagerRuleSet' + ruleSetVersion: '1.0' + } + ] + } + } +} +``` + +### Container Security +- Use private container registry (ACR) +- Enable vulnerability scanning +- Use minimal base images (distroless) +- Never run as root +- Enforce Pod Security Standards (Pod Security Admission or Azure Policy for AKS; PodSecurityPolicy was removed in Kubernetes 1.25) +- Enable Defender for Containers
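
The first item in the container checklist can be checked mechanically in CI. The following is a minimal sketch under stated assumptions: images are referenced as `registry/repository:tag`, the private registry uses the standard public-cloud `*.azurecr.io` login server, and `is_acr_image` is a hypothetical helper name rather than part of any Azure tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when an image reference is hosted on ACR.
is_acr_image() {
  local image="$1"
  local host="${image%%/*}"   # text before the first slash (the registry host)

  # A reference with no slash (e.g. "nginx:latest") names no explicit registry.
  [ "$host" != "$image" ] || return 1

  # Match on the host only, so a path segment cannot spoof the registry name.
  case "$host" in
    *.azurecr.io) return 0 ;;
    *) return 1 ;;
  esac
}
```

A pipeline step could loop over the image references in a manifest and fail the build on the first one that does not pass; sovereign clouds or custom registry domains would need additional patterns.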