Production-Grade IaC Module System Design

Learn State-of-the-Art GitOps/IaC Pipeline - Part 009

Production-grade IaC module system design: module boundaries, API contracts, versioning, provider handling, composition, migration, testing, policy compatibility, and failure modes.

2026-07-03

Part 009 — Production-Grade IaC Module System Design

A weak IaC module system looks productive for the first six months.

Then every team wants a special case.

One module grows thirty boolean flags. Another module leaks provider details through outputs. Another has a create_everything = true mode. Nobody knows whether changing a variable replaces production infrastructure. A module upgrade looks small in Git but recreates a database. Teams pin random commits. Security asks whether all buckets are encrypted, but the answer requires reading fifty modules and three hundred environment overlays.

That is not a tooling problem.

It is a module system design problem.

A production-grade IaC module is not just reusable configuration. It is a stable infrastructure API over unsafe provider primitives.

That one sentence should change how you design it.

A provider resource exposes what the cloud can do. A module should expose what your platform allows, supports, audits, and can safely evolve.

This part builds the mental model and design rules for modules that survive real production pressure.

We are not learning module syntax from zero. You already know how to write a module block. We are learning how to decide what a module is allowed to mean.

1. The Core Idea: A Module Is an API, Not a Folder

A beginner sees a module as a folder with variables and outputs.

A production engineer sees a module as a contract.

That contract says:

Contract Area	Question
Intent	What infrastructure capability does this represent?
Ownership	Who owns the lifecycle of resources created by this module?
Inputs	What decisions may consumers make?
Defaults	What does the platform decide on behalf of consumers?
Outputs	What stable facts may other stacks depend on?
Security	Which controls are enforced internally?
Policy	Which organizational rules are encoded or exposed for validation?
State	Which resources share fate and state boundary?
Upgrade	What can change without breaking consumers?
Migration	How do consumers move between versions safely?
Evidence	What can auditors and operators prove from usage?

A module is therefore closer to a library API than a code snippet.

The worst module design mistake is to expose every underlying provider option because “flexibility is good.”

Flexibility at the wrong abstraction layer is not power. It is an unreviewed escape hatch.

A platform module should make the safe path short and the dangerous path explicit.

2. Module Design Starts from Capability Boundaries

Before writing variables, ask:

What capability is this module responsible for?

Do not start from provider resources.

Start from domain capability.

Weak module names:

aws_s3_bucket_wrapper
eks_all
networking
rds_stuff
common_resources

Stronger module names:

object_storage_bucket
private_service_network
postgres_database_instance
workload_identity_binding
http_service_deployment
tenant_runtime_namespace

The stronger names describe what the consumer receives, not which provider resources happen to implement it.

That matters because provider resources change, but platform capabilities should remain stable.

A module boundary is strong when the consumer can explain why they need it without knowing the internal provider resources.

3. Four Module Layers

Most teams mix different abstraction levels in one module system. That is why the system becomes inconsistent.

Use four layers.

3.1 Layer 1 — Primitive Wrapper

A primitive wrapper is a thin wrapper around provider resources.

Example:

aws_s3_bucket_secure_base
aws_iam_role_base
kubernetes_namespace_base

Use sparingly.

Primitive wrappers are useful when you need standard tags, encryption defaults, provider quirks, naming normalization, or repeated safety settings.

They are dangerous when they pretend to be high-level platform APIs.

Good primitive wrapper:

module "bucket_base" {
  source = "git::ssh://git.example.com/platform/iac-modules.git//aws/s3-bucket-base?ref=v1.4.2"

  name              = local.bucket_name
  kms_key_arn       = var.kms_key_arn
  force_destroy     = false
  block_public_acls = true
  tags              = local.tags
}

Bad primitive wrapper:

module "bucket" {
  source = "./bucket"

  enable_public_access       = var.enable_public_access
  enable_private_access      = var.enable_private_access
  enable_logging             = var.enable_logging
  enable_replication         = var.enable_replication
  enable_website_hosting     = var.enable_website_hosting
  enable_random_special_case = var.enable_random_special_case
}

The bad one is not a capability. It is a provider resource wearing a costume.

3.2 Layer 2 — Opinionated Capability Module

This is the most important layer for platform engineering.

It represents a supported infrastructure capability:

encrypted object bucket;
private Postgres instance;
service account with cloud identity binding;
event topic with dead-letter queue;
namespace with quotas and default policies;
service ingress with TLS and WAF posture.

This module hides unsafe provider detail and exposes business-relevant choices.

Example consumer interface:

module "orders_events" {
  source  = "app.terraform.io/acme/event-topic/platform"
  version = "~> 3.2"

  name             = "orders-events"
  owner_team       = "order-platform"
  data_class       = "internal"
  retention_days   = 14
  consumer_groups  = ["billing", "fulfillment"]
  environment      = var.environment
}

Notice what is missing:

no raw encryption toggle;
no arbitrary IAM JSON;
no random provider-specific internal ID;
no allow_unencrypted = true;
no skip_policy = true.

The module decides the baseline. The consumer chooses within the supported envelope.

3.3 Layer 3 — Product or Service Blueprint

A blueprint composes capabilities for a common product shape.

Example:

module "service_runtime" {
  source  = "git::ssh://git.example.com/platform/blueprints.git//http-service?ref=v2.6.0"

  service_name       = "quote-api"
  owner_team         = "cpq-platform"
  runtime_tier       = "standard"
  database_profile   = "postgres-small"
  eventing_profile   = "kafka-standard"
  expose_publicly    = false
  environment        = var.environment
}

A blueprint may create:

namespace;
service account;
workload identity binding;
default network policy;
secrets references;
database claim;
event topic claim;
observability dashboard registration;
deployment manifests.

Blueprints are powerful, but risky.

If too broad, they couple unrelated lifecycles. A deployment namespace may be safe to recreate. A database is not. A topic may have retention semantics. A workload identity may be reused by multiple deploys.

A good blueprint composes stable capabilities but does not hide irreversible lifecycle risks.

3.4 Layer 4 — Environment Stack

The environment stack is not a reusable module. It is the composition root.

It binds:

exact module versions;
exact provider versions;
account/project/subscription;
region;
environment;
remote state dependencies;
policy context;
credentials and runner identity.

Example:

module "orders_events" {
  source  = "git::ssh://git.example.com/platform/iac-modules.git//event-topic?ref=v3.2.4"

  name           = "orders-events"
  owner_team     = "order-platform"
  data_class     = "internal"
  retention_days = 14
  environment    = local.environment
}

The stack is where you should be explicit.

Reusable modules should reduce accidental complexity. Environment stacks should preserve operational clarity.

4. The Most Important Rule: Module Boundary Must Match Lifecycle Boundary

A module should group resources that usually change together, fail together, and are owned together.

If two resources have different lifecycles, they probably should not be hidden behind one atomic module interface.

Ask these questions:

Can this resource be replaced safely together with the others?
Does the same team own its lifecycle?
Does it require the same approval level?
Does it share the same state backend?
Does it have the same rollback strategy?
Does it have the same data durability requirement?
Would consumers expect it to exist independently?

Examples:

Module Idea	Usually Good?	Reason
Bucket + bucket encryption + bucket policy	Yes	Same capability and lifecycle
Kubernetes namespace + default quota + baseline network policy	Yes	Same tenancy boundary
App deployment + database	Usually no	Different lifecycle and rollback semantics
VPC + all databases + all services	No	Massive blast radius
Topic + dead-letter queue	Often yes	Same messaging capability
IAM role + every permission the app might ever need	No	Unbounded privilege growth

The production module designer is allergic to lifecycle ambiguity.
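
A minimal sketch of the "App deployment + database" row, assuming hypothetical module paths, version tags, input names, and an `endpoint` output that are not part of the module set described here. The database and the deployment are separate module calls, so each appears as its own lifecycle component in the plan:

```hcl
# Hypothetical sources, inputs, and outputs, for illustration only.
module "quote_db" {
  source = "git::ssh://git.example.com/platform/iac-modules.git//postgres-database-instance?ref=v1.0.0"

  name        = "quote-db"
  owner_team  = "cpq-platform"
  data_class  = "confidential"
  environment = local.environment
}

module "quote_api" {
  source = "git::ssh://git.example.com/platform/iac-modules.git//http-service-deployment?ref=v1.0.0"

  service_name      = "quote-api"
  owner_team        = "cpq-platform"
  environment       = local.environment
  database_endpoint = module.quote_db.endpoint # assumed output name
}
```

In a larger estate the two calls would often live in separate stacks with separate state, so that rolling back a deployment never touches the database.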

5. Inputs: Expose Decisions, Not Implementation Details

A module input should represent a decision the consumer is allowed to make.

Do not expose an input just because the provider has a parameter.

Classify every input.

Input Type	Example	Should Consumer Control It?
Identity	`name`, `owner_team`, `service_id`	Usually yes
Classification	`data_class`, `criticality`, `internet_facing`	Yes, because policy depends on it
Capacity	`size`, `retention_days`, `replica_count`	Yes, within bounds
Environment Context	`environment`, `region`, `account_id`	Often passed by stack, not app team
Security Baseline	encryption, TLS, public ACL block	Usually no; enforce internally
Escape Hatch	raw IAM JSON, custom security group rules	Dangerous; require explicit exception model
Provider Internals	resource IDs, API quirks	Usually no

Bad input:

variable "enable_encryption" {
  type    = bool
  default = true
}

Why is this bad?

Because it suggests encryption is optional.

Better:

variable "kms_key_policy" {
  type        = string
  description = "Controls which managed encryption key class is used. Allowed: platform, team-managed, regulated."
  validation {
    condition     = contains(["platform", "team-managed", "regulated"], var.kms_key_policy)
    error_message = "kms_key_policy must be platform, team-managed, or regulated."
  }
}

Even better for most teams:

variable "data_class" {
  type        = string
  description = "Data classification used to select encryption, retention, logging, and access policy."
  validation {
    condition     = contains(["public", "internal", "confidential", "regulated"], var.data_class)
    error_message = "Unsupported data_class."
  }
}

The consumer describes the risk. The module chooses the controls.

That is a platform API.

6. Defaults Are Policy Decisions

A default is not merely convenience.

A default is a decision that applies when the consumer does not think.

That makes defaults one of the most important control surfaces in IaC.

Bad defaults:

variable "publicly_accessible" {
  type    = bool
  default = true
}

variable "deletion_protection" {
  type    = bool
  default = false
}

These defaults optimize for demo success and production incidents.

Good defaults:

variable "public_exposure" {
  type    = string
  default = "private"

  validation {
    condition     = contains(["private", "internal", "public-approved"], var.public_exposure)
    error_message = "public_exposure must be private, internal, or public-approved."
  }
}

variable "deletion_protection" {
  type    = bool
  default = true
}

Even better: for regulated resources, do not expose deletion protection at all. Encode it based on classification.

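locals {
  # Confidential and regulated data is always protected. Other classes are
  # protected unless deletion is explicitly allowed for non-production.
  deletion_protection = contains(["confidential", "regulated"], var.data_class) ? true : !var.allow_delete_for_non_prod
}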

A strong module makes safe behavior the path of least resistance.

7. Outputs: Export Stable Facts, Not Internals

Outputs are not harmless.

An output becomes another stack's dependency.

Once consumers depend on it, removing or changing it is a breaking API change.

Output only facts that are stable and meaningful at the capability boundary.

Good outputs:

output "bucket_name" {
  value       = aws_s3_bucket.this.bucket
  description = "Stable bucket name for application configuration."
}

output "write_policy_arn" {
  value       = aws_iam_policy.write.arn
  description = "Policy ARN granting write access to this bucket."
}

output "audit_resource_id" {
  value       = local.audit_resource_id
  description = "Stable ID used in audit evidence and ownership inventory."
}

Risky outputs:

output "everything" {
  value = aws_s3_bucket.this
}

Why risky?

Because it leaks provider internals and gives consumers accidental dependencies on implementation details.

The moment another stack reads module.bucket.everything.id, your module internals are no longer private.

Use outputs as a published API.
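
For illustration, a consuming stack that depends only on the published outputs might look like the sketch below. The `aws_iam_role.orders_service` resource and the `orders_bucket` module label are assumptions made for the example.

```hcl
# Attach the module's published write policy to a role owned by the consumer.
# The consumer never touches the bucket resource or its internal attributes.
resource "aws_iam_role_policy_attachment" "orders_writer" {
  role       = aws_iam_role.orders_service.name # assumed to exist in this stack
  policy_arn = module.orders_bucket.write_policy_arn
}
```

If the module later changes how the policy is built internally, this consumer keeps working as long as `write_policy_arn` stays in the contract.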

8. Versioning: Pin the Contract, Not the Mood

Terraform and OpenTofu support version constraints for providers and registry modules. OpenTofu documentation describes version constraint strings as ranges of acceptable versions for modules, providers, and OpenTofu itself. Terraform module documentation recommends explicitly constraining acceptable module versions to avoid unexpected or unwanted changes.

In production, versioning should answer three questions:

Which module version does this stack use?
Which provider version was used to compute and apply the plan?
Which engine version was used?

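A professional environment stack constrains all three and commits the dependency lock file (.terraform.lock.hcl), which records the exact provider versions that were selected. The lock file does not cover modules, so module versions need their own pin.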

terraform {
  required_version = "~> 1.8.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.70"
    }
  }
}

module "orders_bucket" {
  source  = "app.terraform.io/acme/object-storage/platform"
  version = "~> 2.4"

  name       = "orders-archive"
  data_class = "confidential"
}

For Git-sourced modules, pin tags or immutable references.

module "orders_bucket" {
  source = "git::ssh://git.example.com/platform/iac-modules.git//object-storage?ref=v2.4.3"
}

Avoid branch refs in production:

# Avoid for production
source = "git::ssh://git.example.com/platform/iac-modules.git//object-storage?ref=main"

A branch ref makes the same Git commit in the environment repo mean different infrastructure behavior depending on when the pipeline runs.

That violates reproducibility.
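
Git tags can be moved after the fact, so the strictest immutable reference is a full commit SHA. A sketch, using a placeholder SHA that does not refer to a real commit:

```hcl
module "orders_bucket" {
  # Placeholder commit SHA, for illustration only.
  source = "git::ssh://git.example.com/platform/iac-modules.git//object-storage?ref=0123456789abcdef0123456789abcdef01234567"

  name       = "orders-archive"
  data_class = "confidential"
}
```

The trade-off is readability: a SHA carries no semantic version, so teams that pin this way usually record the matching tag in a comment or in release tooling.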

8.1 Semantic Versioning for Modules

Use semantic versioning as a communication protocol:

Change Type	Example	Version Impact
Add optional input with safe default	`enable_access_logs` default true	Minor
Add output	`audit_resource_id`	Minor
Change default behavior	default retention from 7 to 30 days	Major or explicit migration
Rename input	`team` → `owner_team`	Major unless alias preserved
Remove output	remove `bucket_arn`	Major
Replace resource implementation	S3 bucket → provider abstraction	Major if state migration needed
Tighten validation	reject previously accepted value	Major if existing users break
Internal refactor no plan diff	locals cleanup	Patch

The version number should tell consumers how much thinking is required.

If every release is v1.0.0, you have no contract.
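
The "alias preserved" case in the table can be sketched as follows. This is one possible approach, not a prescribed pattern: the deprecated `team` input stays available for a transition period and the module reads a single resolved local.

```hcl
# Deprecated input, kept so existing callers do not break on upgrade.
variable "team" {
  type        = string
  default     = null
  description = "Deprecated: use owner_team."
}

variable "owner_team" {
  type        = string
  default     = null
  description = "Team accountable for lifecycle, cost, and incidents."
}

locals {
  # Prefer the new name and fall back to the deprecated one.
  # coalesce raises an error when both are null, so one of them must be set.
  owner_team = coalesce(var.owner_team, var.team)
}
```

Resources inside the module then reference `local.owner_team`, and the deprecated input can be removed in the next major version.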

8.2 Compatibility Matrix

Every serious module should declare compatibility.

Example:

# Compatibility

| Module Version | OpenTofu/Terraform | AWS Provider | Notes |
|---|---|---|---|
| 2.x | >=1.7, <1.10 | >=5.60, <6.0 | Current production line |
| 1.x | >=1.4, <1.8 | >=4.50, <5.0 | Security fixes only |

This avoids hidden upgrade traps.

9. Provider Handling: Declare Requirements, Do Not Secretly Configure Providers

A reusable module should declare provider requirements. The root module should configure provider instances.

Terraform documentation explains that each module must declare its own provider requirements so the engine can select a single compatible provider version across the configuration. Provider configurations themselves are shared from the root unless passed explicitly.

Good reusable module:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.60, < 6.0"
    }
  }
}

Bad reusable module:

provider "aws" {
  region = "us-east-1"
}

Why bad?

Because the reusable module silently decides where resources are created. That belongs to the root stack.

9.1 Provider Aliases

For multi-region or cross-account modules, provider aliases are legitimate.

Example root stack:

provider "aws" {
  alias  = "primary"
  region = "us-east-1"

  assume_role {
    role_arn = local.primary_role_arn
  }
}

provider "aws" {
  alias  = "replica"
  region = "us-west-2"

  assume_role {
    role_arn = local.replica_role_arn
  }
}

module "replicated_bucket" {
  source = "git::ssh://git.example.com/platform/iac-modules.git//replicated-object-storage?ref=v1.8.0"

  providers = {
    aws.primary = aws.primary
    aws.replica = aws.replica
  }

  name       = "orders-archive"
  data_class = "confidential"
}

The stack controls identity and location. The module controls capability implementation.

That separation is non-negotiable.
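
For the `providers` map above to work, the child module has to declare the aliases it expects to receive. A minimal sketch of the module side, assuming the internal resource names and the replica naming shown here (they are illustrative, not taken from a real module):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.60, < 6.0"

      # Aliases the root stack must pass in through the providers map.
      configuration_aliases = [aws.primary, aws.replica]
    }
  }
}

# Each resource names the provider configuration it should use.
resource "aws_s3_bucket" "primary" {
  provider = aws.primary
  bucket   = var.name
}

resource "aws_s3_bucket" "replica" {
  provider = aws.replica
  bucket   = "${var.name}-replica" # hypothetical naming rule
}
```

The module contains no `provider` block of its own, so region and role selection stay entirely in the root stack.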

10. Composition Root: Keep Environment Stacks Boring

A good module system makes environment stacks boring.

Boring does not mean tiny. It means predictable.

A stack should mostly contain:

provider configuration;
backend configuration;
locals for environment context;
module calls;
explicit dependencies;
outputs needed by adjacent stacks.

Example layout:

infra-live/
  prod/
    aws/
      us-east-1/
        network/
          backend.tf
          providers.tf
          main.tf
          outputs.tf
        data/
          backend.tf
          providers.tf
          main.tf
          outputs.tf
        services/
          quote-api/
            backend.tf
            providers.tf
            main.tf
            outputs.tf

The stack is the integration point.

Do not hide too much composition inside high-level modules. If a module creates network, databases, secrets, IAM, application deployment, and dashboards, then a plan diff becomes impossible to reason about.

The root stack should show the major lifecycle components.

11. Avoid Boolean-Driven Design

Boolean flags multiply state space.

A module with eight booleans has 256 theoretical combinations.

Most combinations are untested.

variable "enable_logs" { type = bool }
variable "enable_metrics" { type = bool }
variable "enable_backup" { type = bool }
variable "enable_replica" { type = bool }
variable "enable_public" { type = bool }
variable "enable_private" { type = bool }
variable "enable_iam" { type = bool }
variable "enable_policy" { type = bool }

This is not flexibility. This is unbounded product surface.

Prefer named profiles.

variable "runtime_profile" {
  type = string

  validation {
    condition = contains([
      "sandbox",
      "standard",
      "regulated",
      "high-availability"
    ], var.runtime_profile)
    error_message = "Unsupported runtime_profile."
  }
}

locals {
  profile = {
    sandbox = {
      backup_enabled  = false
      replica_enabled = false
      log_retention   = 7
    }
    standard = {
      backup_enabled  = true
      replica_enabled = false
      log_retention   = 30
    }
    regulated = {
      backup_enabled  = true
      replica_enabled = true
      log_retention   = 365
    }
    high-availability = {
      backup_enabled  = true
      replica_enabled = true
      log_retention   = 90
    }
  }[var.runtime_profile]
}

Profiles reduce invalid combinations and communicate intent.
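
A short sketch of how resources might consume the selected profile. The log group name, the `var.service_name` and `var.instance_class` inputs, and the `aws_db_instance.this` primary are assumptions for the example.

```hcl
resource "aws_cloudwatch_log_group" "service" {
  name              = "/platform/${var.service_name}" # hypothetical naming
  retention_in_days = local.profile.log_retention
}

# The replica exists only for profiles that ask for one.
resource "aws_db_instance" "replica" {
  count = local.profile.replica_enabled ? 1 : 0

  identifier          = "${var.service_name}-replica"
  replicate_source_db = aws_db_instance.this.identifier # primary assumed to be defined elsewhere
  instance_class      = var.instance_class
}
```

Every resource reads from `local.profile`, so adding a fifth profile means adding one map entry rather than another boolean.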

12. Escape Hatches Must Be Explicit Products

Every platform module eventually meets a real edge case.

The wrong response is to add generic escape hatches everywhere.

variable "extra_policy_json" {
  type    = string
  default = null
}

variable "custom_security_group_rules" {
  type    = any
  default = []
}

That approach moves risk from the platform team to consumers without a review model.

A better escape hatch has:

a name;
a reason;
a reviewer;
a policy check;
an expiration date if possible;
evidence.

Example:

variable "approved_exceptions" {
  type = list(object({
    id          = string
    reason      = string
    expires_on  = string
    approved_by = string
  }))
  default = []
}

Then policy can validate it.

package iac.exceptions

deny[msg] {
  input.module.name == "object_storage_bucket"
  input.change.public_exposure == "public-approved"
  count(input.module.approved_exceptions) == 0
  msg := "public-approved exposure requires an approved exception"
}

The goal is not to ban exceptions.

The goal is to make exceptions visible, reviewable, and temporary.
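
Part of that review model can be enforced before policy even runs. The sketch below extends the `approved_exceptions` variable with a validation block; the date format rule and the non-empty reason rule are assumptions about what a platform team might require.

```hcl
variable "approved_exceptions" {
  type = list(object({
    id          = string
    reason      = string
    expires_on  = string
    approved_by = string
  }))
  default = []

  validation {
    # Every exception needs a YYYY-MM-DD expiry and a non-empty reason.
    # alltrue returns true for an empty list, so "no exceptions" is valid.
    condition = alltrue([
      for e in var.approved_exceptions :
      can(regex("^\\d{4}-\\d{2}-\\d{2}$", e.expires_on)) && length(trimspace(e.reason)) > 0
    ])
    error_message = "Each exception needs expires_on in YYYY-MM-DD format and a non-empty reason."
  }
}
```

This only checks shape. Whether the exception is actually approved and still within its expiry date remains a job for the policy engine.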

13. Naming Is a Stability Problem

Naming looks cosmetic until replacement happens.

Many cloud resources cannot be renamed in place. A name change may force replacement, DNS changes, IAM policy updates, or consumer outages.

Module naming rules should be deterministic.

locals {
  resource_name = join("-", compact([
    var.org_prefix,
    var.environment,
    var.region_code,
    var.service_name,
    var.capability
  ]))
}

But deterministic does not mean opaque.

Bad:

name = "x9a-prod-ue1-qapi-obs-7f4"

Better:

name = "acme-prod-use1-quote-api-events"

A good naming scheme supports:

ownership discovery;
incident response;
cost allocation;
policy matching;
audit evidence;
stable imports;
provider length constraints.

For resources with globally unique names, use deterministic suffixes based on stable identity, not random values that change during refactors.
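
One way to derive such a suffix, sketched under the assumption that the identity inputs from the naming example above are the stable ones. The six-character length is arbitrary.

```hcl
locals {
  # Hash stable identity inputs. The suffix changes only if identity changes,
  # not when the module is refactored or re-applied.
  name_suffix = substr(
    sha256("${var.org_prefix}/${var.environment}/${var.service_name}/${var.capability}"),
    0,
    6
  )

  bucket_name = "${local.resource_name}-${local.name_suffix}"
}
```

Because the suffix is a pure function of the inputs, importing an existing resource or moving it between modules yields the same name.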

14. Tags and Labels Are Part of the Module Contract

Tags are not decoration.

They are control-plane metadata.

A production module should enforce a minimum metadata contract:

variable "owner_team" {
  type = string
}

variable "service_name" {
  type = string
}

variable "environment" {
  type = string
}

variable "data_class" {
  type = string
}

locals {
  mandatory_tags = {
    owner_team  = var.owner_team
    service     = var.service_name
    environment = var.environment
    data_class  = var.data_class
    managed_by  = "opentofu"
    module      = "object-storage-bucket"
  }

  # Mandatory tags come last so consumer-supplied extra_tags cannot override them.
  tags = merge(var.extra_tags, local.mandatory_tags)
}

The policy engine can then reason about resources consistently.

Without metadata, later automation becomes guesswork.

15. Data Sources: Use Carefully

Data sources are reads from reality.

They are useful, but they can make plans less deterministic.

Examples:

data "aws_ami" "latest" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*"]
  }
}

This looks convenient, but it means the same Git commit may plan differently tomorrow.

That may be acceptable in a dev stack. It is usually risky in production.

Prefer explicit version inputs for critical artifacts:

variable "machine_image_id" {
  type        = string
  description = "Approved immutable image ID selected by the release pipeline."
}

Use data sources for stable discovery:

account identity;
current region;
existing platform-managed network ID;
approved parameter path;
remote state output from a stable stack.

Avoid data sources for mutable “latest” selection in production unless the selection itself is governed and recorded.
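
Two examples of stable discovery from the list above, as a sketch. The `name` attribute on `aws_region` matches the 5.x provider line used elsewhere in this part; other provider major versions may expose it differently.

```hcl
# These reads return the same answer on every run for a given runner identity
# and provider configuration, so they do not make plans drift over time.
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
  region     = data.aws_region.current.name
}
```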

16. Remote State Dependencies: Treat as API Calls

Remote state output is a cross-stack API call.

If stack B reads outputs from stack A, then stack A has published an API.

This creates ordering constraints.

Design rules:

Only export stable outputs.
Avoid exporting entire resource objects.
Version output contracts when possible.
Keep dependency direction acyclic.
Document downstream consumers.
Avoid long chains of remote state dependency.

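Bad dependency graph: stack A reads outputs from stack B, while stack B, directly or through other stacks, reads outputs from stack A.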

The cycle means your pipeline no longer has a clear apply order.

Cross-stack dependencies should form a directed acyclic graph.
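
A sketch of one edge in that graph: a service stack reading a single named output from the network stack. The state bucket name, the state key, and the `private_subnet_ids` output are hypothetical.

```hcl
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "acme-tfstate-prod" # hypothetical state bucket
    key    = "prod/aws/us-east-1/network/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Depend on one named, documented output, not on the whole state.
  private_subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

The dependency direction is visible in the code: services read from network, and nothing in the network stack reads back from services.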

17. Lifecycle Meta-Arguments Are Sharp Tools

Terraform/OpenTofu lifecycle controls can protect resources or hide dangerous behavior.

Example:

resource "aws_db_instance" "this" {
  # ...

  lifecycle {
    prevent_destroy = true
  }
}

prevent_destroy is useful for databases and critical stateful resources. But if every resource has it, routine cleanup becomes impossible.

ignore_changes is even sharper.

lifecycle {
  # Owner: service autoscaler. Reason: it adjusts desired_count at runtime.
  ignore_changes = [desired_count]
}

This can be valid when another controller owns the field. But it can also hide drift.

Classify ignored fields:

Ignore Reason	Acceptable?	Example
Owned by autoscaler	Yes	replica count
Owned by external controller	Yes, documented	generated annotation
Provider read noise	Sometimes	unstable timestamp
Manual production patch	Dangerous	security group rules
Avoiding a planned diff without understanding it	No	anything

Every ignore_changes should have a comment explaining owner and reason.

18. Module Testing Is Contract Testing

Testing modules is not only “does plan succeed.”

Test the contract.

18.1 Static Checks

Run:

formatting;
validation;
linting;
provider lock consistency;
documentation generation checks;
module metadata checks.

18.2 Input Validation Tests

Verify invalid combinations fail early.

Examples:

public exposure without exception;
regulated data without backup;
retention below minimum;
invalid owner team;
unsupported region.
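
With the native test framework in recent Terraform and OpenTofu releases (`terraform test` or `tofu test`), a rejected input can be asserted directly. The sketch below assumes a hypothetical file `tests/validation.tftest.hcl` and a module whose `retention_days` variable rejects values below 7.

```hcl
# The run passes only if validation of retention_days fails as expected.
run "rejects_retention_below_minimum" {
  command = plan

  variables {
    name           = "orders-archive"
    owner_team     = "order-platform"
    data_class     = "internal"
    retention_days = 3 # below the minimum of 7
  }

  expect_failures = [
    var.retention_days,
  ]
}
```

Depending on the engine version, the run may still need provider configuration or a mocked provider even though the failure happens during input validation.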

18.3 Plan Snapshot Tests

For known fixture inputs, generate plans and compare expected structural behavior.

Do not snapshot every provider-computed value. Snapshot meaningful decisions:

number of resources;
resource types;
encryption enabled;
public access blocked;
tags present;
deletion protection enabled;
IAM policy shape.
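
A plan-level assertion on a few of these decisions can be sketched in the same native test format. It assumes the module contains a resource addressed as `aws_s3_bucket.this` with tags derived from inputs, and that the test runner has AWS credentials or a mocked provider.

```hcl
run "regulated_bucket_carries_contract_tags" {
  command = plan

  variables {
    name           = "orders-archive"
    owner_team     = "order-platform"
    data_class     = "regulated"
    retention_days = 30
  }

  # Assert decisions, not provider-computed values.
  assert {
    condition     = aws_s3_bucket.this.tags["data_class"] == "regulated"
    error_message = "Bucket must carry the data_class tag from the module input."
  }

  assert {
    condition     = aws_s3_bucket.this.tags["owner_team"] == "order-platform"
    error_message = "Bucket must carry the owner_team tag from the module input."
  }
}
```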

18.4 Ephemeral Apply Tests

For critical modules, run short-lived apply tests in sandbox accounts or projects.

The test should create, verify, and destroy.

This catches provider behavior that static validation cannot catch.

18.5 Upgrade Tests

For each supported previous version, test upgrade to current.

This is what separates a real module product from a folder of HCL.

Upgrade test matrix:

From	To	Expected
v2.3.0	v2.4.0	no replacement
v2.4.0	v3.0.0	migration required
v1.9.5	v2.0.0	output rename compatibility verified

19. Documentation: Write the Operational Contract

Module docs should not merely list variables.

They should answer operational questions.

Minimum module README:

# object-storage-bucket

## Capability
Creates a private, encrypted, tagged object storage bucket with access logging and policy-managed exposure.

## When to Use
Use for application-owned object data that must be accessed by workloads through platform-managed IAM.

## When Not to Use
Do not use for public website hosting, cross-organization sharing, or ungoverned data lake storage.

## Security Controls
- Public access blocked by default.
- Encryption enforced.
- Mandatory ownership tags.
- Access policy generated by module.

## Inputs
...

## Outputs
...

## Upgrade Notes
...

## Known Replacement Risks
Changing `name` forces replacement.
Changing `data_class` from internal to regulated may add replication and logging resources.

## Examples
...

The best module docs explain consequences.

A variable table without consequences is not enough.

20. Deprecation and Migration Strategy

Modules evolve.

Production systems need a migration path.

A breaking module change should include:

release notes;
migration guide;
state move/import commands if needed;
expected plan diff;
rollback/rollforward guidance;
support window;
owner contact;
sample PR.

Example release note:

# v3.0.0 Migration Notes

## Breaking Change
The module now creates a separate access log bucket instead of reusing the data bucket prefix.

## Why
Required for regulated audit retention and lifecycle isolation.

## Expected Plan
- Creates one new log bucket.
- Adds bucket policy to allow log delivery.
- No replacement of existing data bucket.

## Required Action
Add `log_retention_days` explicitly if your workload is regulated.

## Rollback
Safe before apply. After apply, downgrade requires manual cleanup of the log bucket.

Do not ship breaking changes as mysteries.

21. State Refactor: Moved Blocks and Import Blocks

Resource address changes are dangerous because state maps addresses to real objects.

Renaming a resource without state movement may cause the engine to plan delete/create.

Use explicit state migration features where supported.

Example conceptual refactor:

moved {
  from = aws_s3_bucket.main
  to   = aws_s3_bucket.this
}

This tells the engine that the address changed but the object identity remains.

For existing resources not yet managed, use import workflows carefully and review the first plan after import.
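
Where the engine version supports declarative import, the import can be reviewed in a pull request like any other change. A sketch, with a hypothetical bucket name standing in for the existing resource:

```hcl
# Adopt an existing bucket into state at the address the module uses.
# For aws_s3_bucket, the import ID is the bucket name.
import {
  to = aws_s3_bucket.this
  id = "acme-prod-use1-orders-archive" # hypothetical existing bucket
}

resource "aws_s3_bucket" "this" {
  bucket = "acme-prod-use1-orders-archive"
}
```

The first plan after adding the block should show an import and, ideally, no other changes. Any unexpected diff is a sign that the configuration does not yet match the real resource.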

State migration should be treated like database migration:

reviewed;
tested;
reversible if possible;
documented;
tied to a module version.

22. Policy-Compatible Module Design

A module should make policy easy.

Policy engines inspect planned resources and configuration. If module design hides intent or produces inconsistent metadata, policy becomes brittle.

Good module input:

variable "data_class" {
  type = string
}

Good tags:

locals {
  tags = {
    data_class = var.data_class
    owner_team = var.owner_team
    managed_by = "opentofu"
  }
}

Good policy:

package iac.storage

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  resource.change.after != null  # skip pure deletions, which have no planned state
  not resource.change.after.tags.data_class
  msg := sprintf("%s is missing data_class tag", [resource.address])
}

If every module invents different tag keys, policy must encode a dictionary of exceptions.

A strong module system creates uniform policy shape.

23. Module Registry as a Product Surface

A module registry is not only a download mechanism.

It is a product catalog.

A good internal registry shows:

module name;
capability description;
owner;
support status;
latest version;
compatibility matrix;
security posture;
examples;
migration notes;
deprecation status;
known replacement risks.

Do not let random modules become production dependencies without ownership.

Every production module needs a maintainer and support policy.

24. Review Heuristics for Module PRs

When reviewing a module PR, do not only read the diff.

Ask these questions:

24.1 API Surface

Is a new input truly a consumer decision?
Is the type specific enough?
Are invalid combinations rejected?
Is the default safe?
Is this adding a long-term support burden?

24.2 Lifecycle

Could this change force replacement?
Which resource address changes?
Is state migration needed?
Does this affect existing consumers?
Does rollback work after apply?

24.3 Security

Does the module weaken baseline controls?
Does it add an escape hatch?
Are exceptions explicit and auditable?
Are tags/labels preserved?

24.4 Compatibility

Is this patch/minor/major?
Are release notes updated?
Are examples updated?
Are previous-version upgrade tests included?

24.5 Operations

Are outputs stable?
Are logs/metrics/audit metadata present?
Is ownership visible?
Are failure modes documented?

A senior reviewer protects future operators from today's convenience.

25. Example: Designing an Object Storage Module

Let us design a production module from scratch.

25.1 Capability Statement

Creates a private, encrypted object storage bucket for application-owned data.
The module enforces public access blocking, ownership tags, managed encryption, access logging, and lifecycle rules based on data classification.

25.2 Allowed Consumer Decisions

Decision	Input
What is the bucket for?	`purpose`
Who owns it?	`owner_team`
What data class?	`data_class`
How long retain objects?	`retention_days`
Which workloads need access?	`readers`, `writers`
Is public exposure required?	`public_exposure`, exception only

25.3 Disallowed Consumer Decisions

Disallowed	Why
Disable encryption	Violates baseline
Disable public access block	Requires exception process
Arbitrary bucket policy JSON	Hard to validate and audit
Untagged resource creation	Breaks inventory and cost controls
Random name override	Breaks naming and import conventions

25.4 Interface Sketch

variable "name" {
  type        = string
  description = "Stable logical bucket name. Changing this may force replacement."
}

variable "owner_team" {
  type        = string
  description = "Team accountable for lifecycle, cost, and incidents."
}

variable "data_class" {
  type        = string
  description = "Data classification used to derive security controls."

  validation {
    condition     = contains(["internal", "confidential", "regulated"], var.data_class)
    error_message = "data_class must be internal, confidential, or regulated."
  }
}

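variable "retention_days" {
  type        = number
  description = "Minimum object retention period, in days."

  validation {
    # Require a whole number of days, at least 7.
    condition     = var.retention_days >= 7 && floor(var.retention_days) == var.retention_days
    error_message = "retention_days must be a whole number of at least 7."
  }
}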

variable "public_exposure" {
  type        = string
  default     = "private"
  description = "Exposure class. public-approved requires an approved exception."

  validation {
    condition     = contains(["private", "public-approved"], var.public_exposure)
    error_message = "Unsupported public_exposure."
  }
}
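
The sketch above accepts `public-approved` but does not yet require evidence of the exception. One possible way to close that gap, offered as an assumption rather than the lesson's own design, is a companion exception object plus a precondition. The variable name `public_exposure_exception` and its attributes are hypothetical. The block extends the interface sketch and reuses its `public_exposure` variable.

```hcl
# Hypothetical companion input; the attribute names are assumptions.
variable "public_exposure_exception" {
  type = object({
    ticket_id   = string
    approved_by = string
    expires_on  = string # RFC 3339 timestamp, e.g. "2027-01-31T00:00:00Z"
  })
  default     = null
  description = "Approved exception record. Required when public_exposure is public-approved."

  validation {
    # can() turns a formatdate() error into false, so a malformed
    # timestamp is rejected when the plan is evaluated.
    condition = (
      var.public_exposure_exception == null ||
      can(formatdate("YYYY-MM-DD", var.public_exposure_exception.expires_on))
    )
    error_message = "expires_on must be an RFC 3339 timestamp."
  }
}

output "public_exposure" {
  description = "Effective exposure class."
  value       = var.public_exposure

  # Fails the plan when public exposure is requested without an exception record.
  precondition {
    condition     = var.public_exposure == "private" || var.public_exposure_exception != null
    error_message = "public-approved requires public_exposure_exception."
  }
}
```

This only checks that an exception record exists and is well formed. Comparing `expires_on` with the current date is left to policy checks or the pipeline in this sketch.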

25.5 Derived Controls

locals {
  regulated = var.data_class == "regulated"

  access_log_retention_days = local.regulated ? 365 : 90
  versioning_enabled        = contains(["confidential", "regulated"], var.data_class)
  deletion_protection       = local.regulated

  tags = {
    owner_team = var.owner_team
    data_class = var.data_class
    managed_by = "opentofu"
    module     = "object-storage-bucket"
  }
}

The consumer states the classification. The module derives controls.

That is the point.
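
From the consumer side, a call to this module might look like the sketch below. The source path and the bucket details are hypothetical. The comments trace how the locals above evaluate for the chosen classification.

```hcl
module "invoices_bucket" {
  source = "./modules/object-storage-bucket" # hypothetical local path

  name           = "invoices"
  owner_team     = "billing"
  data_class     = "regulated"
  retention_days = 365
}

# With data_class = "regulated", the locals above evaluate to:
#   access_log_retention_days = 365
#   versioning_enabled        = true
#   deletion_protection       = true
#
# With data_class = "internal", they would be 90, false, and false.
```

The consumer never sets versioning or deletion protection directly, so there is no input that could weaken them.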

26. Example: Module Release Pipeline

A production module repo should have its own pipeline.

Do not release module changes directly from untested local machines.

Module release is part of the platform supply chain.
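
One stage of such a pipeline can be a contract test suite that runs before a version is tagged. The sketch below assumes the object storage interface from section 25.4 and a tool version that ships the built-in test command (`tofu test` or `terraform test`). The file path and run names are hypothetical.

```hcl
# tests/contract.tftest.hcl (hypothetical path inside the module repo)

# Baseline inputs shared by every run block below.
variables {
  name           = "contract-test"
  owner_team     = "platform"
  data_class     = "internal"
  retention_days = 30
}

run "rejects_unknown_data_class" {
  command = plan

  variables {
    data_class = "top-secret"
  }

  # This run passes only if the data_class validation rule fails.
  expect_failures = [var.data_class]
}

run "rejects_short_retention" {
  command = plan

  variables {
    retention_days = 1
  }

  # This run passes only if the retention_days validation rule fails.
  expect_failures = [var.retention_days]
}
```

These runs cover only the input contract. A fuller pipeline would presumably add the upgrade tests and release notes discussed earlier in this lesson before publishing a tagged version.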

27. Anti-Patterns

27.1 The God Module

One module creates everything.

platform-service/
  creates network
  creates database
  creates iam
  creates kubernetes namespace
  creates helm release
  creates dashboards
  creates alerts

The plan is unreadable. The blast radius is unclear. Upgrade is terrifying.

27.2 The Transparent Wrapper

The module exposes every provider option.

It creates no policy value.

27.3 The Boolean Matrix

Thirty flags define hundreds of untested combinations.

27.4 The Unowned Module

Everyone uses it. Nobody owns it. No one knows whether it is safe to upgrade.

27.5 The Hidden Provider

The module configures providers internally and silently creates resources in unexpected accounts or regions.
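
The usual alternative is for the module to declare what it needs and for the root to decide where it runs. The sketch below shows both sides in one listing. It assumes the AWS provider purely for illustration, since this lesson does not name a cloud, and the paths and region are hypothetical.

```hcl
# --- modules/object-storage-bucket/versions.tf (module side) ---
# The module declares which provider it requires, but contains
# no provider block of its own.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
  }
}

# --- main.tf (root side) ---
provider "aws" {
  alias  = "eu"
  region = "eu-west-1"
}

module "orders_bucket" {
  source = "./modules/object-storage-bucket"

  # Account and region are chosen here, in the root,
  # where reviewers can see them.
  providers = {
    aws = aws.eu
  }

  name           = "orders"
  owner_team     = "order-platform"
  data_class     = "confidential"
  retention_days = 30
}
```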

27.6 The Output Leak

The module exports entire provider objects and accidentally freezes implementation details.
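
A short contrast, assuming an AWS S3 bucket as the underlying resource (an assumption made only for illustration):

```hcl
variable "name" {
  type        = string
  description = "Bucket name."
}

resource "aws_s3_bucket" "this" {
  bucket = var.name
}

# Leaky: consumers can reach any attribute, so every attribute
# becomes part of the module's public contract.
# output "bucket" {
#   value = aws_s3_bucket.this
# }

# Stable: small, named values the module commits to supporting.
output "bucket_name" {
  description = "Name of the bucket."
  value       = aws_s3_bucket.this.bucket
}

output "bucket_arn" {
  description = "ARN for use in consumer access policies."
  value       = aws_s3_bucket.this.arn
}
```

With the narrow outputs, the module can later change how the bucket is implemented as long as these two values keep their meaning.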

27.7 The Branch-Pinned Production Module

Production points at main. The same environment commit can mean different infrastructure tomorrow.
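
The difference is visible in the `ref` query parameter of a Git source. The repository URL and tag below are hypothetical.

```hcl
# Fragile: "main" moves, so the same commit of this file can
# resolve to different module code tomorrow.
# module "orders_bucket" {
#   source = "git::https://git.example.com/platform/iac-modules.git//object-storage-bucket?ref=main"
# }

# Reproducible: pinned to a release tag. A full commit SHA is stricter
# still, because a tag can be moved and a commit hash cannot.
module "orders_bucket" {
  source = "git::https://git.example.com/platform/iac-modules.git//object-storage-bucket?ref=v2.3.1"

  name           = "orders"
  owner_team     = "order-platform"
  data_class     = "confidential"
  retention_days = 30
}
```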

27.8 The Escape Hatch Platform

Every module has custom_json, extra_rules, skip_policy, and allow_anything.

This is not a platform. It is a liability generator.

28. Failure Model

Failure	Cause	Prevention	Recovery
Module upgrade replaces critical resource	Breaking change shipped as minor	upgrade tests, release notes, replacement detection	stop apply, state review, restore previous version, migrate carefully
Consumer depends on internal output	output leaked provider object	export stable outputs only	add compatibility output, deprecate slowly
Policy cannot classify resource	inconsistent tags/labels	mandatory metadata contract	fix module, backfill tags
Stack plans differently tomorrow	unpinned module/provider/latest data source	pin versions, lock dependencies	reproduce with lock file, pin artifact
Provider configured inside module	hidden region/account	root-only provider config	refactor provider passing, state migration if needed
Boolean combination untested	flag explosion	profile-based API	introduce profiles, deprecate flags
State refactor causes recreate	address renamed without move	moved/import blocks, upgrade tests	stop apply, move state, re-plan
Exception becomes permanent	ungoverned escape hatch	exception object with expiry	audit exceptions, remove or formalize

Failure modeling should happen during module design, not after the incident.

29. Production Checklist

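Before publishing a module, verify:

the capability statement and owner are documented;
every input is a genuine consumer decision with a specific type and validation;
defaults are safe, and baseline controls cannot be disabled without an explicit, auditable exception;
outputs are stable values, not whole provider objects;
providers are configured only in the root, never inside the module;
mandatory tags and labels are applied to every resource;
the version bump matches the change (patch, minor, or major) and release notes are updated;
examples and migration notes are current;
upgrade tests from the previous version pass without unexpected replacement.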

30. Practice: Redesign a Weak Module

Take this weak interface:

module "database" {
  source = "./modules/db"

  name                        = "orders"
  engine                      = "postgres"
  engine_version              = "16" # "version" is reserved for the module version constraint
  public                      = false
  encrypted                   = true
  backup                      = true
  backup_days                 = 7
  deletion_protection         = false
  allow_major_version_upgrade = true
  custom_parameter_group      = var.custom_parameter_group
  custom_security_group_ids   = var.custom_security_group_ids
  tags                        = var.tags
}

Redesign it as a platform module.

A stronger interface might be:

module "orders_database" {
  source  = "app.terraform.io/acme/postgres-database/platform"
  version = "~> 4.1"

  name               = "orders"
  owner_team         = "order-platform"
  environment        = "prod"
  data_class         = "confidential"
  availability_tier  = "standard-ha"
  capacity_profile   = "medium"
  maintenance_window = "sun:03:00-sun:04:00"
}

The module derives:

encryption;
backup retention;
deletion protection;
network placement;
monitoring;
tags;
allowed upgrade behavior;
audit metadata.

The consumer should not need to know every provider knob to request a safe database.
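
Inside the module, this derivation can be sketched as lookup tables keyed by the profile names. The sizing numbers, retention periods, and the rule tying deletion protection to `environment` are placeholders chosen for illustration, not recommendations from this lesson.

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment, for example dev or prod."
}

variable "data_class" {
  type        = string
  description = "Data classification used to derive security controls."
}

variable "capacity_profile" {
  type        = string
  description = "Named capacity tier."

  validation {
    condition     = contains(["small", "medium", "large"], var.capacity_profile)
    error_message = "capacity_profile must be small, medium, or large."
  }
}

locals {
  # Hypothetical sizing table; all values are placeholders.
  capacity_profiles = {
    small  = { vcpus = 2, memory_gb = 8, storage_gb = 50 }
    medium = { vcpus = 4, memory_gb = 16, storage_gb = 200 }
    large  = { vcpus = 8, memory_gb = 32, storage_gb = 1000 }
  }

  # The validated profile name selects one row of the table.
  capacity = local.capacity_profiles[var.capacity_profile]

  # Controls the consumer never sets directly (placeholder rules).
  backup_retention_days = var.data_class == "regulated" ? 35 : 14
  deletion_protection   = var.environment == "prod"
}
```

Three named profiles give three tested configurations, instead of every combination of size, memory, and storage a consumer could otherwise type in.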

31. What You Should Internalize

A production IaC module is a product boundary.

It is not a convenience folder.

Strong module systems have clear capability names, safe defaults, narrow inputs, stable outputs, explicit versioning, visible ownership, migration paths, and tests that verify behavior across upgrades.

Weak module systems expose provider chaos and call it flexibility.

The top-level skill is judgment:

expose what consumers should decide; hide what the platform must guarantee; document what may break; test what must not break.

If you master that sentence, your IaC starts looking less like scripts and more like a real infrastructure platform.

References

OpenTofu Modules: https://opentofu.org/docs/language/modules/
OpenTofu Version Constraints: https://opentofu.org/docs/language/expressions/version-constraints/
OpenTofu Provider Configuration: https://opentofu.org/docs/language/providers/configuration/
OpenTofu State: https://opentofu.org/docs/language/state/
OpenTofu Workspaces: https://opentofu.org/docs/language/state/workspaces/
Terraform Module Block Reference: https://developer.hashicorp.com/terraform/language/block/module
Terraform Provider Requirements: https://developer.hashicorp.com/terraform/language/providers/requirements
Terraform Providers Within Modules: https://developer.hashicorp.com/terraform/language/modules/develop/providers
Terraform Version Constraints: https://developer.hashicorp.com/terraform/language/expressions/version-constraints
