Most discussions about AI infrastructure converge on the same image: dense GPU clusters, extreme cooling loads, power measured in megawatts. That picture is accurate for one phase of the AI lifecycle—training. It does not describe what most production AI infrastructure actually looks like.
Training and inference are two fundamentally different workloads, with different power profiles, density requirements, network constraints, and cost structures. The distinction determines where you build, how much you spend, and whether your deployment stays within Indonesia’s regulatory boundaries.
What Training and Inference Actually Involve
AI training is the process of building a model. A neural network processes a large dataset and adjusts billions of internal parameters to improve its predictions. For large language models, this requires thousands of GPUs operating in tight coordination over weeks or months—continuous, enormous compute demand that is highly sensitive to inter-GPU communication speed.
AI inference is what happens after training: the finished model serves requests from real users. A chatbot answers a question, a fraud detection system scores a transaction, a recommendation engine returns results. Each inference request is computationally lighter than a training step, but inference runs continuously at scale across millions of requests per day.
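The operational difference shows up clearly in code. The following is a minimal, illustrative PyTorch sketch (a toy model, not any specific production system): a training step updates the model’s parameters and must be repeated across an enormous dataset, while an inference call only reads the frozen parameters once per request.

```python
import torch
import torch.nn as nn

# Toy model standing in for a far larger network.
model = nn.Linear(128, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# --- Training step: forward pass, loss, backward pass, parameter update.
# Repeated over billions of examples, across thousands of coordinated GPUs.
model.train()
features = torch.randn(64, 128)              # dummy training batch
labels = torch.randint(0, 2, (64,))
optimizer.zero_grad()
loss = loss_fn(model(features), labels)
loss.backward()                              # gradient sync is what stresses inter-GPU links at scale
optimizer.step()                             # parameters change

# --- Inference: one forward pass per request, parameters frozen.
# Repeated millions of times a day, with a user waiting on each result.
model.eval()
with torch.no_grad():
    request = torch.randn(1, 128)            # one incoming request
    prediction = model(request).argmax(dim=1)
```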
According to the International Energy Agency’s Energy and AI report, inference already accounts for a larger share of AI compute than training in many deployed systems. By 2026, it is projected to represent roughly two-thirds of total AI compute globally—up from one-third in 2023. The infrastructure conversation has shifted accordingly.
Power and Density: Where Training and Inference Diverge
Training infrastructure is defined by density. A rack of NVIDIA H100 or Blackwell-generation accelerators can draw 40–140 kW or more, requiring direct-to-chip liquid cooling that air systems cannot match. As global data center electricity consumption climbs toward 1,050 TWh annually, training workloads drive the highest-density facilities being built today. Demand is also bursty: once a model is trained, those GPUs are redeployed or sit idle.
Inference has a different power profile, typically 10–30 kW per rack, a range that high-end air cooling or hybrid liquid systems can handle. What inference demands is consistency: stable power with a guaranteed SLA, because inference is always on. A 30-minute outage at a training cluster is an inconvenience. The same outage at an inference cluster is a service failure affecting real users.
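A back-of-the-envelope calculation makes the density gap concrete. The figures below are assumptions for illustration only (roughly 700 W per training GPU, a few hundred watts per inference accelerator, plus host overhead); actual numbers depend on hardware generation and server design.

```python
# Rough rack-power arithmetic with assumed, illustrative figures.

# Training: 8-GPU servers in the H100 class, ~700 W per GPU
# plus an assumed ~2.5 kW of CPU, memory and network overhead per server.
training_server_w = 8 * 700 + 2500
print(f"4 training servers: ~{4 * training_server_w / 1000:.0f} kW per rack")    # ~32 kW
# Rack-scale Blackwell-generation systems pack ~72 GPUs into a single cabinet,
# which is how training racks climb past 100 kW and into liquid-cooling territory.

# Inference: fewer, lower-power accelerators per server.
inference_server_w = 2 * 350 + 800
print(f"8 inference servers: ~{8 * inference_server_w / 1000:.0f} kW per rack")  # ~12 kW
```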
Latency Is the Defining Constraint for Inference
Training is throughput-bound—the goal is maximum compute operations per unit of time, and no user is waiting on a training job.
Inference is latency-bound. A fraud detection model must return a decision before a payment times out. A chatbot must respond fast enough to feel like a conversation. The commercial viability of an AI application is directly tied to network latency between the inference server and the end user.
Geography therefore matters for inference in a way it does not for training. Physical distance adds unavoidable latency. An inference deployment hosted offshore and accessed over the public internet will deliver measurably worse response times than one hosted locally inside a well-connected facility. For applications serving Indonesian users, that means infrastructure inside the country, in a facility with rich carrier and peering options—or distributed further to edge data centers for the most latency-sensitive workloads.
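Distance translates into latency in a way you can estimate directly. The sketch below assumes light travels through fibre at roughly two-thirds of its vacuum speed and uses approximate straight-line distances; real fibre routes are longer, and switching and processing time only add to the totals.

```python
# Approximate round-trip propagation delay over fibre.
# Assumption: signal speed ~200,000 km/s, i.e. ~200 km per millisecond.
FIBRE_KM_PER_MS = 200.0

def round_trip_ms(distance_km: float) -> float:
    return 2 * distance_km / FIBRE_KM_PER_MS

# Illustrative one-way distances, approximate.
scenarios = {
    "User in Jakarta -> facility in Jakarta": 30,
    "User in Jakarta -> Singapore region": 900,
    "User in Jakarta -> US West Coast region": 14_000,
}

for label, km in scenarios.items():
    print(f"{label}: ~{round_trip_ms(km):.1f} ms round trip (propagation only)")
```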
What Inference-Optimised Infrastructure Looks Like
| Requirement | Training | Inference |
|---|---|---|
| Rack power density | 40–140+ kW | 10–30 kW typical |
| Cooling type | Liquid (direct-to-chip or immersion) | Air or hybrid liquid |
| Network priority | Inter-GPU bandwidth (InfiniBand) | Low-latency connectivity to users and cloud |
| Demand pattern | Bursty (weeks, then idle) | Continuous, always-on |
| SLA priority | Maximum throughput during run | Power and connectivity uptime |
Inference deployments frequently use a hybrid architecture: model weights and serving infrastructure in colocation, with data pipelines and logging connecting to cloud storage via cloud exchange connectivity. The inference facility needs direct, private access to cloud providers—not just raw connectivity, but dedicated paths that bypass public internet variability. Horizontal scaling is the standard growth model, so the facility must support incremental expansion with predictable lead times.
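As a rough sketch of that split, the serving path below stays entirely local while logging is batched and shipped asynchronously toward cloud storage. It uses only the Python standard library; `upload_to_cloud_storage` is a placeholder, not any provider’s API, and the stand-in scoring function takes the place of a real model.

```python
import queue
import threading
import time

log_queue: "queue.Queue[dict]" = queue.Queue()

def upload_to_cloud_storage(batch: list) -> None:
    """Placeholder: in practice this writes to object storage over a private
    cloud-exchange circuit rather than the public internet."""
    print(f"uploaded batch of {len(batch)} log records")

def log_shipper() -> None:
    """Background worker: drains the queue and uploads logs in batches,
    so the latency-sensitive serving path never waits on the cloud."""
    while True:
        batch = [log_queue.get()]
        while not log_queue.empty() and len(batch) < 100:
            batch.append(log_queue.get())
        upload_to_cloud_storage(batch)

threading.Thread(target=log_shipper, daemon=True).start()

def handle_request(features: list) -> float:
    # Served from the colocation facility: model weights are local, so the
    # user-facing response never depends on a cloud round trip.
    score = sum(features) / len(features)        # stand-in for real model inference
    log_queue.put({"features": features, "score": score})
    return score

print(handle_request([0.2, 0.8, 0.5]))
time.sleep(0.2)                                  # give the shipper a moment to flush
```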
The Cost Case for Colocation at Inference Scale
Cloud is the natural starting point for AI deployment—managed APIs, auto-scaling, and zero upfront hardware cost make sense during experimentation. At production volume, the economics shift.
Cloud inference is billed on compute time and data transfer. Egress fees—charged when data leaves the provider’s network—compound at scale. An application serving millions of requests per day generates significant outbound transfer; over a year, egress costs frequently exceed the cost of equivalent colocation infrastructure.
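The arithmetic is straightforward. The figures below are assumptions chosen purely for illustration (20 million requests per day, 100 KB of outbound data per response, an egress rate of US$0.09 per GB); plug in your own volumes and rates.

```python
# Back-of-the-envelope egress cost using assumed, illustrative figures.
requests_per_day = 20_000_000
kb_per_response = 100                     # assumed outbound payload per request
egress_usd_per_gb = 0.09                  # assumed public-cloud egress rate

gb_per_day = requests_per_day * kb_per_response / 1_000_000
annual_egress_usd = gb_per_day * 365 * egress_usd_per_gb

print(f"Outbound transfer: ~{gb_per_day:,.0f} GB/day")        # ~2,000 GB/day
print(f"Annual egress: ~US${annual_egress_usd:,.0f}")         # ~US$65,700/year
# The figure scales linearly with request volume and payload size, and it sits
# on top of per-request compute charges rather than buying any extra capacity.
```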
Colocation eliminates egress for traffic that stays within the facility or moves via private interconnection. A company running inference in colocation and connecting to cloud storage via a private exchange pays for cloud services used—not for every byte crossing the provider’s network boundary.
The Indonesia Angle: Compliance Is Part of the Infrastructure Decision
Indonesia’s Personal Data Protection Law restricts cross-border transfers of personal data. AI applications processing financial records, health data, or communications cannot route that data to offshore inference servers without significant legal exposure. For financial institutions under OJK oversight, the constraint is explicit: data processing infrastructure must remain onshore.
Running LLM-based fraud detection or credit scoring on offshore cloud inference puts a regulated institution in a difficult position. The same application running inside a Jakarta colocation facility operates within the jurisdiction where the data was collected, and serves Indonesia’s more than 200 million internet users with lower round-trip latency than any offshore region can offer.
Conclusion
Training demands dense, bursty compute for a defined window. Inference demands consistent uptime, low latency, and cost-efficient scaling at production volumes—requirements that point toward colocation inside the market you serve.
The clearest signal that a deployment has crossed into inference-scale is when the question shifts from “how many GPUs can we stack” to “how quickly can users reach the model, and at what cost per request.” That is when proximity, carrier options, and SLA coverage matter more than raw density. Deloitte’s AI compute analysis projects the share of infrastructure investment going to inference will grow steadily as the industry’s centre of gravity moves from building AI to deploying it.
If you’re evaluating colocation infrastructure for AI inference workloads in Indonesia, talk to the Digital Edge Indonesia team about power density, connectivity options, and capacity planning at EDGE1, EDGE2, and CGK Campus.