
RL for Resource Allocation: Combinatorial Multi-armed Bandits

December 14, 2025

Imagine you're at a casino where you can only play one game: slot machines. These slot machines are a bit special: each one draws rewards from a fixed underlying distribution. Your objective is simple: you have a set of arms, each with an unknown reward distribution, and at every round you pull one arm and observe a reward. Over time, your goal is to walk out with the largest total reward possible.

Formally, suppose we have $K$ arms. Pulling arm $i$ at time $t$ yields a reward

$$r_{i,t} \sim \mathcal{D}_i,$$

where $\mathcal{D}_i$ is unknown. The objective is to choose arms $a_t \in \{1,\dots,K\}$ to maximize

$$\sum_{t=1}^T r_{a_t,t}.$$

The difficulty lies in the fact that you do not know the reward distributions in advance. You have to learn them online while still making good decisions. This is where the classic exploration–exploitation tradeoff comes in, and there are several standard ways to handle it. $\epsilon$-greedy mostly plays the arm that currently looks best, but with some probability $\epsilon > 0$ it explores a random arm instead. UCB is more systematic and optimistic, favoring arms that are uncertain until enough evidence is gathered. Thompson Sampling takes a probabilistic view, sampling from its belief about each arm’s reward and playing the arm that looks best under that sample.
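
To make these strategies concrete, here is a minimal Python sketch of all three on a toy Gaussian bandit. The arm means, noise level, and exploration parameters are invented for illustration; this is a sketch, not a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 5000
true_means = rng.uniform(0, 1, K)                 # unknown to the learner

def pull(i):
    """One noisy reward draw r_{i,t} ~ D_i (Gaussian here, for illustration)."""
    return rng.normal(true_means[i], 0.1)

def eps_greedy(eps=0.1):
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        if rng.random() < eps:
            i = int(rng.integers(K))                          # explore a random arm
        else:
            i = int(np.argmax(sums / np.maximum(counts, 1)))  # exploit the current best
        r = pull(i)
        counts[i] += 1; sums[i] += r; total += r
    return total

def ucb1():
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        if counts.min() == 0:
            i = int(np.argmin(counts))                        # play each arm once first
        else:
            bonus = np.sqrt(2 * np.log(t + 1) / counts)
            i = int(np.argmax(sums / counts + bonus))         # optimism under uncertainty
        r = pull(i)
        counts[i] += 1; sums[i] += r; total += r
    return total

def thompson():
    # Gaussian rewards with a loose Gaussian belief over each arm's mean.
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        post_mean = sums / np.maximum(counts, 1)
        post_std = 1.0 / np.sqrt(counts + 1)
        i = int(np.argmax(rng.normal(post_mean, post_std)))   # sample a belief, act on it
        r = pull(i)
        counts[i] += 1; sums[i] += r; total += r
    return total

print(eps_greedy(), ucb1(), thompson())
```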


From single arms to combinatorial actions

The classic bandit assumes that at each round you select exactly one arm. In many problems, this assumption just does not hold. Instead, you choose multiple arms at the same time, often under some structure or constraint.

This brings us to Combinatorial Multi-Armed Bandits (CMAB). Instead of choosing a single arm, you choose a super arm, which is a subset or structured combination of base arms.

Let $\mathcal{A} \subseteq 2^{[K]}$ denote the set of valid actions. At round $t$, you choose

$$A_t \in \mathcal{A},$$

and observe a reward that depends on all arms in $A_t$. A common assumption is additive reward:

$$R_t(A_t) = \sum_{i \in A_t} r_{i,t},$$

though more general reward models are possible.

The problem is that $|\mathcal{A}|$ can be exponentially large. Even if learning each arm is easy, choosing among all valid combinations is not.
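
To get a feel for the scale, here is a small sketch assuming a purely illustrative "pick at most $m$ arms" constraint (the sizes are made up): enumerating super arms one by one is hopeless, but the additive structure means that, for a given set of per-arm estimates, the best super arm is simply the top-$m$ arms.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
K, m = 50, 10                                  # illustrative sizes
est_means = rng.uniform(0, 1, K)               # current per-arm reward estimates

# Under a toy "pick at most m arms" constraint, the number of valid super arms
# is already astronomical:
n_super_arms = sum(comb(K, j) for j in range(1, m + 1))
print(f"{n_super_arms:,} valid super arms")    # ~1.3e10 for K=50, m=10

# Yet with an additive reward R(A) = sum_{i in A} r_i, the best super arm under
# the current estimates is just the m arms with the largest estimated means:
best_super_arm = np.argsort(est_means)[-m:]
print(sorted(best_super_arm.tolist()))
```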


Knapsack constraints and resource allocation

One of the most useful CMAB variants (I know, right? 😂) is the knapsack-constrained setting. Here, each arm has an associated cost, and every action must respect a total budget.

Discrete setting

Suppose each arm $i$ has cost $c_i$, and the total budget is $Q$. A valid action must satisfy

$$\sum_{i \in A_t} c_i \le Q.$$

This already looks like a knapsack problem. But we can go one step further. Instead of deciding whether to pick an arm or not, we decide how much resource to allocate to each arm.

Let $a_{i,t} \in \mathbb{Z}_{\ge 0}$ denote the amount of resource allocated to arm $i$ at time $t$. The constraint becomes

$$\sum_{i=1}^K a_{i,t} \le Q,$$

and the reward now depends on the allocation vector

$$a_t = (a_{1,t}, \dots, a_{K,t}).$$

This is exactly the structure of resource allocation problems: a limited budget, discrete decisions, and uncertain returns. CMAB provides a clean modeling framework for this.

However, this formulation introduces a serious challenge: the combinatorial action space explodes. Each arm can be assigned any integer amount up to $Q$, so the number of feasible allocation vectors grows on the order of $\binom{Q+K}{K}$. Even for modest values of $Q$ and $K$, this quickly becomes intractable.
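
That count comes from a standard stars-and-bars argument. A short sketch (with arbitrarily chosen values of $K$ and $Q$) verifies the formula by brute force and shows how quickly it grows:

```python
from itertools import product
from math import comb

def count_allocations(K: int, Q: int) -> int:
    """Number of integer vectors (a_1, ..., a_K) with a_i >= 0 and sum <= Q."""
    return comb(Q + K, K)                      # stars and bars, with a slack variable

# Brute-force check on a small instance (K and Q chosen arbitrarily):
K, Q = 4, 6
brute = sum(1 for a in product(range(Q + 1), repeat=K) if sum(a) <= Q)
assert brute == count_allocations(K, Q) == 210

# The count explodes even for modest budgets and arm counts:
for K, Q in [(5, 10), (10, 20), (20, 50)]:
    print(f"K={K}, Q={Q}: {count_allocations(K, Q):,} feasible allocations")
```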

Continuous setting

As you can imagine, it does not get any easier in the continuous setting. Here, we assume the total resource budget $Q$ is continuous, meaning each arm can receive any amount in $[0, Q]$ rather than an integer allocation. As a result, the action space is no longer finite but infinite.

Formally, instead of choosing a discrete allocation vector, we choose

$$a_t \in \mathcal{A}_c = \left\{ a \in \mathbb{R}_{\ge 0}^K \;:\; \sum_{i=1}^K a_i \le Q \right\}.$$

This immediately breaks the standard CMAB framework, since we now have an uncountably infinite number of base arms.

To make the problem tractable, most approaches rely on discretization. The idea is simple: approximate the continuous action space $\mathcal{A}_c$ with a finite discretized version $\tilde{\mathcal{A}}_c$. If the reward functions are well-behaved, for example Lipschitz continuous in the allocated resource, then this approximation introduces only a controlled error.

Under this view, regret naturally decomposes into two parts: a learning regret from operating on the discretized action space, and a discretization error from approximating the true continuous problem. Finer discretization reduces approximation error but increases the learning burden, leading to a fundamental tradeoff between statistical efficiency and computational tractability. In practice, this makes the continuous setting significantly more challenging than the discrete case, and careful design choices are required to balance exploration, approximation accuracy, and runtime.
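
Here is a minimal sketch of that discretize-then-learn idea, assuming (purely for illustration) a per-arm step size $\delta = Q/m$ and rewards that are $L$-Lipschitz in the allocation; the specific constants are made up.

```python
import numpy as np
from itertools import product

K, Q, m = 3, 1.0, 4                 # small, purely illustrative values
delta = Q / m                       # grid step size

# Discretized action set: each arm's allocation restricted to multiples of delta,
# keeping only vectors that respect the budget.
grid = [i * delta for i in range(m + 1)]
discrete_actions = [a for a in product(grid, repeat=K) if sum(a) <= Q + 1e-12]
print(len(discrete_actions), "discretized allocations")   # 35 for K=3, m=4

# If each arm's expected reward is L-Lipschitz in its allocation (an assumption),
# rounding a continuous allocation down to the grid moves each coordinate by at
# most delta, so the total approximation error is at most K * L * delta.
L = 2.0                             # hypothetical Lipschitz constant
print("worst-case discretization error <=", K * L * delta)
```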


Different Approaches in the Literature

CUCB-Style Algorithms (CUCB-DRA, CUCB-CRA) [1, 2]

The idea here is actually very simple. First, the space of possible allocations (the super-arms) is discretized. Each discretized allocation is then treated as its own arm. The algorithm keeps track of two quantities for each super-arm:

  1. its empirical mean reward, and
  2. how many times it has been played.

Using these, an upper confidence bound is constructed for every super-arm. At each round, the algorithm selects the allocation with the highest upper bound. The selection step itself can be viewed as solving a (possibly approximate) multi-choice knapsack problem, since we are choosing how to distribute a limited budget across arms.
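
The following sketch implements that loop for a tiny discrete allocation problem. It is not the exact CUCB-DRA pseudocode from [1]: the reward model is invented, and the selection step is a brute-force maximization over all super-arms, which only works because the instance is tiny (a knapsack-style solver would be needed at scale).

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K, Q, T = 3, 5, 3000

# Invented environment: expected reward of giving q units to arm i, with
# diminishing returns; the learner only ever sees noisy samples.
def expected_reward(i, q):
    return (i + 1) * (1 - np.exp(-0.5 * q))

def sample_reward(alloc):
    return sum(expected_reward(i, q) + 0.1 * rng.normal() for i, q in enumerate(alloc))

# Every feasible integer allocation is treated as its own super arm.
super_arms = [a for a in product(range(Q + 1), repeat=K) if sum(a) <= Q]
counts = np.zeros(len(super_arms))
means = np.zeros(len(super_arms))

for t in range(T):
    if counts.min() == 0:
        j = int(np.argmin(counts))                         # play each super arm once
    else:
        ucb = means + np.sqrt(2 * np.log(t + 1) / counts)  # optimistic index per super arm
        j = int(np.argmax(ucb))                            # brute-force argmax; a knapsack
                                                           # solver would be used at scale
    r = sample_reward(super_arms[j])
    counts[j] += 1
    means[j] += (r - means[j]) / counts[j]                 # incremental mean update

print("most promising allocation so far:", super_arms[int(np.argmax(means))])
```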

This approach is appealing because it is conceptually clean and comes with strong theoretical guarantees. However, it struggles in practice because it scales poorly: as the budget and the number of arms grow, the number of super-arms explodes, making both learning and decision-making increasingly brittle.

There are three main issues:

  • Combinatorial over-optimism. In large action spaces, UCB-style optimism becomes a liability. With many super-arms that are rarely played, the algorithm remains overly optimistic about poor allocations and is repeatedly drawn toward them, leading to inefficient exploration and poor empirical performance.

  • Monotonicity assumptions. These methods typically assume that rewards are monotonic in the allocated resource. While this simplifies analysis, it is often unrealistic in real-world settings where diminishing returns, saturation effects, or threshold behaviors are common.

  • Independence across super-arms. Each allocation is treated as an independent arm, ignoring the strong correlations between similar allocations. In practice, allocating slightly more or less resource to the same arm should yield similar rewards, but CUCB-style methods fail to exploit this structure.

Correlated combinatorial bandits for online resource allocation

A key limitation of the CUCB-style methods [1, 2] is that they treat different allocations as independent, even though in practice they are often highly correlated. For example, allocating slightly more or slightly less resource to the same arm should produce similar rewards.

To address this, Gupta et al. [3] introduce correlated combinatorial bandits for online resource allocation. The main idea is to explicitly model the correlations between rewards obtained under different budget allocations. Instead of learning each allocation independently, the algorithm shares information across related allocations.

By exploiting these correlations, their proposed Correlated-UCB algorithm can achieve significantly lower regret than correlation-agnostic methods. In some structured settings, it even achieves bounded regret, which is a dramatic improvement over the logarithmic regret guarantees typical in standard CMAB approaches.

This line of work shows that structural assumptions about the reward function can fundamentally change what is achievable in online resource allocation.
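
To build intuition for what "sharing information across related allocations" buys you, here is a toy sketch. It is not Gupta et al.'s Correlated-UCB: it simply assumes (as a strong, purely illustrative assumption) that arm $i$'s expected reward at allocation level $q$ factors as $\mu_i \, g(q)$ for a known shape $g$, so a single observation at any level updates the estimate used for every level of that arm.

```python
import numpy as np

rng = np.random.default_rng(0)
K, Q = 4, 10

def g(q):
    """Assumed known shape of the reward curve (illustrative)."""
    return 1 - np.exp(-0.3 * q)

true_scale = rng.uniform(0.5, 2.0, K)   # unknown per-arm scale mu_i

def observe(i, q):
    """One noisy observation of arm i at allocation level q."""
    return true_scale[i] * g(q) + 0.05 * rng.normal()

# Shared estimates: every observation of arm i, at any level q > 0, is converted
# into an estimate of mu_i, so it informs *all* allocation levels of that arm.
counts = np.zeros(K)
scale_est = np.zeros(K)

for _ in range(200):
    i, q = rng.integers(K), rng.integers(1, Q + 1)   # arbitrary exploration policy
    mu_sample = observe(i, q) / g(q)                 # back out an estimate of mu_i
    counts[i] += 1
    scale_est[i] += (mu_sample - scale_est[i]) / counts[i]

# Estimated expected reward for every (arm, level) pair, without having played most of them:
est_reward = np.outer(scale_est, g(np.arange(Q + 1)))
print(np.round(est_reward, 2))
```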

SEQUOIA

More recently, deep reinforcement learning methods such as SEQUOIA [4] have been proposed to handle large-scale combinatorial decision-making problems. I will talk about this in more detail in a later post.


References

  1. J. Zuo and C. Joe-Wong, Combinatorial Multi-Armed Bandits for Resource Allocation, 2021.
  2. W. Chen, Y. Wang, and Y. Yuan, Combinatorial Multi-Armed Bandit: General Framework and Applications, ICML 2013.
  3. S. Gupta, J. Zuo, C. Joe-Wong, G. Joshi, and O. Yağan, Correlated Combinatorial Bandits for Online Resource Allocation, MobiHoc 2022.
  4. L. Xu, B. Wilder, E. B. Khalil, and M. Tambe, Reinforcement Learning with Combinatorial Actions for Coupled Restless Bandits, arXiv preprint arXiv:2503.01919, 2025.