Next: Modeling Other Agents
Up: Dynamic Multiagent Systems
Previous: General framework
We further assume that all states are observable and that each agent knows its own reward
function. The only unknown is $a^{-i}$, the actions to be
taken by the other agents. The state evolves according to $s_{t+1} = p(s_t, a_t)$,
where $s_{t+1}^i = p^i(s_t^i, a_t)$. That is, agent $i$'s
state at $t+1$ depends on the agent's current state and the current
joint action. We assume that the transition function $p$ is
deterministic. We allow both the state and action spaces to be real (i.e.,
continuous) domains.
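As an illustration of the transition structure described above, here is a minimal Python sketch. The linear rule inside `p_i`, the `coupling` parameter, and the three-agent setup are assumptions made for the example and are not taken from the text; only the signature (own state plus joint action, deterministic, real-valued) follows the description.

```python
from typing import Callable, List, Sequence

LocalTransition = Callable[[float, Sequence[float]], float]

def make_local_transition(i: int, coupling: float = 0.5) -> LocalTransition:
    """Build a hypothetical deterministic p^i for agent i (toy linear rule)."""
    def p_i(s_i: float, joint_action: Sequence[float]) -> float:
        own = joint_action[i]
        # Sum of the other agents' actions a^{-i}.
        others = sum(a for j, a in enumerate(joint_action) if j != i)
        return s_i + own - coupling * others
    return p_i

def step(state: Sequence[float],
         joint_action: Sequence[float],
         transitions: Sequence[LocalTransition]) -> List[float]:
    """Joint transition p: apply each agent's p^i to its own local state."""
    return [p_i(s_i, joint_action) for p_i, s_i in zip(transitions, state)]

transitions = [make_local_transition(i) for i in range(3)]
state = [0.0, 1.0, 2.0]           # s_t, one real-valued local state per agent
joint_action = [1.0, 0.5, -0.5]   # a_t, one real-valued action per agent
next_state = step(state, joint_action, transitions)
```

Each agent's next local state changes when any other agent changes its action, which is why $a^{-i}$ is the quantity the agent has to reason about.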
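The agent's objective is to maximize the sum of its rewards over the horizon $T$:
\begin{displaymath}
\max \sum_{t=1}^{T} r_{t}^{i}.
\end{displaymath}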
Agent $i$'s reward $r_t^i$ is the
improvement in utility at time $t$: $r_t^i = U^i(s_t^i) - U^i(s_{t-1}^i)$.
Note that the agent's utility is a function of its
local state. Since
\begin{displaymath}
\sum_{t=1}^{T} r_{t}^{i} = \sum_{t=1}^{T} \left[ U^{i}(s_{t}^{i}) -
U^{i}(s_{t-1}^{i})\right] = U^{i}(s_{T}^{i}) - U^{i}(s_{0}^{i}), \qquad (2)
\end{displaymath}
and $U^i(s_0^i)$ is a constant independent of $i$'s
actions, we see that maximizing the sum of rewards is equivalent to
maximizing final utility.
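The telescoping argument can be checked numerically. The sketch below assumes a hypothetical quadratic utility and an arbitrary state trajectory, neither of which comes from the text; it only confirms that the summed per-period improvements equal the difference between final and initial utility.

```python
def utility(s: float) -> float:
    """Hypothetical utility U^i over the local state (peak at s = 3)."""
    return -(s - 3.0) ** 2

# Arbitrary illustrative trajectory s_0, s_1, ..., s_T for one agent.
states = [0.0, 1.0, 2.5, 2.0, 3.5]

# r_t = U(s_t) - U(s_{t-1}) for t = 1..T.
rewards = [utility(states[t]) - utility(states[t - 1])
           for t in range(1, len(states))]

total_reward = sum(rewards)
final_gain = utility(states[-1]) - utility(states[0])
```

Here `total_reward` and `final_gain` agree even though one intermediate reward is negative, since the intermediate utilities cancel in the sum.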
Agent $i$'s reward in each period,
$r_{t+1}^i = U^i(p^i(s_t^i, a_t)) - U^i(s_t^i)$,
depends on the actions of the other agents, $a_t^{-i}$.
The object of the learning problem is to predict these
actions, explicitly or implicitly, so that the agent can
effectively choose its own.
To simplify the learning problem, we assume that the agent makes its
decisions myopically, that is, considering only the current
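time period. To maximize the current reward at $t$ (2), the
agent solves
\begin{displaymath}
\max_{a_{t}^{i}} \; U^{i}\left(p^{i}(s_{t}^{i}, (a_{t}^{i}, a_{t}^{-i}))\right) - U^{i}(s_{t}^{i}).
\end{displaymath}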
A generally applicable (albeit not optimal) approach is to form an
estimate of the other agents' actions $a_t^{-i}$ and solve the
problem as if the estimate were correct. The entire decision then
reduces to the question of how to form estimates.
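The estimate-then-optimize approach can be sketched as follows. This is a minimal illustration under stated assumptions: the transition `p_i`, the utility `U_i`, and the grid of candidate actions are hypothetical, and grid search stands in for whatever optimizer one would use over a continuous action space.

```python
from typing import Callable, Sequence

def myopic_action(s_i: float,
                  others_estimate: Sequence[float],
                  p_i: Callable[[float, float, Sequence[float]], float],
                  U_i: Callable[[float], float],
                  candidates: Sequence[float]) -> float:
    """Pick the own action that maximizes next-period reward,
    treating the estimate of the others' actions as if it were correct."""
    def predicted_reward(a_i: float) -> float:
        return U_i(p_i(s_i, a_i, others_estimate)) - U_i(s_i)
    return max(candidates, key=predicted_reward)

# Hypothetical local transition: own action raises the state,
# the others' actions lower it.
def p_i(s_i: float, a_i: float, a_others: Sequence[float]) -> float:
    return s_i + a_i - 0.5 * sum(a_others)

# Hypothetical utility with its peak at s = 3.
def U_i(s_i: float) -> float:
    return -(s_i - 3.0) ** 2

# Discretized action grid from -5.0 to 5.0 in steps of 0.1 (an assumption).
candidates = [k / 10 for k in range(-50, 51)]

best = myopic_action(1.0, [0.4, 0.6], p_i, U_i, candidates)
```

With these toy functions the chosen action moves the predicted next state onto the utility peak. If the estimate of the others' actions is wrong, the realized reward can differ from the predicted one, which is why the quality of the estimate drives the quality of the decision.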
Junling Hu
4/27/1999