Stochastic Dominance and Risk-Aware Objectives

When the objective depends on the whole return distribution rather than just its mean, stochastic dominance becomes a useful way to compare policies. Instead of asking only whether one policy has a larger expected return, we ask whether one return distribution should be preferred by a broad class of decision makers.

This note is also a compact companion to [1]. The goal here is to connect the standard stochastic-ordering definitions to the exact FSD-based reward-learning and risk-aware policy ideas used in that paper.

Let $X$ and $Y$ denote two random returns, with cumulative distribution functions $F_X$ and $F_Y$. A smaller CDF means more probability mass is shifted to the right, so larger returns become more likely.

Theorem 1: First-order dominance and monotone utility ([2]; [3])

For integrable random returns $X$ and $Y$, the following are equivalent:

  • The CDF criterion:

    $$F_X(t) \le F_Y(t) \quad \text{for all } t \in \mathbb{R}.$$

  • The utility criterion:

    $$\mathbb{E}[u(X)] \ge \mathbb{E}[u(Y)]$$

    for every increasing utility function $u$ for which the expectations exist.

So first-order stochastic dominance is exactly the preference order shared by all decision makers who simply prefer more return to less return.

Zeroth-, First-, and Second-Order Dominance

At the most conservative level, zeroth-order dominance compares the worst parts of the support. (The term zeroth-order stochastic dominance is not completely standardized; in this note, I use it in the informal optimization sense: a support-level or worst-case comparison, closer to robust dominance than to the classical first-/second-order hierarchy.)

If policy $\pi_1$ is never worse than policy $\pi_2$ in the relevant lower tail, then $\pi_1$ dominates $\pi_2$ at zeroth order. This viewpoint is closely related to maximizing a worst-case objective such as

$$\max_{\pi} \; \operatorname{ess\,inf} Z^{\pi},$$

where $Z^{\pi}$ denotes the return under policy $\pi$, or, more generally, to protecting the lower support of the return distribution. It is strong, but often too conservative because it cares heavily about rare bad outcomes.

Figure 1: CDF view of a zeroth-order style comparison: the red curve places mass farther into the bad left tail, while the blue curve avoids it.
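As a toy illustration of how conservative this view is, here is a small Python sketch (my own, not code from [1]) that scores two hypothetical sets of Monte Carlo return samples with a risk-neutral mean and with worst-case-style lower statistics; the sample distributions and the 5% quantile fallback are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Monte Carlo returns from two policies (illustrative only).
returns_a = rng.normal(loc=1.0, scale=0.2, size=10_000)   # steady, narrow
returns_b = rng.normal(loc=1.2, scale=2.0, size=10_000)   # higher mean, much wider

def lower_support(returns, quantile=0.0):
    """Worst-case-style score: the empirical minimum (quantile=0.0),
    or a low quantile as a less brittle proxy for the lower support."""
    return float(np.quantile(returns, quantile))

for name, r in [("A", returns_a), ("B", returns_b)]:
    print(f"policy {name}: mean={r.mean():.2f}  "
          f"min={lower_support(r):.2f}  q05={lower_support(r, 0.05):.2f}")
# A risk-neutral comparison prefers B; the zeroth-order (worst-case) view prefers A.
```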

First-order stochastic dominance (FSD) is the classical monotonicity notion:

$$X \succeq_{\mathrm{FSD}} Y \quad \iff \quad F_X(t) \le F_Y(t) \quad \text{for all } t \in \mathbb{R}.$$

Equivalently, every threshold is at least as favorable under $X$ as under $Y$: $\mathbb{P}(X > t) \ge \mathbb{P}(Y > t)$ for every $t$. Another equivalent statement is

$$\mathbb{E}[u(X)] \ge \mathbb{E}[u(Y)]$$

for every increasing utility function $u$. So FSD says that all decision makers who simply prefer more return to less return would weakly prefer $X$.

Figure 2: CDF view of FSD: the blue CDF stays below the red CDF at every threshold.
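The CDF criterion translates directly into an empirical check. Below is a minimal sketch (mine, not code from [1]) that tests FSD between two sets of return samples by comparing empirical CDFs on a pooled grid; the `tol` argument is an assumption added to absorb sampling noise.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """F(t): fraction of samples <= t, evaluated at each grid threshold."""
    samples = np.sort(np.asarray(samples))
    return np.searchsorted(samples, grid, side="right") / len(samples)

def fsd_dominates(x, y, tol=0.0):
    """True if the samples x first-order dominate the samples y,
    i.e. F_X(t) <= F_Y(t) + tol at every pooled threshold t."""
    grid = np.union1d(x, y)
    return bool(np.all(empirical_cdf(x, grid) <= empirical_cdf(y, grid) + tol))

# Sanity check: shifting every return upward gives FSD dominance.
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 5_000)
x = y + 0.5                       # pointwise larger returns
print(fsd_dominates(x, y))        # True
print(fsd_dominates(y, x))        # False
```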

Second-order stochastic dominance (SSD) relaxes FSD by allowing local crossings while controlling the cumulative downside:

$$X \succeq_{\mathrm{SSD}} Y \quad \iff \quad \int_{-\infty}^{t} F_X(s)\,ds \le \int_{-\infty}^{t} F_Y(s)\,ds \quad \text{for all } t \in \mathbb{R}.$$

This is equivalent to

$$\mathbb{E}[u(X)] \ge \mathbb{E}[u(Y)]$$

for every increasing concave utility function $u$. In other words, SSD is the preference order that matches all risk-averse expected-utility decision makers. (Concavity encodes diminishing marginal utility.) That is exactly why SSD appears whenever we want guarantees that are robust over a class of risk-averse objectives, rather than tailored to one hand-picked risk metric.

Theorem 2: Second-order dominance and risk aversion ([4]; [3])

For integrable random returns $X$ and $Y$, the following are equivalent:

  • The integrated-CDF criterion:

    $$\int_{-\infty}^{t} F_X(s)\,ds \le \int_{-\infty}^{t} F_Y(s)\,ds \quad \text{for all } t \in \mathbb{R}.$$

  • The utility criterion:

    $$\mathbb{E}[u(X)] \ge \mathbb{E}[u(Y)]$$

    for every increasing concave utility function $u$ for which the expectations exist.

This is why SSD is the canonical dominance notion for broad classes of risk-averse objectives.

Figure 3: CDF view of SSD without FSD: the curves cross, so FSD fails, but the blue distribution is less spread out and SSD still holds.
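The integrated-CDF criterion admits the same kind of sample-based check. The sketch below is again my own illustration (the two return distributions are assumptions): it accumulates the empirical CDFs over a pooled grid and compares the running integrals.

```python
import numpy as np

def integrated_cdf(samples, grid):
    """Approximate t -> integral of the empirical CDF from the left edge of
    the grid up to t, via a simple Riemann sum on the pooled grid."""
    samples = np.sort(np.asarray(samples))
    cdf = np.searchsorted(samples, grid, side="right") / len(samples)
    dt = np.diff(grid, prepend=grid[0])   # first cell has width zero
    return np.cumsum(cdf * dt)

def ssd_dominates(x, y, tol=1e-8):
    """True if samples x second-order dominate samples y: the running
    integral of F_X never exceeds that of F_Y (up to numerical tolerance)."""
    grid = np.union1d(x, y)
    return bool(np.all(integrated_cdf(x, grid) <= integrated_cdf(y, grid) + tol))

# X has a slightly higher mean and much less spread than Y, so the CDFs cross
# (no FSD), but X should still second-order dominate Y.
rng = np.random.default_rng(2)
x = rng.normal(0.1, 0.5, 20_000)
y = rng.normal(0.0, 2.0, 20_000)
print(ssd_dominates(x, y))   # True
print(ssd_dominates(y, x))   # False
```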

Connection to Distributional IRL

In [1], the expert return distribution $Z^E$ is the reference object, while a candidate policy $\pi$ induces a return distribution $Z^\pi$ under the learned stochastic reward model. The reward-learning stage asks that the expert distribution dominate the candidate one in the FSD sense, so the FSD-violation loss should be written as

$$\mathcal{L}_{\mathrm{FSD}}(\pi) = \int_{\mathbb{R}} \max\bigl\{0,\; F_{Z^E}(t) - F_{Z^\pi}(t)\bigr\}\, dt.$$

With this ordering, the loss is zero exactly when

$$F_{Z^E}(t) \le F_{Z^\pi}(t) \quad \text{for all } t, \qquad \text{i.e. } Z^E \succeq_{\mathrm{FSD}} Z^\pi.$$

So the objective is stronger than mean matching: once the expert dominates in FSD, the mean-dominance corollary follows automatically, giving

$$\mathbb{E}[Z^E] \ge \mathbb{E}[Z^\pi].$$
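
The corollary is a one-line consequence of the standard identity relating means and CDFs for integrable returns:

$$\mathbb{E}[Z^E] - \mathbb{E}[Z^\pi] = \int_{\mathbb{R}} \bigl(F_{Z^\pi}(t) - F_{Z^E}(t)\bigr)\, dt \;\ge\; 0,$$

since the integrand is nonnegative everywhere whenever $Z^E \succeq_{\mathrm{FSD}} Z^\pi$.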

Proposition 3: Quantile form of the FSD violation ([1])

The same loss can be written in quantile space as

$$\mathcal{L}_{\mathrm{FSD}}(\pi) = \int_0^1 \max\bigl\{0,\; F^{-1}_{Z^\pi}(\tau) - F^{-1}_{Z^E}(\tau)\bigr\}\, d\tau,$$

where $F^{-1}$ denotes the quantile function.

This is the algorithmic bridge used in the paper: once the loss is written with quantiles, Monte Carlo return samples and empirical order statistics become natural approximations.
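
A minimal Monte Carlo version of that bridge (a sketch of mine; the function name, quantile grid, and synthetic returns are assumptions rather than the implementation in [1]): with return samples in hand, empirical quantiles stand in for the quantile functions, and the violation loss becomes a hinge on their difference.

```python
import numpy as np

def fsd_violation(expert_returns, policy_returns, n_quantiles=256):
    """Sample-based estimate of the quantile-space FSD-violation loss:
    the average over tau of max(0, Q_policy(tau) - Q_expert(tau)),
    which is ~0 exactly when the expert's quantiles dominate everywhere."""
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles
    q_expert = np.quantile(expert_returns, taus)
    q_policy = np.quantile(policy_returns, taus)
    return float(np.mean(np.maximum(0.0, q_policy - q_expert)))

# Synthetic Monte Carlo returns, for illustration only.
rng = np.random.default_rng(3)
expert = rng.normal(1.0, 1.0, 10_000)
print(fsd_violation(expert, expert - 0.3))   # expert dominates -> ~0.0
print(fsd_violation(expert, expert + 0.3))   # candidate dominates -> ~0.3
```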

Relation to Risk-Aware Objectives

A risk-aware objective maps a full return distribution to a scalar score:

$$\rho : F_{Z^\pi} \;\longmapsto\; \rho(F_{Z^\pi}) \in \mathbb{R}.$$

The choice of $\rho$ determines which parts of the distribution we care about.

If $\rho(F_{Z^\pi}) = \mathbb{E}[Z^\pi]$, the objective is risk-neutral. Mean return alone can miss important structure: two policies with the same expectation may have very different tails, variances, or catastrophe probabilities.

If $\rho$ is monotone, then FSD is a minimal consistency requirement:

$$Z^{\pi_1} \succeq_{\mathrm{FSD}} Z^{\pi_2} \;\Longrightarrow\; \rho(F_{Z^{\pi_1}}) \ge \rho(F_{Z^{\pi_2}}).$$

(For reward maximization, monotonicity means that if one return is pointwise no smaller than another, its score should not decrease. Most sensible risk measures and utility-based objectives satisfy this.)

So FSD gives a preference-robust guarantee across a broad family of risk-aware objectives, including many distortion-based and utility-based criteria. That is exactly the logic used in [1]: reward learning enforces a strong distributional ordering, while policy learning can then optimize a chosen risk-aware summary of the learned return distribution.

Lemma 1: Monotone objectives respect FSD ([3])

Let $\rho$ be any monotone objective on return distributions, meaning that if one distribution shifts upward pointwise then its score cannot decrease. Then

$$Z^{\pi_1} \succeq_{\mathrm{FSD}} Z^{\pi_2} \;\Longrightarrow\; \rho(F_{Z^{\pi_1}}) \ge \rho(F_{Z^{\pi_2}}).$$

This lemma is simple but useful: FSD improvement is stronger than improvement for any one particular monotone risk-aware metric.

In the policy-learning step of [1], one convenient family is the distortion risk measures

$$\rho_g(Z^\pi) = \int_0^1 F^{-1}_{Z^\pi}(\tau)\, d\tilde g(\tau),$$

where $\tilde g(\tau) = 1 - g(1 - \tau)$ is the dual distortion. This lets one encode a chosen risk attitude without abandoning the return-distribution viewpoint.
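
As a concrete sketch (mine, not the paper's code), a distortion risk measure can be estimated from sample quantiles by re-weighting quantile levels; the discretization below and the lower-tail CVaR-style weights are illustrative assumptions.

```python
import numpy as np

def distortion_risk(returns, weights):
    """rho_g(Z) ~ sum_k w_k * Q_Z(tau_k): the weights w_k (summing to 1)
    discretize the dual distortion measure over a uniform grid of levels."""
    taus = (np.arange(len(weights)) + 0.5) / len(weights)
    return float(np.dot(weights, np.quantile(returns, taus)))

def cvar_weights(n, alpha=0.1):
    """Lower-tail CVaR_alpha as a distortion: uniform weight on the lowest
    alpha fraction of quantile levels, zero elsewhere."""
    taus = (np.arange(n) + 0.5) / n
    w = np.where(taus <= alpha, 1.0, 0.0)
    return w / w.sum()

rng = np.random.default_rng(4)
returns = rng.normal(1.0, 1.0, 50_000)
n = 1_000
print(distortion_risk(returns, np.full(n, 1.0 / n)))    # uniform weights ~ mean
print(distortion_risk(returns, cvar_weights(n, 0.1)))   # pessimistic lower-tail score
```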

Proposition 4: Dominance for every DRM implies FSD ([1])

If

$$\rho_g(Z^{\pi_1}) \ge \rho_g(Z^{\pi_2})$$

for every distortion function $g$, then

$$Z^{\pi_1} \succeq_{\mathrm{FSD}} Z^{\pi_2}.$$

So one fixed distortion risk measure captures one particular risk attitude, while FSD is the stronger uniform statement across the entire distortion family.

If $\rho(F_{Z^\pi}) = \mathbb{E}[u(Z^\pi)]$ for an increasing concave utility $u$, then the objective is risk-averse in the classical expected-utility sense. In that case SSD is the natural dominance notion:

$$Z^{\pi_1} \succeq_{\mathrm{SSD}} Z^{\pi_2} \;\Longrightarrow\; \mathbb{E}[u(Z^{\pi_1})] \ge \mathbb{E}[u(Z^{\pi_2})].$$

This is why SSD is often the right language when discussing safety margins, downside sensitivity, or conservative policy selection. It does not pick one single risk attitude; instead, it guarantees preference for an entire family of risk-averse utilities.

Zeroth-order dominance is even more conservative. It lines up with robust-control or worst-case objectives that care primarily about the left edge of the distribution. Such objectives can be desirable in safety-critical settings, but they may ignore large improvements in typical performance.

Worked Examples

Here are a few concrete comparisons that make the hierarchy easier to remember.

Example 1: Same mean, different downside. Take $X = 0$ almost surely and let $Y$ equal $-1$ with probability $1/4$, $0$ with probability $1/2$, and $+1$ with probability $1/4$. Then $\mathbb{E}[X] = \mathbb{E}[Y]$, but $X$ SSD-dominates $Y$ because $Y$ is a mean-preserving spread of $X$. A risk-neutral objective is indifferent, while any concave-utility objective prefers $X$.
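
A quick numeric check of Example 1 (a sketch using the illustrative values above; the exponential utility is just one arbitrary increasing concave choice):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.zeros(100_000)                                                  # X = 0 almost surely
y = rng.choice([-1.0, 0.0, 1.0], p=[0.25, 0.5, 0.25], size=100_000)   # mean-preserving spread

def u(z):
    """An increasing concave utility (one arbitrary choice)."""
    return 1.0 - np.exp(-z)

print(x.mean(), y.mean())               # means agree: risk-neutral is indifferent
print(u(x).mean(), u(y).mean())         # E[u(X)] > E[u(Y)]: concave utility prefers X
```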

Example 2: Better tail quantiles without changing everything else. Suppose two policies have nearly identical central mass, but one policy reduces the probability of a catastrophic return relative to the other. Even if the means stay close, a tail-sensitive objective such as CVaR or a lower-quantile objective may improve sharply. If the entire CDF shifts downward, the improvement is stronger than a CVaR improvement alone: it is an FSD improvement.

Example 3: Robust preference versus average preference. Let $X$ have support in, say, $[1, 2]$ and $Y$ have support in $[0, 10]$. If your application is safety-critical, then the worst-case part of the support may matter more than the mean. A zeroth-order viewpoint will prefer $X$ because it lifts the floor, even if a more average-case objective might still prefer $Y$.

Example 4: RL interpretation. In reinforcement learning, two policies can have the same expected return but very different rollout distributions. A risk-neutral benchmark may see them as tied, SSD will prefer the policy with less dispersion, and a worst-case objective may prefer the policy with the better lower support. This is exactly why return-distribution modeling is often more informative than reporting only mean episodic reward.

Practical Takeaway

Risk-aware optimization and stochastic dominance are not the same thing, but they complement each other:

  • A risk-aware objective commits to one scalar summary of the return distribution, which makes it directly optimizable.
  • A dominance relation (zeroth-, first-, or second-order) certifies a preference shared by an entire class of objectives: worst-case, monotone, or risk-averse ones, respectively.

In practice, a useful pattern is to optimize a chosen risk-aware objective while checking whether the resulting policy improves the return distribution in a stronger dominance sense. When that happens, the policy is not only better for one metric, but better for a much wider set of decision makers.

Bibliography

  • [1] F. Wu, Y. Zhao, and A. Wu, “Distributional Inverse Reinforcement Learning.” [Online]. Available: https://arxiv.org/abs/2510.03013
  • [2] J. Hadar and W. R. Russell, “Rules for Ordering Uncertain Prospects,” The American Economic Review, vol. 59, no. 1, pp. 25–34, 1969.
  • [3] M. Shaked and J. G. Shanthikumar, Stochastic Orders. Springer, 2007. doi: 10.1007/978-0-387-34675-5.
  • [4] M. Rothschild and J. E. Stiglitz, “Increasing Risk: I. A Definition,” Journal of Economic Theory, vol. 2, no. 3, pp. 225–243, 1970, doi: 10.1016/0022-0531(70)90038-4.