Optimal Subset Selection in C++ with OR-Tools Constraint Solvers

Published on: 2023-06-12

Summary

This article introduces a common yet complex problem: finding an optimal subset from a given set of items. To be considered valid, a subset must comply with several rules. Likewise, the optimal subset is determined by a defined value function. To illustrate this problem, a simple knapsack-like problem is defined to help explore 4 potential solutions. These approaches are: brute force, greedy algorithms, Artificial Intelligence (AI), and Constraint Programming (CP) Solvers. The problem is divided into 2 parts: a simple version involving only the first 2 constraints and the optimization criteria, and the full version, which includes the remaining 4 constraints (3 - 6). For each solution, a complete working code sample is provided along with a Docker image to simplify the build process. We conclude by discussing the strengths and weaknesses of each solution, with a particular focus on the merits of CP Solvers.

Contents

Introduction

Partitioning is the action of dividing a set of items into smaller subsets. If one were to play a team sport such as baseball, the players would be partitioned into two teams. There are various approaches for partitioning entities into subsets. Naturally, one may wish to partition the sets to retrieve the best, or optimal, subset. To find an optimal subset one would need to consider some criteria to distinguish optimal from sub-optimal subsets. Furthermore, one may wish to exclude certain subsets that violate some functional or business constraint. Going back to the baseball example, a valid team must include a minimum number of players, whereas an optimal team would be one with players who perform the best together compared to other teams. Similarly, an invalid team would consist of too few players. We will discuss several potential solutions including: brute force, greedy algorithms, AI, and CP Solvers. In the code examples for CP Solvers the OR-Tools library is used.

Code Examples

As mentioned, this post will contain several code examples. The code examples within the post will often be snippets that highlight a specific section. The full code for all the examples referenced in this post is available on GitHub. Likewise, a Docker container is also available to simplify the build environment necessary for running the examples. To run any example, first build and run the docker image:


docker build . -t cpsolver:latest
docker run --name cpsolver \
    --mount type=bind,source="$(pwd)"/src,target=/cpp/src/cpsolver \
    cpsolver:latest

Then in a separate terminal launch an interactive console for the docker image:


docker exec -it cpsolver bash

Then navigate to the example folder (e.g. e1), build, and run using:


cd e1/
mkdir -p build
cd build
cmake ..
make
./greedy_example_1

The output of the first example should be:


Valid Parameters
Value: 20
Weight: 19
Volume: 0

Chosen:
[
    {
        "value": 10,
        "volume": 0,
        "weight": 10
    },
    {
        "value": 10,
        "volume": 0,
        "weight": 9
    }
]

A data generator, written in Crystal, is also included in the repository along with several example data sets. This data generator is similar to the one discussed in the AI section. It can be used to produce an example U based on defined criteria. To generate a new example set of items, run the following within the docker image:


cd generator/
shards install
crystal build main.cr
./main > my_items.json

The generated data can then be used by any of the code examples by passing it in as an argument:


cd e1/build
./greedy_example_1 ../../generator/my_items.json ../../generator/params.json

The first argument is the generated items JSON file, whereas the second is the constraint thresholds JSON file. An example generated data JSON file has been included as generator/items.json along with a sample constraint thresholds file as generator/params.json.

Background

The form of this problem involves a universe of entities U. Each entity in U has z comparable attributes associated with it, such that each attribute of one entity is comparable in type and format to the corresponding attribute of another entity.

... ... ...

A possible partition of entities from U is defined as S_i, where S would be the set of all possible subsets.

... ... ...

Next, the optimization objective can be defined as Q and finally, the list of constraints as C.

...

Problem Definition

The example problem is defined as follows: we have a warehouse full of packages that need to be shipped. Unfortunately, we only have 1 vehicle to perform the shipping and therefore we need to choose carefully which items to place into the vehicle. Ideally, we can choose an optimal shipment. There are 5 attributes for each package:

  • Value — The profit of shipping the package.
  • Weight — The package weight.
  • Volume — The size of the package. For simplicity, if there is sufficient volume available, the package will fit.
  • Product Type — The type of products classified into a limited set of types such as: toy, electronic, clothing, food, etc.
  • Manufacturer — The name of the manufacturer for the product.

Each shipment must adhere to a combination of business and practical constraints:

  1. The total weight of the packages on the truck (within S_i) must not exceed the maximum weight (CW_{max}). ...
  2. The total volume of the packages on the truck (within S_i) must not exceed the maximum volume (CZ_{max}). ...
  3. The minimum value of a shipment must exceed CV_{min}. ...
  4. The highest value package within the shipment cannot exceed CPV_{max} as a percentage of the total value of the group. This helps ensure the shipment is not biased towards only single high value items. ...
  5. The aggregated value per product type cannot exceed CPT_{max} as a percentage of the total value of the group. This helps ensure the shipment contains various types of products. Here PT_i is the list of groupings of S_i by product type. ... ... ...
  6. The aggregated value per manufacturer cannot exceed CPM_{max} as a percentage of the total value of the group. This helps ensure the shipment contains products from various manufacturers. Here M_i is the list of groupings of S_i by manufacturer. ... ... ...

Finally, the optimization objective is to maximize the value of the shipment and can be written formally as:

...

Brute Force

One may be tempted to try a brute force search for a solution. This would involve simply iterating through all possible combinations until the best solution is found. However, this quickly becomes infeasible, as the number of possible solutions is vast. To illustrate this, let us presume the only constraint is that the truck can carry no more than 10 items. To use "brute force" to find an optimal solution, each possible solution would need to be generated and compared. Given 100 packages, the number of possible solutions would be:

...

Naturally, this would grow quickly given more packages. In the benchmarks below, the minimum number of packages was 10,000. Likewise, while it may be difficult to calculate the number of valid combinations under more complex constraints, a brute force method would still require generating them to check their validity. Therefore this method is not worthy of further consideration given it is impractical.

Greedy Algorithms

A greedy algorithm is a selection based algorithm that repeatedly attempts to choose the best candidate from U. A test is applied to ensure the set remains valid with the addition of the candidate. If the candidate is valid, it is placed into the current set. This process is repeated until there are no more candidates. Candidates may run out in several ways: all available candidates have been included or rejected, or some other limiting principle (such as time) has been reached. The pseudocode for this would be the following:


U = {...}
S_t = {}
loop
  u := best candidate from U
  if valid candidate u for S_t
    add u into S_t
    remove u from U
  else
    reject u and remove u from U
  end if

  if |U| < 1 or total_run_time() > max_run_time
    break
  end if
end loop

Of course, how the best candidate is selected from U is implementation specific. Likewise, a potential limiting factor was defined:

  • The total run time to some arbitrary max_run_time.

This limiting factor is optional and may be replaced with alternative ones or removed entirely. However, as noted above, the candidate selection process is the most important part as it remains implementation specific. If the selection process is complex (e.g. O(n^2)), that cost is paid on every selection, negatively impacting performance, since an expensive search would need to be performed for each candidate, at worst n times.

A common solution to this problem is to rank the candidates and then sort them from best to worst. With the sorted list, selecting the best candidate involves simply choosing the next item in the ordered list. Since sorting techniques are well known and can be performed efficiently, one is able to sort once and then simply iterate through the list. The general solution can be updated for cases where sorting is applicable:


U = {...}
S_t = {}
sort(U)
for each u in U
  if valid candidate u for S_t
    add u into S_t
    remove u from U
  else
    skip to the next u
  end if
  if run_time > max_run_time
    break
  end if
end for

Simple Greedy Example

Now, considering the problem defined above, it is time to implement an actual solution. For simplicity, only the first 2 constraints and the optimization function will be considered initially. First, the primary types are defined as:


using Amount = std::uint64_t;

struct Item {
  Amount value;
  Amount weight;
  Amount volume;
  string manufacturer;
  string type;
};

using Group = vector<Item>;

struct Parameters {
  Amount max_weight;
  Amount max_volume;
};

Item represents a single item in U or S_i, whereas Parameters simply holds the various maximum or minimum constraint thresholds. Each numerical value is represented as a 64-bit integer. If needed, floating point values could be substituted in.

Let us next restate the two constraints.

  1. Limit the cumulative weight of S_i. ...
  2. Limit the cumulative volume of S_i. ...

These constraints can be written as:


constexpr Amount sum_weight(const Group & selected) {
  return accumulate(selected.cbegin(), selected.cend(), static_cast<Amount>(0),
    [](const Amount & a, const Item & b) {
      return a + b.weight;
    }
  );
}

constexpr bool above_weight(const Item & item, const Group & selected, const Amount & max_weight)
{
  return sum_weight(selected) + item.weight > max_weight;
}

constexpr Amount sum_volume(const Group & selected) {
  return accumulate(selected.cbegin(), selected.cend(), static_cast<Amount>(0),
    [](const Amount & a, const Item & b) {
      return a + b.volume;
    }
  );
}

constexpr bool above_volume(const Item & item, const Group & selected, const Amount & max_volume)
{
  return sum_volume(selected) + item.volume > max_volume;
}

Next, the main loop can be defined. The input U is presumed to be correctly ordered from best to worst candidate. Each iteration, the best candidate is checked to ensure that adding the item to the set will adhere to the given constraints.


Group find_grouping(const vector<Item> & items, const Parameters & params) {

  Group selected;
  for(auto item : items)
  {
    if (above_weight(item, selected, params.max_weight) ||
        above_volume(item, selected, params.max_volume))
    {
      continue;
    }

    selected.push_back(item);
  }
  return selected;
}

If the candidate fails the constraints test, it is skipped and the next candidate is tested. This process is repeated until all items in U have been tested. With the main loop finished, the only remaining task is to ensure the items in U are sorted in the correct order of best to worst candidate. This leads to an obvious question: what makes a good candidate? One could pick at random; this may help by providing a unique selection order on each execution. However, such a selection process is unlikely to yield an optimal solution. The optimization objective, to maximize the shipment value, provides a good starting point for what constitutes an optimal candidate.

...

If higher value packages were considered first, the result would likely be a higher value shipment set S_i. The C++ standard library's sort function accepts a custom comparison function. To sort by highest value the following comparison function can be used:


constexpr bool sort_by_filter(const Item & a, const Item & b) {
  return a.value > b.value;
}

...

sort(items.begin(), items.end(), sort_by_filter);

This will order the items such that the highest valued items come first. However, what would the result be if a set of items contained the following entries:


[
  {
    "value": 10,
    "volume": 10,
    "weight": 10
  },
  {
    "value": 10,
    "volume": 1,
    "weight": 1
  },
  {
    "value": 10,
    "volume": 9,
    "weight": 9
  }
]

Given our sort works in-place, the first item would be selected first. If CW_{max} = 10, then the result would be just the first item, with a total value of 10. However, a more ideal subset exists: if the second and third items were grouped, the result would have a total value of 20. This means simple sorting by value is insufficient, and other parameters should be taken into consideration for more optimal groups to be formed. The following would be the ideal ordering from best to worst candidate:


[
  {
    "value": 10,
    "volume": 1,
    "weight": 1
  },
  {
    "value": 10,
    "volume": 9,
    "weight": 9
  },
  {
    "value": 10,
    "volume": 10,
    "weight": 10
  }
]

One could alternatively sort by each of the three parameters:


constexpr bool sort_by_filter(const Item & a, const Item & b) {
  // Highest value first.
  auto res = a.value <=> b.value;
  if (res < 0) {
    return false;
  } else if (res > 0) {
    return true;
  }

  // For equal values, lowest weight first.
  res = a.weight <=> b.weight;
  if (res < 0) {
    return true;
  } else if (res > 0) {
    return false;
  }

  // For equal values and weights, lowest volume first.
  res = a.volume <=> b.volume;
  if (res < 0) {
    return true;
  } else if (res > 0) {
    return false;
  }
  return false;
}

This ensures the volume and weight are taken into consideration in the ordering. However, this sorting can still lead to poor ordering in certain cases, such as the following:


[
  {
    "value": 10,
    "volume": 9,
    "weight": 9
  },
  {
    "value": 10,
    "volume": 10,
    "weight": 10
  },
  {
    "value": 9,
    "volume": 1,
    "weight": 1
  },
  {
    "value": 9,
    "volume": 3,
    "weight": 4
  },
  {
    "value": 8,
    "volume": 2,
    "weight": 1
  }
]

Using the same limit (CW_{max} = 10), the group would end up with a total value of 19. However, there exists a better grouping that would yield a total value of 26. The determination of a more ideal ordering algorithm is left to the motivated reader (perhaps value / (weight + volume)). While a more complex sorting method may help with this simplified example, the addition of the more complex constraints will certainly frustrate this effort.

The full code example can be found on GitHub and can be run using the following commands:


cd e1/
mkdir -p build
cd build
cmake ..
make
./greedy_example_1

The output of the first example should be:


Valid Parameters
Value: 20
Weight: 19
Volume: 0

Chosen:
[
    {
        "value": 10,
        "volume": 0,
        "weight": 10
    },
    {
        "value": 10,
        "volume": 0,
        "weight": 9
    }
]

Extended Greedy

Building on the previous example, the omitted constraints are added back in. The following are those constraints re-stated:

  3. The minimum value of a shipment must exceed CV_{min}. ...
  4. The highest value package within the shipment cannot exceed CPV_{max} as a percentage of the total value of the group. ...
  5. The aggregated value per product type cannot exceed CPT_{max} as a percentage of the total value of the group. Where PT_i is the list of groupings of S_i by product type. ...
  6. The aggregated value per manufacturer cannot exceed CPM_{max} as a percentage of the total value of the group. Where M_i is the list of groupings of S_i by manufacturer. ...

The list of parameters is extended to include the additional constraint thresholds. As most are percentages, these thresholds will be stored as doubles.


struct Parameters {
  Amount max_weight;
  Amount max_volume;
  Amount min_value;
  double high_value_max;
  double high_man_max;
  double high_type_max;
};

The 3rd constraint proves rather simple to implement. Since the items are sorted from best to worst candidate to maximize the value of the group S_i, the constraint need only be checked after S_i is created. This is because if this constraint is not satisfied by the highest valued group, then no other set would satisfy it either.


if(total < params.min_value)
{
  return {};
}

The next 3 constraints (4 - 6) present a greater challenge to implement. Looking at the 4th constraint, the numerator is max_{S_i}(s_i.value) and the denominator is sum_{j=1}^{|S_i|}(s_j.value); both are derived from the set S_i. The current max value could be maintained easily by recording the first value as the max and updating it whenever a higher value item is included in the group. However, there is an inherent flaw in attempting to implement these 3 constraints: for any practical constraint threshold (< 1.0), the first check of the constraint would fail. Once again, take constraint 4, which on the first candidate, say s_1, would be checked as follows:

...

Likewise, constraint 4 would continue to fail until sufficient items have been added such that the highest valued item no longer exceeds CPV_{max}. Furthermore, the best candidate is not always the item with the highest value. Therefore, at a later stage a new higher value item may be tested for inclusion and fail the constraint check, since its value as a percentage of the group exceeds CPV_{max}.

This also applies to constraints 5 and 6. However, unlike constraint 4, which may become valid again with the inclusion of more items, constraints 5 and 6 require the inclusion of certain kinds of items: specifically, items with different product types and manufacturers respectively.

One possible solution would be to use a grace period. This grace period would tentatively add entries to the set S_i and check l more items, each time checking whether the constraint remains violated. If the constraint remains violated after the grace period, the tentatively added items would be rejected and iteration would continue from one position after the initial item that failed the check (j - l + 1). Otherwise, if the set meets the constraints, the tentatively added items are kept and the process continues from the last added item.

This method could also apply to constraints 5 and 6. However, since the ordering does not take into consideration the product type or manufacturer, a longer grace period look ahead (l) may be necessary. This would greatly reduce efficiency and increase complexity.
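One speculative shape for such a grace period, in the same pseudocode style as before (the counters j and l follow the notation above; this is a sketch, not a tested design):

```
U = {...}
S_t = {}
sort(U)
j := 0
while j < |U|
  u := U[j]
  if u fails the hard constraints (1 - 2) for S_t
    j := j + 1
  else
    tentatively add u into S_t
    if percentage constraints (4 - 6) hold for S_t
      commit the tentative items
      j := j + 1
    else if more than l items are tentatively held
      reject all tentative items
      j := position after the first tentative item
    else
      j := j + 1
    end if
  end if
end while
```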

Since the look ahead already provides a less than desirable solution, a simple after-the-fact check will instead be used for each of these constraints. As a consequence, even if a valid solution is found, it may not be optimal. The implementation of a grace period remains speculative and is left for the motivated reader to investigate. Below is what our final greedy algorithm looks like.


Group find_grouping(const vector<Item> & items, const Parameters & params) {

  Group selected;

  std::unordered_map<string, Amount> product_manufacturer_map;
  std::unordered_map<string, Amount> product_type_map;
  Amount total = 0;

  for(auto item : items)
  {
    if (above_weight(item, selected, params.max_weight) ||
      above_volume(item, selected, params.max_volume))
    {
      continue;
    }

    Amount man_total = 0;
    auto product_manufacturer_value_it = product_manufacturer_map.find(item.manufacturer);
    if (product_manufacturer_value_it != product_manufacturer_map.end()) {
      man_total = product_manufacturer_value_it->second;
    }

    Amount type_total = 0;
    auto product_type_value_it = product_type_map.find(item.type);
    if (product_type_value_it != product_type_map.end()) {
      type_total = product_type_value_it->second;
    }

    selected.push_back(item);
    total += item.value;
    product_manufacturer_map[item.manufacturer] = man_total + item.value;
    product_type_map[item.type] = type_total + item.value;
  }

  if(total < params.min_value)
  {
    return {};
  }

  auto max_value = std::max_element(selected.cbegin(), selected.cend(), [](auto a, auto b) {
    return a.value < b.value;
  });
  if(max_value->value > static_cast<Amount>(params.high_value_max * static_cast<double>(total))) {
    return {};
  }

  for(auto [key, man_total] : product_manufacturer_map)
  {
    if (man_total > static_cast<Amount>(params.high_man_max * static_cast<double>(total))) {
      return {};
    }
  }

  for(auto [key, type_total] : product_type_map)
  {
    if (type_total > static_cast<Amount>(params.high_type_max * static_cast<double>(total))) {
      return {};
    }
  }

  return selected;
}

While severely limited due to the difficulty of constraints 4 - 6, the example will still produce output for cases where it can find solutions. The full example can be built and run using the following commands:


cd e2/
mkdir -p build
cd build
cmake ..
make
./greedy_example_2

This will output the following:


Valid Parameters
Value: 15
Weight: 19
Volume: 16
Man Types: 0.666667
Prod Types: 0.666667

Chosen:
[
    {
        "manufacturer": "a",
        "type": "p1",
        "value": 10,
        "volume": 10,
        "weight": 10
    },
    {
        "manufacturer": "c",
        "type": "p4",
        "value": 3,
        "volume": 2,
        "weight": 5
    },
    {
        "manufacturer": "c",
        "type": "p3",
        "value": 2,
        "volume": 4,
        "weight": 4
    }
]

There is an example input set included that can be used as follows:


./greedy_example_2 ../bad_case.json ../bad_case_params.json

This will produce a result; however, that result will be invalid due to a violation of the sixth constraint.


Invalid Parameters
Value: 21
Weight: 20
Volume: 17
Man Types: 0.857143
Prod Types: 0.571429

Chosen:
[
    {
        "manufacturer": "p1",
        "type": "a",
        "value": 9,
        "volume": 10,
        "weight": 10
    },
    {
        "manufacturer": "p1",
        "type": "c",
        "value": 6,
        "volume": 4,
        "weight": 4
    },
    {
        "manufacturer": "p1",
        "type": "c",
        "value": 3,
        "volume": 2,
        "weight": 5
    },
    {
        "manufacturer": "p2",
        "type": "c",
        "value": 3,
        "volume": 1,
        "weight": 1
    }
]

This case highlights the limitations of the greedy algorithm in facilitating more complex constraints for partitioning items into sets S_i.

Artificial Intelligence Limitations

Another common approach to solving complex problems is the use of Artificial Intelligence (AI). However, for our problem AI solutions are generally not suitable for several reasons.

  1. The input items to an AI system are often assumed to be independent of each other. However, in this partitioning problem, each constraint involves multiple items. For example, constraint 1 states that the sum of weights must not exceed a certain threshold. This introduces interdependency between items as the weight of one item can affect the selection of other items.
  2. An AI system learns to produce the correct output through exposure to a large sample of training data, or via a reward function in the case of reinforcement learning. It cannot be directly informed by well defined mathematical formulas. Generating sample input data could be rather simple in Crystal:
    
    items = Array(Item).new
    
    num_items.times do |x|
      items << Item.new(
        rand(value_range),
        rand(weight_range),
        rand(volume_range),
        prod_type_list[rand(num_prod_types)],
        man_type_list[rand(num_man_types)])
    end
    This data may not be the most representative of the data one may actually receive in a production environment. Likewise, while generating input data may be rather trivial, correctly producing the corresponding output data given the input would pose a significant challenge given the complexity of our problem. Naturally, this makes supervised learning more difficult. One could instead seek to use reinforcement learning given the following reward function. ...
  3. AI models can be fragile to changes in the problem. If any of the constraint thresholds (e.g. CW_{max}) were to change the model would need to be re-trained to support the update. Effectively, the model would be locked to a specific set of constraints and constraint thresholds.

You can find the full example input data generator referenced above on GitHub.

Constraint Programming Solver

Another lesser known approach comes from the field of constraint programming and linear optimization, which includes Satisfiability (SAT) solvers, Constraint Programming (CP) Solvers, and Mixed Integer Programming (MIP) Solvers. These solvers provide a systematic approach to defining and solving problems with constraints and optimization objectives. The foundation for these solvers is mathematical, and therefore the interface resembles mathematical formula definitions more than conventional programming. While the more mathematical definitions may prove challenging for some, they provide a uniquely powerful definition interface allowing complex problems to be resolved. A solver will try to maximize what is known as "branch pruning" to limit the available search space based on the defined constraints. In doing so, the search space of a complex problem can be greatly reduced. SAT solvers represent a simple foundation upon which CP Solvers are built. A SAT solver operates in boolean space, whereas a CP Solver extends to Integer space. Finally, a MIP Solver may use a mixture of integers and decimals depending on the implementation.

SAT Solver

A SAT solver is formally defined as a boolean satisfiability problem solver. Given a set of variables V and a boolean formula f_b(V), the SAT solver searches for valid solutions of the boolean formula. A valid solution is defined as a set of values of V for which f_b(V) = true. Therefore there may be one, many, or no solutions to a given formula.

...

Boolean — A variable which can take one of 2 values, typically true or false.

Boolean Formula — A formula using boolean variables and operators (e.g. ¬, ∨, ∧).

For example, one could define a simple formula in terms of the following variable definitions:

... ...

Which would have only a single solution:

  • x = true, y = false

The following outlines reducing the formula using the solution values. It resolves to true and therefore meets our criteria for a valid solution.

... ... ...

Whereas the following formula would have many solutions:

... ...
  • x = true, y = true
  • x = true, y = false
  • x = false, y = true

Alternatively, the following formula would have no solution:

... ...

SAT solvers can solve far more complex boolean formulas and, while useful, require problems to be defined in boolean space. This limitation requires our problems to be expressed in terms of boolean variables and boolean functions. While boolean space may be adapted to a wide variety of problems (as computers themselves use boolean logic at the lowest level), it can prove difficult and tedious to reason with and formulate problems in. Ideally, in a similar fashion to computers, a higher level of abstraction can be used to provide an easier to reason with interface for formulating more complex problems.

CP Solver

A CP Solver is similar to a SAT solver and is comprised of 3 main input components:

  • Variables — V = {v_1, v_2, ..., v_{|V|}}
  • Constraints — C = {c_1, c_2, ...,c_{|C|}}
  • Objectives — O = {o_1, o_2, ..., o_{|O|}}

However, unlike SAT solvers, the variables V are defined within the Integer domain. Each variable will need to be defined with a lower and upper bound of permitted values.

... ...

Constraints C are likewise defined as linear formulas composed of constants and variables from V. One can define any number of constraints as desired. Each constraint uses one of the relations <, ≤, =, ≠, >, ≥. Given the following variables, a couple of valid constraints would be:

... ... ...
  • C_{e_1}: x + 1 ≤ y - 5
  • C_{e_2}: x ≠ y

As stated, these constraints must remain in the linear domain, such that no multiplication or division between variables is permitted. There are extensions that may permit searching beyond linear space; however, that topic is out of scope for this discussion. Despite this limitation, rather complex constraints can be defined.

Since a CP Solver is confined to the Integer domain, extra care is required in variable definition and constraint formation. If a fractional variable or constant is required, some domain transformation will be needed. For example, given the constraint:

...

All the constants can be converted to integers by simply multiplying by 12 (the lowest common denominator), resulting in the following:

...

As for variables, a similar transformation can be performed to adjust their domain. For example, suppose a variable has the domain x in [0.1, 5.6]. Multiplying the domain by 10 pulls the variable into the Integer domain, resulting in x redefined as x in [1, 56]. One would need to remain cautious regarding value precision. If the expected value of x was 1.45, the transformation would make this 14.5. However, since this is entirely in the Integer domain, the actual value would be truncated to 14. Therefore, the transformation is not merely about converting the upper and lower bounds of a variable's domain, but also about the tolerated level of precision, beyond which values will be truncated. Secondly, if this transformation is applied to one variable, it will likely be necessary to apply it to all variables and constraints to maintain the variable relationships. A general rule, if fractional numbers may be present in your problem, is to apply a large scaling factor (e.g. 1000) to all constraints. For example, the above equation would be adjusted as follows (keeping in mind the domains of y and z will be adjusted by a factor of 10 as well):

...

Finally, the objectives O can be defined. They are very similar to constraints, except that an objective contains no inequality. Instead, one chooses an optimization target for the formula, e.g. minimize or maximize. For example, one could define an optimization objective as:

...

Objectives are quite similar to constraints, in that all the same restrictions apply (e.g. the Integer domain). To decide between a constraint and an objective, one should rely on domain expertise and determine whether the formula represents a limit (e.g. should be no greater than) or a primary goal (e.g. should be as small as possible). In the case of the former, the formula should be a constraint; for the latter, an objective.

CP Solver Example

With that in mind, let us proceed with applying a CP Solver to the simplified version of the example problem. First, the variables for this problem are defined. For this problem, and for grouping problems generally, one variable is defined per item in U.

...

Each variable has a domain of v_i in [0,1]. The interpretation of the variable values is as follows: v_i = 0 indicates that u_i is not part of the current set; conversely, v_i = 1 indicates that u_i is a member of the current set. The specific reason for this encoding will become more apparent shortly during the constraint definition. The variable domain definition is done as follows:


Group find_grouping(const vector<Item> & items, const Parameters & params) {
    CpModelBuilder model_builder;
    vector<IntVar> within_pool(items.size());
    const Domain domain_from_zero(0, 1);
    for(std::size_t i = 0; i < items.size(); ++i)
    {
        within_pool[i] = model_builder.NewIntVar(domain_from_zero);
    }
    // ...
}

With the variables defined, the next step is to define the constraints. As with the greedy algorithm, this first example will only consider the first 2 constraints and the optimization objective.

  1. The first constraint restated: ... While this constraint is valid, in order to be useful for a CP Solver it must be transformed. The current constraint is written in terms of S_i (which is defined as output), whereas it should be rewritten in terms of U and V (the input). While the initial formula was illustratively useful, in order to effectively determine the output, one must define it based on the input. Naturally, this transformation needs to maintain a similar meaning in order to preserve the original purpose (i.e. limiting the weight of the generated group S_i). Since V is defined as whether each item is within the set, one can effectively determine whether u_i is within the set by checking v_i. In this case, the following 2 simple multiplication principles will be used: ... ... This means the value v_i times u_i.weight will either be u_i.weight or 0, depending on whether the item is within the set or not. The constraint can therefore be re-written as: ... All items that are in the set will resolve to 1 times u_i.weight, whereas all terms not within the set will resolve to 0 times u_i.weight. Finally, since the individual terms are summed, a term of 0 will not impact the final result. This simple, yet powerful transformation will therefore be applied to the other constraints as well.
  2. The second constraint can be re-written as follows: ...

The updated first and second constraint in code:


Group find_grouping(const vector<Item> & items, const Parameters & params) {
  CpModelBuilder model_builder;

  const int64 scaling_factor = 1000;

  vector<IntVar> within_pool(items.size());
  vector<int64> weight_scaled(items.size());
  vector<int64> volume_scaled(items.size());

  const Domain domain_from_zero(0, 1);
  for(std::size_t i = 0; i < items.size(); ++i)
  {
    within_pool[i] = model_builder.NewIntVar(domain_from_zero);
    weight_scaled[i] = static_cast<int64>(items[i].weight) * scaling_factor;
    volume_scaled[i] = static_cast<int64>(items[i].volume) * scaling_factor;
  }

  // 1. SUM(v_i x u_i.weight) <= w_max
  model_builder.AddLessOrEqual(
    LinearExpr::WeightedSum(within_pool, weight_scaled),
    static_cast<int64>(params.max_weight) * scaling_factor);

  // 2. SUM(v_i x u_i.volume) <= v_max
  model_builder.AddLessOrEqual(
    LinearExpr::WeightedSum(within_pool, volume_scaled),
    static_cast<int64>(params.max_volume) * scaling_factor);

  //...
}

Note that, as mentioned above, a scaling factor may be necessary if floating point numbers are used, to convert them into the Integer domain. In this case a scaling factor of 1000 is applied to both sides of the inequality. While this scaling factor may not be strictly necessary in this case, it will be necessary in the later example and is thus easier to add from the start. One could easily nullify it by setting the scaling factor to 1.

Finally, the optimization objective function must also be re-written in terms of the input rather than the output:

...

Similar to the above constraints the optimization objective is defined as:


Group find_grouping(const vector<Item> & items, const Parameters & params) {
  CpModelBuilder model_builder;

  const int64 scaling_factor = 1000;

  vector<IntVar> within_pool(items.size());
  vector<int64> value_scaled(items.size());

  const Domain domain_from_zero(0, 1);
  for(std::size_t i = 0; i < items.size(); ++i)
  {
    within_pool[i] = model_builder.NewIntVar(domain_from_zero);
    value_scaled[i] = static_cast<int64>(items[i].value) * scaling_factor;
  }

  // MAX(SUM(v_i x u_i.value))
  model_builder.Maximize(LinearExpr::WeightedSum(within_pool, value_scaled));

  //...
}

Once all the variables, constraints, and objective function have been defined, the model can be set up and run. First, the model is limited to at most 4 workers and a run time of at most 3 minutes. These limits can be adjusted or omitted as needed. The following code shows how they can be added:


Model model;

SatParameters parameters;
parameters.set_num_search_workers(4);

int max_time = 3 * 60;
parameters.set_max_time_in_seconds(max_time);
model.Add(NewSatParameters(parameters));

Next, the CP Solver can be run to search for solutions. By default the solver will search for the best solution and stop once one is found (or the time limit is reached).


const CpSolverResponse response = SolveCpModel(model_builder.Build(), &model);

After the CP Solver is run, it will return a response. The status of the response identifies whether a solution was found. A status of Optimal or Feasible indicates a solution was found; all other statuses indicate no solution was found. In the case of Optimal, the CP Solver has determined the solution is the best possible, whereas Feasible means the solution satisfies the constraints but is not necessarily the best possible. One can extract a solution from the response as follows:


if (
  response.status() != CpSolverStatus::OPTIMAL &&
  response.status() != CpSolverStatus::FEASIBLE)
{
  // no solution
  return {};
}

vector<Item> selected;
for (size_t i = 0; i < items.size(); ++i)
{
  if (SolutionIntegerValue(response, within_pool[i]) >= 1)
  {
    selected.push_back(items[i]);
  }
}
return selected;

If the status indicates there is a valid solution, it can then be retrieved using the value of each variable in V to determine whether v_i is 0 or 1. Again, when v_i = 1 the item (u_i) is included in the group; otherwise the item is excluded. The full example can be run using the following:


cd e3/
mkdir -p build
cd build
cmake ..
make
./cpsolver_example_1

The result of which will be the following:


Resp Status: OPTIMAL
Valid Parameters
Value: 15
Weight: 19
Volume: 16

Chosen:
[
    {
        "value": 10,
        "volume": 10,
        "weight": 10
    },
    {
        "value": 3,
        "volume": 2,
        "weight": 5
    },
    {
        "value": 2,
        "volume": 4,
        "weight": 4
    }
]

CP Solver Example Full

Let us proceed with adding the remaining 4 constraints (constraints 3 - 6). Similar to the previous example, these constraints need to be converted from output to input definitions. The updated constraints are as follows:

  1. The total value of a shipment must exceed CV_{min}. ... This can be defined as:
    
    // 3. SUM(v_i x u_i.value) > v_min
    model_builder.AddGreaterThan(
      LinearExpr::WeightedSum(within_pool, value_scaled),
      static_cast<int64>(params.min_value) * scaling_factor);
  2. The highest value item within the shipment cannot exceed CPV_{max} as a percentage of the total value of the group. This is more difficult, since the max value of the output cannot easily be determined from the input. One potential solution is to simply force the maximum value into the group by definition. ... Here the max value in U is found in advance and the domain for that variable is set as [1, 1], ensuring the item is placed in the pool. This can be done as follows:
    
    auto max_value = std::max_element(items.cbegin(), items.cend(), [](auto a, auto b) {
      return a.value < b.value;
    });
    
    auto max_index = static_cast<size_t>(std::distance(items.begin(), max_value));
    const Domain domain_from_one(1, 1);
    
    for(size_t i = 0; i < items.size(); ++i)
    {
      // ...
      if (i == max_index) {
        within_pool[i] = model_builder.NewIntVar(domain_from_one);
      }
      // ...
    }
    
    // 4. MAX(v_i x u_i.value) / SUM(p_i.value x c_i) < high_value_max
    // Rewritten as: MAX(v_i x u_i.value / high_value_max) < SUM(v_i x u_i.value)
    model_builder.AddLessOrEqual(
      LinearExpr::Term(
        within_pool[max_index],
        static_cast<int64>(max_value->value) *
          static_cast<int64>(static_cast<double>(scaling_factor) / params.high_value_max)),
      LinearExpr::WeightedSum(within_pool, value_scaled));

    While this will find valid solutions, it limits the solver to only consider a subset of solutions; specifically, ones that contain the current max value item of U. An alternative approach would be to loosen the constraint and instead apply it to all items in U. This effectively creates a new set of constraints, 1 per item in U. ... This may provide a reasonable replacement for constraint 4. Given v_i.value > v_j.value, and assuming i and j are both within the set, the higher value (v_i) will always yield a higher result when divided by the same denominator. This means that if a lower value (v_j) in S_a were to fail this inequality, then the higher value (v_i) would also fail. And conversely, if the highest value in S_a succeeds, all other values in the set would also succeed. This is of course quite inefficient when one has a large set of items U, since it requires an additional constraint per item in U (|U|). The following code implements the loosened version of constraint 4:

    
    for(size_t i = 0; i < items.size(); ++i)
    {
      model_builder.AddLessOrEqual(
        LinearExpr::Term(within_pool[i], static_cast<int64>(items[i].value) *
          static_cast<int64>(static_cast<double>(scaling_factor) / params.high_value_max)),
        LinearExpr::WeightedSum(within_pool, value_scaled));
    }

    There is of course another way to define this constraint: by defining a new variable and making use of the MaxEquality constraint. The new variable is bound by the MaxEquality to be the max item within the provided set, and can therefore be used to define constraint 4. First, one must define the list of v_i times u_i.value terms. The new variable max_value_var is created with a domain of [0, max_{U}(u_i.value)]. Using MaxEquality, max_value_var is set to max_{U}(v_i times u_i.value). Then max_value_var is used in place of the max item in the constraint.

    
    vector<LinearExpr> val_sets;
    val_sets.reserve(items.size());
    
    // ...
    for(size_t i = 0; i < items.size(); ++i)
    {
      // ...
      val_sets.push_back(LinearExpr::Term(within_pool[i], static_cast<int64>(items[i].value)));
    }
    // ...
    
    const Domain domain_max_var(0, static_cast<int64>(max_value->value));
    auto max_value_var = model_builder.NewIntVar(domain_max_var);
    
    // Set max_value_var == max(val_sets)
    model_builder.AddMaxEquality(max_value_var, val_sets);
    
    // 4. MAX(v_i x u_i.value) / SUM(p_i.value x c_i) < high_value_max
    // Rewritten as: MAX(v_i x u_i.value / high_value_max) < SUM(v_i x u_i.value)
    model_builder.AddLessOrEqual(
      LinearExpr::Term(max_value_var,
        static_cast<int64>(static_cast<double>(scaling_factor) / params.high_value_max)),
      LinearExpr::WeightedSum(within_pool, value_scaled));

    While this does work, it could potentially increase the complexity of the search space to such an extent that it becomes prohibitive. In such cases, the original solution of forcing the inclusion of the max item into S_i may be appropriate. For this example all 3 approaches are included for completeness. Changing the variable constraint_four_setting alternates between the three methods.

  3. The aggregated value per product type cannot exceed CPT_{max} as a percentage of the total value of the group, where PT_i is the list of groupings of U by product type. ... As this constraint applies for each product type, 1 constraint is required per product type. Naturally, with a large number of product types this constraint may become impractical. The following code defines the constraint:
    
    set<string> product_types;
    
    for(size_t i = 0; i < items.size(); ++i)
    {
      // ..
      product_types.insert(items[i].type);
    }
    
    // ..
    
    // 5. SUM(v_i x u_i.value if p_i.product_type == type) / SUM(v_i x u_i.value) <= type_value_max
    // Rewritten as: SUM(v_i x u_i.value / type_value_max if p_i.product_type == type) <= SUM(v_i x u_i.value)
    for (auto prod_type : product_types)
    {
      vector<IntVar> product_type;
      vector<int64> coeff;
      for(size_t i = 0; i < items.size(); ++i)
      {
        if (items[i].type != prod_type)
        {
          continue;
        }
    
        product_type.push_back(within_pool[i]);
        coeff.push_back(
          static_cast<int64>(items[i].value) * static_cast<int64>(
            static_cast<double>(scaling_factor) / params.high_type_max));
      }
    
      model_builder.AddLessOrEqual(
        LinearExpr::WeightedSum(product_type, coeff),
        LinearExpr::WeightedSum(within_pool, value_scaled));
    }
    The unique set of product types is collected in product_types. Then, for each unique product type, the items of that type are collected and the constraint is created for that individual product type.
  4. The aggregated value per manufacturer cannot exceed CPM_{max} as a percentage of the total value of the group, where M_i is the list of groupings of U by manufacturer. ... Similar to the product type constraint (5), the manufacturer constraint requires 1 constraint per manufacturer. Once again, an excessive number of unique manufacturers can increase the complexity of the problem. The constraint set can be created as follows:
    
    set<string> manufacturer_types;
    for(size_t i = 0; i < items.size(); ++i)
    {
      // ..
      manufacturer_types.insert(items[i].manufacturer);
    }
    
    // ..
    
    // 6. SUM(p_i.value x c_i if p_i.manufacturer == manufacturer) / SUM(p_i.value x c_i) <= man_value_max
    // Rewritten as: SUM(p_i.value / man_value_max x c_i if p_i.manufacturer == manufacturer) <= SUM(p_i.value x c_i)
    for (auto manufacturer_type : manufacturer_types)
    {
      vector<IntVar> man_value;
      vector<int64> coeff;
      for(size_t i = 0; i < items.size(); ++i)
      {
        if (items[i].manufacturer != manufacturer_type)
        {
          continue;
        }
    
        man_value.push_back(within_pool[i]);
        coeff.push_back(
          static_cast<int64>(items[i].value) * static_cast<int64>(
            static_cast<double>(scaling_factor) / params.high_man_max));
      }
    
      model_builder.AddLessOrEqual(
        LinearExpr::WeightedSum(man_value, coeff),
        LinearExpr::WeightedSum(within_pool, value_scaled));
    }

The full code example can be viewed on GitHub and run using the following:


cd e4/
mkdir -p build
cd build
cmake ..
make
./cpsolver_example_2

The output of which will be:


Resp Status: OPTIMAL
Valid Parameters
Value: 18
Weight: 18
Volume: 15
Man Types: 0.5
Prod Types: 0.666667

Chosen:
[
    {
        "manufacturer": "a",
        "type": "p1",
        "value": 9,
        "volume": 10,
        "weight": 10
    },
    {
        "manufacturer": "c",
        "type": "p1",
        "value": 3,
        "volume": 2,
        "weight": 5
    },
    {
        "manufacturer": "c",
        "type": "p2",
        "value": 3,
        "volume": 2,
        "weight": 2
    },
    {
        "manufacturer": "c",
        "type": "p2",
        "value": 3,
        "volume": 1,
        "weight": 1
    }
]

MIP Solver

As a bonus, to highlight the difference between the CP SAT Solver interface and the MIP Solver interface provided by OR-Tools, an additional solution is available. You can find a full example of the same problem solved using the MIP Solver interface. During the implementation, 2 limitations of the MIP Solver were found.

  1. One cannot easily perform strict greater than or less than comparisons. So constraint 3 was changed to ≥ instead of strictly >.
  2. No facility is provided to perform a MaxEquality for constraint 4, as was done in the CP SAT Solver. Therefore only the forced max item and the loosened max versions of the constraint are provided.

Otherwise, the remaining functionality was generally comparable.


cd e5/
mkdir -p build
cd build
cmake ..
make
./cpsolver_example_3

This should output the following:


Valid threads
Resp Status: MPSOLVER_OPTIMAL
Valid Parameters
Value: 18
Weight: 18
Volume: 15
Max Percent of total: 0.5
Man Types: 0.5
Prod Types: 0.666667

Chosen:
[
    {
        "manufacturer": "a",
        "type": "p1",
        "value": 9,
        "volume": 10,
        "weight": 10
    },
    {
        "manufacturer": "c",
        "type": "p1",
        "value": 3,
        "volume": 2,
        "weight": 5
    },
    {
        "manufacturer": "c",
        "type": "p2",
        "value": 3,
        "volume": 2,
        "weight": 2
    },
    {
        "manufacturer": "c",
        "type": "p2",
        "value": 3,
        "volume": 1,
        "weight": 1
    }
]

Benchmarks

Ideally, it would be possible to directly compare the greedy approach to the CP Solver or MIP Solver. However, a simple comparison of run times is insufficient, since the quality of the output is an important factor in the value of a specific method. For example, the greedy algorithm often finishes much faster. For 10,000 items the greedy algorithm took 0.027s, but had a total value of 400, whereas the CP Solver for the same item list took on average 19.63s and found an optimal solution with a value of 4888 (over 10 times better). Finally, the MIP Solver took 1.755s and also found an optimal solution with a value of 4888. One could compare the CP SAT Solver with the MIP Solver, as both in this case were using approximately the same constraints and found the same optimal solution. However, when the CP SAT Solver uses the MaxEquality version of constraint 4, it takes on average 20.18s and results in an even better valued pool of 5099. Since this is not feasible for the MIP Solver, one would have to choose between speed and optimal output.

How can both the CP SAT Solver (when using MaxEquality for constraint 4) and the MIP Solver find different but optimal solutions? As noted earlier, if the max valued item from U is forced into the group, the search space is restricted to only those subsets that contain the max item. In this case the most optimal solution is one that does not contain the max valued item from U.

Below is a list of benchmarks for each approach using the same sample input U. The results show that the greedy approach, while fast, is unable to find a valuable solution. The MIP Solver is both fast and able to find a reasonably valuable solution, comparable with the CP SAT Solver using the same forced max version of constraint 4. The CP SAT Solver forced max version is slow, and while it does find an optimal solution, it is easily beaten in terms of speed by the MIP Solver. Finally, the CP SAT Solver with MaxEquality is able to find a more optimal solution at the expense of a longer run time.

Finally, a note regarding the max all (also called loosened max) version of constraint 4. While the implementation works for toy examples, it was unable to finish in a reasonable amount of time for the benchmark. Both the CP Solver and the MIP Solver were given a 30 minute duration. The CP Solver failed to find a solution in that time, while around the 20 minute mark the MIP Solver began to allocate more memory than the laptop could afford (26 GB of 32 GB was available), causing the OS to eventually kill the program. This is not surprising: the loosened version adds 1 constraint per item, each containing a weighted sum over all items, i.e. on the order of |U|² terms. While no solution was achieved, this does highlight the care one must take when implementing constraints. A constraint that does not scale well can drastically slow down the program and even exhaust certain system resources entirely.

Approach               Size    Status    Value  Time (seconds)
Greedy                 10000   Feasible  400    0.027
CP Solver ForceMax     10000   Optimal   4888   19.63
CP Solver MaxEquality  10000   Optimal   5099   20.18
CP Solver MaxAll*      10000   Unsolved  0      1800
MIP Solver ForceMax    10000   Optimal   4888   1.755
MIP Solver MaxAll*†    10000   Unsolved  0      1200

Approach               Size    Status    Value  Time (seconds)
Greedy                 50000   Feasible  300    0.166
CP Solver ForceMax     50000   Optimal   6645   86.96
CP Solver MaxEquality  50000   Optimal   7133   169.23
MIP Solver ForceMax    50000   Optimal   6645   22.07

Approach               Size    Status    Value  Time (seconds)
Greedy                 100000  Feasible  300    0.269
CP Solver ForceMax     100000  Optimal   7205   110.89
CP Solver MaxEquality‡ 100000  Optimal   8364   478.56
MIP Solver ForceMax    100000  Optimal   7205   32.62

* For this case the time limit was increased to 30 minutes to permit the solver to find a solution.

† Unable to complete as the available memory was completely exhausted and the program was killed.

‡ For this case the time limit was increased to 10 minutes to permit the solver to find a solution.

Conclusion

A distinct class of problems was presented: given a universe of items U, find an optimal subset S_i. The subset S_i must follow a set of user defined constraints; otherwise a subset is considered invalid and is not a permissible candidate. Finally, an optimization function is used to locate the most suitable subset.

Given the complexity of the problem, it may not be possible to reliably determine an optimal solution. However, it would be ideal to at least attempt to find better solutions, rather than just randomly organizing items into a subset until the constraints are satisfied. To explore this, an example problem was defined.

A greedy algorithm approach proves suitable for the simple set of constraints and objective function. This solution is rather easy to define and is able to efficiently find solutions. However, with the introduction of more complicated constraints the greedy approach becomes quite limited without complex changes. Sorting the candidate items no longer offers a reasonable means to locate the best candidates, and so solutions become less optimal. In fact, in some cases no solution is readily available, due to the inability to check the complex constraints during the iterative construction of the subset. While workarounds may be feasible, they would introduce additional complexity and may still not find a suitable solution.

With the greedy algorithm shown to be ineffective, an alternative solution was introduced: CP Solvers. CP Solvers generally lend themselves towards mathematically defined problems. Likewise, they provide a useful platform for defining variables, constraints, and optimization objectives. For the example problem, the CP Solver is able to implement all simple and complex constraints. Naturally, the CP Solver presents challenges of its own, which were touched upon. However, the benefit of a CP Solver is that any solution it finds is guaranteed to fulfill the constraints and be valid. Furthermore, the CP Solver can tell us whether the current solution is optimal, or whether there is even a feasible solution given the input items and constraints.

CP Solvers have several key advantages:

  • Systematic and well defined platform for solving certain problems.
  • Constraint thresholds can easily be adjusted.
  • Highly scalable through configuring the number of available workers.

As mentioned, a CP Solver is not without its drawbacks, some of which include:

  • Implementation can be rather difficult given the complex, mathematical style of definition.
  • A CP Solver, while powerful, requires constraints to remain linear.
  • Complex or poorly constructed models may require exceedingly long search times.

Overall CP Solvers are a unique and powerful tool for solving challenging problems. As shown, for certain problems CP Solvers provide one of the few tools for effectively reaching a solution. While CP Solvers may appear a bit challenging, our team of experts are more than happy to assist you in solving your complex problem. Otherwise, if you have any questions please feel free to contact us.

Discuss on Hacker News or send us your thoughts.