Computer-Aided Pension Investment Decision Making
Ron Surz, Managing Director, Roxbury Capital Management
The technological revolution that characterizes our entrance into the twenty-first century will have a significant influence on the way pension plan sponsors make investment decisions. Computing power is now available on book-size computers that did not exist, even on huge mainframes, as little as 10 years ago. This chapter describes new tools that take advantage of this computing power and explains how plan sponsors will benefit from them.
Changes to investment programs are made on the basis of new information or as a result of something not working. Evaluations of the contemplated changes assure that they are warranted. These evaluations fall into three broad categories: policy, policy management, and asset management.
For each of these three aspects of performance evaluation, we examine current practices and describe innovations that capitalize on the technological revolution.
The effects of investment policy are measured as the return and risk of a customized policy index. A policy index is constructed by first establishing benchmarks for each asset class, such as the S&P 500 for stocks or the Salomon Broad Index for bonds. These are then combined using the policy targets as weights to form a customized index. Unfortunately, this policy effect is rarely evaluated and, in fact, is almost never considered in performance evaluations. Rather, the difference between the policy return and the actual return is presented as the performance to be scrutinized, with the focus on manager influences. In actuality, most of this differential piece is due to policy management rather than asset management, as discussed in the next section.
To remedy this situation, policy should be presented separately as an important and integral component of return, such as shown in the following table:
Components of Return
Also, policy is most appropriately evaluated relative to the investor's needs and risk tolerance. This is a subjective judgment that cannot be quantified. However, measuring progress toward the achievement of objectives is one way to help make this judgment. Aggressiveness should adapt as objectives are attained or missed. Policies are evaluated in light of their continuing suitability and likelihood of achieving objectives. They should not be evaluated relative to those of other sponsors because those sponsors have different needs and risk tolerances; that is, they're striving for different goals. Computer tools continue to be perfected to match needs and risk tolerances with the appropriate investment policy.
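The policy index construction described in this section can be sketched in a few lines. This is a minimal illustration only: the asset classes, policy targets, benchmark returns, and actual fund return below are hypothetical numbers, not figures from the chapter.

```python
# Hypothetical policy targets and asset-class benchmark returns for one period.
policy_weights = {"stocks": 0.60, "bonds": 0.40}
benchmark_returns = {"stocks": 0.10, "bonds": 0.05}


def policy_index_return(weights, returns):
    """Combine asset-class benchmark returns using the policy targets as weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("policy weights must sum to 1")
    return sum(weights[asset] * returns[asset] for asset in weights)


policy_return = policy_index_return(policy_weights, benchmark_returns)
actual_return = 0.095  # hypothetical actual total fund return
difference = actual_return - policy_return  # the piece usually scrutinized

print(f"policy {policy_return:.2%}, actual {actual_return:.2%}, "
      f"difference {difference:.2%}")
```

Reporting the policy return on its own line, next to the difference, is what keeps the policy effect from being overlooked.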
Investment policies provide latitude for drifts away from policy targets. Policy managers decide how investment programs operate within this latitude. For example, most plans operated near the high end of their acceptable equity allocation and the low end of their real estate allocation during calendar 1996. Variances from policy such as these dwarf manager effects, but are almost never measured and are evaluated even less frequently. Perhaps this is because the policy managers and the evaluators are usually the same people: investors and their consultants. Policy managers are sometimes referred to as "managers of managers." To effectively manage the managers, these decision makers must have a yardstick for measuring their own success or failure. Otherwise, they're navigating without a compass.
Measuring the policy management effect and its components can be complex because of the interactions between managers and sponsors. The first step in separating out the effects of these interactions is to identify the key investment decisions and who is responsible for them. There are three key decision areas, all related to asset allocation: (1) managers, (2) asset types (stocks, bonds, etc.), and (3) individual securities. Responsibility for each decision area breaks out as follows:
Investment Decision Responsibility
Investment decisions are made by persons responsible for:
Accordingly, the sponsor should receive full attribution for how assets are allocated to managers, and the managers should receive full responsibility for the securities they select. The grey area arises in asset type allocation, since the sponsor knows how the managers are committed to asset types when he allocates assets to them. This grey area can be broken down into black and white areas with a proper methodology that recognizes that the effects of manager and sponsor decisions manifest themselves as performance differentials away from a neutral, or central, situation.
The tools for attributing performance to both sponsor and manager fall into three general categories:
These tools are finding their way into the toolbox of widely accepted disciplines.
Benchmarks are tools for measuring control states - what would have happened in the absence of a particular event. For example, the S&P 500 is frequently used as a benchmark for what would have happened in the absence of active management decisions. To achieve performance attribution, we need six such performance benchmarks. Two are directly sponsor-related, two are directly manager-related, and two are hybrids since they are related to both sponsor and manager decisions. We develop these performance benchmarks by having three control variables, against which we apply either the actual situation or a neutral situation. Since there are 3 x 2 (three variables times two "states of nature") such combinations, there are six benchmarks. The neutral situations are those developed by the plan sponsor in establishing investment policies; accordingly, they are called policy states. The control variables are (1) allocation of monies to the managers, (2) manager allocation to asset classes, and (3) asset class performance.
The form of each benchmark is:
The six performance benchmarks and their symbols (which we'll use throughout the rest of the chapter) are as follows:
These benchmarks are used to calculate attribution measures, as described in the next section.
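One plausible way to read the three control variables is as a nested weighted sum: manager weights applied to each manager's asset class mix, applied in turn to asset class returns, with each variable set to either its actual or its policy state. The sketch below works under that assumption. It is not the chapter's own formula, it does not map to the chapter's benchmark symbols, and every name and number in it is hypothetical.

```python
# Hypothetical two-manager, two-asset-class fund.
# Each control variable has a "policy" (neutral) state and an "actual" state.
manager_weights = {
    "policy": {"A": 0.5, "B": 0.5},
    "actual": {"A": 0.6, "B": 0.4},
}
asset_mix = {  # each manager's allocation to asset classes
    "policy": {"A": {"stocks": 0.6, "bonds": 0.4},
               "B": {"stocks": 0.6, "bonds": 0.4}},
    "actual": {"A": {"stocks": 0.7, "bonds": 0.3},
               "B": {"stocks": 0.5, "bonds": 0.5}},
}
class_returns = {  # return earned within each asset class
    "policy": {"A": {"stocks": 0.10, "bonds": 0.05},
               "B": {"stocks": 0.10, "bonds": 0.05}},
    "actual": {"A": {"stocks": 0.12, "bonds": 0.05},
               "B": {"stocks": 0.09, "bonds": 0.06}},
}


def benchmark(weight_state, mix_state, return_state):
    """Assumed form: sum over managers of weight * (asset mix * class return)."""
    weights = manager_weights[weight_state]
    mix = asset_mix[mix_state]
    returns = class_returns[return_state]
    return sum(
        weights[m] * sum(mix[m][a] * returns[m][a] for a in mix[m])
        for m in weights
    )


all_policy = benchmark("policy", "policy", "policy")
all_actual = benchmark("actual", "actual", "actual")
print(f"all-policy {all_policy:.2%}, all-actual {all_actual:.2%}")
```

Switching one argument at a time between "policy" and "actual" isolates the effect of that single decision, which is the idea behind a control state.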
There are two forms of attribution measurement - timing and selectivity. Timing measures the effect of changing allocations, while selectivity measures the effect of deviating from passive, or index, management. Both forms measure the value added or subtracted relative to strict adherence to policy guidelines.
Some basic, or fundamental, measurements can be calculated and used to derive more complex, or derivative, measurements. The derivative measures serve to further delineate manager and sponsor effects. These measurements are as follows:
Performance Attribution Measurements
Although these relationships may seem complex, they provide a rich resource for those wishing to calculate performance attribution, as well as a variety of consistency checks. For example, the fact that the attribution components sum to the actual total return assures that the measurements are consistent with "the whole equaling the sum of its parts." Additionally, the sponsor and manager attribution measures are consistent with logic. The sponsor attribution measure of "S-s+P" says that the sponsor "owns" the investment policy (P) plus any performance differential resulting from non-neutral weightings of the managers (S-s). This makes sense because it captures what is directly under the sponsor's control. Similarly, the manager measure of "s-P" also makes sense since it attributes only the managers' discretionary deviations from policy to the managers.
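The "whole equals the sum of its parts" check can be verified directly from the two expressions quoted above, since (S - s + P) + (s - P) reduces algebraically to S. The following sketch uses hypothetical values for S, s, and P purely to show the arithmetic.

```python
def attribution(S, s, P):
    """Split return into the sponsor and manager measures quoted in the text."""
    sponsor = S - s + P  # policy plus the effect of non-neutral manager weightings
    manager = s - P      # managers' discretionary deviations from policy
    return sponsor, manager


# Hypothetical benchmark returns, for illustration only.
S, s, P = 0.094, 0.090, 0.080
sponsor, manager = attribution(S, s, P)

# Consistency check: the two components must add back to S.
assert abs((sponsor + manager) - S) < 1e-12
print(f"sponsor {sponsor:.2%} + manager {manager:.2%} = {sponsor + manager:.2%}")
```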
Because this attribution is new and unique, unique ways for evaluating these measurements are appropriate, as discussed in the next section.
Evaluating the Measurements
The attribution measurements produce numbers which need to be interpreted. In the jargon of the statistician, the decision-maker must determine if the result is "significant" before he operates on the basis of that information. This section presents some ways to evaluate the following five key attribution measures:
Attribution Measures to Evaluate
The framework for evaluating these measures is the one used in classical statistics, where the decision maker estimates the probability that the observed result is due merely to chance, rather than to skill or the absence of skill. This framework can be applied to all of the measures except policy. To evaluate the policy effect, the sponsor needs to review his risk tolerances and diversification opportunities; policy can only be evaluated with regard to its appropriateness in light of the emerging needs of the sponsor's plan. The other measures are evaluated relative to the available opportunities. For example, the sponsor-timing measure, "T-t," is evaluated by developing the probability distribution of possible (T-t)s. This is straightforward, since the only variable in play here is the allocation among the managers. Accordingly, an opportunity range is developed that reflects the permissible ranges of allocations among the managers. For example, if there were two managers with permissible allocations of 30-70% in each, the evaluation distribution would encompass the performance outcomes from all possible allocations conforming to these guidelines (e.g., (30, 70), (40, 60), (50, 50), etc.). For simplicity, all such allocations can be assumed to have an equal likelihood, or one could devise more complex allocation probabilities. The evaluation of sponsor timing is then accomplished by determining the ranking of the actual measurement relative to this opportunity set. Sponsor selectivity is also based on manager over/underweightings, and is evaluated against a similar opportunity set. Similarly, manager timing ("t-P") is evaluated by developing the opportunity set of possible asset class commitments within the managers' guidelines. This is best achieved by evaluating each manager individually.
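The two-manager example can be sketched as follows. The sketch enumerates every permissible allocation in 1% steps, treats each as equally likely, and ranks the outcome of the actual allocation within that opportunity set. The manager returns and the actual allocation are hypothetical, and for simplicity the sketch ranks the fund return itself; the same ranking logic would apply to a (T-t) measure.

```python
manager_returns = {"A": 0.12, "B": 0.06}  # hypothetical period returns
low, high = 30, 70                        # permissible percent in each manager
actual_pct_in_a = 60                      # hypothetical actual allocation to A


def fund_return(pct_in_a):
    """Return of the two-manager fund when pct_in_a percent is placed with A."""
    w = pct_in_a / 100
    return w * manager_returns["A"] + (1 - w) * manager_returns["B"]


# With two managers, A's share fixes B's share; both must stay within 30-70%.
opportunities = [
    fund_return(pct)
    for pct in range(low, high + 1)
    if low <= 100 - pct <= high
]

actual = fund_return(actual_pct_in_a)
# Equal likelihood: the rank is the share of opportunities matched or beaten.
rank = sum(r <= actual for r in opportunities) / len(opportunities)
print(f"actual {actual:.2%} ranks at the {rank:.0%} level of the opportunity set")
```

Unequal allocation probabilities could be introduced by weighting each opportunity rather than counting it once.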
The overall policy management effect "S-s" is evaluated relative to all opportunities afforded by policy guidelines. Accordingly, a policy management opportunity distribution is constructed as all possible allocations that could be made pursuant to policy guidelines. The policy management return is then put into perspective against this opportunity set. Exhibit 1 demonstrates this approach in risk/return terms. As can be seen, the example shows the evaluation of a policy management effect that has increased both return and risk during the period. The actual total return is also shown. The difference between actual return and policy management return is the asset management effect, which is discussed next.
Virtually all of our attention over the course of performance evaluation has been focused on evaluating managers, yet we still haven't gotten it right. Evaluation universes are loaded with well-documented biases, so our judgments of good or bad performance are frequently wrong. Nor have we succeeded in properly identifying the sources of return, which is the job of performance attribution. Attributions are dominated by complex factors that only ultra quants can understand.
So how do you know if your investment manager is doing a good job? The answer to this question relies upon the answer to another question: "Relative to what?" The usual method of evaluating equity performance is to compare it with a stock market index such as the S&P 500 or the Dow Jones. For reasons we'll describe, these indexes are not appropriate benchmarks for evaluating investment performance. Similarly, comparisons with the performance of similar portfolios can be highly misleading due to the significant biases inherent in such "peer groups." Proper performance evaluation requires an accurate and unbiased standard -- a standard that did not exist until recently. This new standard is a scientific approach called Portfolio Opportunity Distributions (PODs). PODs compare a portfolio's performance results against those of other portfolios wholly without management, responding to the "Relative to what?" question with the answer "Relative to all possible implementations of the portfolio's strategy." It is the distribution of "s" in the framework described above. The following explains the problems with current performance evaluation approaches and examines how the new standard overcomes these problems.
The Problems with Benchmarks
Despite their common usage, market indexes are generally very poor performance yardsticks, or benchmarks. This is due to the fact that most managers -- at one time or another -- adhere to a specific investment style, such as value or growth, and these styles come in and out of favor. These style effects eventually smooth out over time, but there can be relatively long periods when this smoothing has not yet occurred, such as shown in the following exhibit:
The exhibit shows returns for various stock style indexes. As can be seen, a given performance result can easily be judged as good or bad compared with the broad market merely because of its style orientation. For example, a specialist in managing portfolios of small companies is at a distinct disadvantage over the time period shown in Exhibit 2, so a comparison with the broad market provides an unfair evaluation. The small company market segment, identified as small cap in Exhibit 2, earned a mere 5.8% return per year, lagging the total market return of 14.1% by a significant margin; this underperformance occurred over a fairly long period of 10.75 years. However, this history of underperformance by a given segment of the market does not mean that this segment should be abandoned by investors. Quite the contrary, periods of underperformance are frequently followed by periods of outperformance. Also, enhanced diversification is achieved by combining uncorrelated asset groups such as those shown in the following exhibit.
As can be seen in Exhibit 3, the small cap segments of the market are generally uncorrelated with their larger cap counterparts. Accordingly, they make good diversification partners for these counterparts. For example, the residual correlation of -.421 between small growth and mid cap value says that returns on these two sectors tend to move in opposite directions relative to the market.
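A residual correlation of this kind can be sketched as below, assuming that "residual" means a segment's return minus the market return in each period (a regression residual would be a natural alternative). The return series are hypothetical and chosen only to produce a negative result; they are not the data behind Exhibit 3.

```python
from math import sqrt

# Hypothetical period returns for the market and two style segments.
market = [0.02, -0.01, 0.03, 0.00, 0.01]
small_growth = [0.05, -0.04, 0.03, 0.02, -0.01]
mid_value = [0.00, 0.01, 0.03, -0.01, 0.03]


def residuals(segment, benchmark):
    """Segment return relative to the market, period by period (assumed definition)."""
    return [s - b for s, b in zip(segment, benchmark)]


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


residual_corr = pearson(residuals(small_growth, market),
                        residuals(mid_value, market))
print(f"residual correlation: {residual_corr:.3f}")
```

A negative value means the two segments tend to be on opposite sides of the market in the same period, which is what makes them useful diversification partners.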
A fairer and more accurate evaluation than that provided by market indexes is achieved by using custom benchmarks designed to capture the essence of a manager's investment approach. A practice that is gaining in popularity combines style benchmarks in proportion to the manager's history of style mix. The manager's style mix can be determined by returns-based style analyses developed by Professor William F. Sharpe of Stanford University. This blended style index results in a better benchmark, but still leaves the evaluator with the job of interpreting the performance difference between the portfolio performance and that of the benchmark. The issue here is one of significance: How much outperformance constitutes meaningful success? Studies show that current statistical techniques require at least 10 years of performance history to achieve confident inferences of success or failure against custom benchmarks. Of course, in most cases the management team has changed enough during this time to render the success/failure inferences invalid; in other words, the people who were responsible for the long track record have frequently moved on.
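Returns-based style analysis looks for the mix of style indexes, with non-negative weights summing to one, whose blended return tracks the manager most closely. The sketch below uses a coarse grid search in place of the quadratic programming normally used, and it minimizes the variance of the difference between manager and blend. The style names and all return series are hypothetical.

```python
from itertools import product
from statistics import pvariance

# Hypothetical monthly returns for three style indexes and one manager.
style_returns = {
    "large_value":  [0.010, -0.020, 0.030, 0.015, -0.005, 0.020],
    "large_growth": [0.020, -0.030, 0.040, 0.005, 0.010, 0.025],
    "small_cap":    [0.030, -0.050, 0.020, 0.000, 0.015, 0.010],
}
manager_returns = [0.015, -0.025, 0.035, 0.010, 0.0025, 0.0225]

steps = 20  # grid resolution: weights move in 5% increments
names = list(style_returns)
best_weights, best_variance = None, float("inf")

for i, j in product(range(steps + 1), repeat=2):
    k = steps - i - j
    if k < 0:
        continue  # weights must be non-negative and sum to 1
    weights = (i / steps, j / steps, k / steps)
    blend = [
        sum(w * style_returns[name][t] for w, name in zip(weights, names))
        for t in range(len(manager_returns))
    ]
    tracking = [m - b for m, b in zip(manager_returns, blend)]
    variance = pvariance(tracking)
    if variance < best_variance:
        best_weights, best_variance = weights, variance

style_mix = dict(zip(names, best_weights))
print(style_mix)
```

The resulting weights define the blended style index; the unexplained tracking difference is what the evaluator still has to interpret.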
The Problems with Peer Groups
Peer groups solve this waiting-time problem, but have a whole set of other problems. Comparing your portfolio's performance to that of other managed portfolios with the same style, such as a mutual fund database, is called a "peer group" comparison. Professional performance evaluators have used this approach for the past three decades. The idea is to give the manager a report card based on his ranking among competitors with the same style of management by assembling a universe of similar managers.
Critics of the peer group approach have documented various biases that render evaluations based on such universes meaningless. Three of these biases are classification, composition, and survivorship.
All these biases cause performance yardsticks based on peer groups to be unpredictably too long or too short, so that managers are frequently fired or retained for the wrong reasons.
Portfolio Opportunity Distributions: A New Standard
A new scientific approach known as Portfolio Opportunity Distributions (PODs) eliminates these biases and, in so doing, creates a superior backdrop for performance evaluations. PODs harness together today's computing power with classical statistics to generate all the possible portfolios a manager could conceivably hold in line with the manager's own unique decision processes.
The basis for PODs is that, in common practice, the statistician constantly compares his results with those expected purely by chance. By applying this concept to performance evaluation, POD generates thousands of simulated portfolios at random, drawn from the manager's normal universe of stocks, using the manager's portfolio construction rules. This assures that the resulting opportunity distribution fairly reflects the manager's decision processes. For example, PODs can generate all the portfolios that a manager using the following combination of portfolio construction rules could conceivably hold:
The resulting distribution provides a grading system that shows the full range of results (or opportunities) that could have been achieved by the manager while eliminating the biases inherent to peer group universes. These PODs have a custom benchmark as their median, with fractiles around the median representing degrees of success or failure. A ranking in the top decile of a POD universe gives the statistician 90% confidence that the return was not merely random, but a significant indication of success. Similarly, a ranking in the bottom decile is a significant indicator of failure. The statistician commonly defines significance as an event that can be interpreted with 90%, or more, confidence.
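The simulation idea can be sketched as follows. This is a simplified illustration, not the method of any POD provider: the universe, its returns, the single construction rule (30 equally weighted holdings), the number of simulations, and the manager's return are all hypothetical.

```python
import random
from statistics import median

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical universe: 200 stocks with made-up period returns.
universe = {f"STOCK{i:03d}": rng.gauss(0.08, 0.20) for i in range(200)}


def random_portfolio_return(n_holdings=30):
    """Equal-weighted return of n_holdings stocks drawn without replacement."""
    picks = rng.sample(sorted(universe), n_holdings)
    return sum(universe[name] for name in picks) / n_holdings


# The opportunity distribution: returns of many randomly built portfolios.
pod = sorted(random_portfolio_return() for _ in range(5000))
pod_median = median(pod)  # plays the role of the custom benchmark


def percentile_rank(portfolio_return):
    """Fraction of simulated portfolios that this return matched or beat."""
    return sum(r <= portfolio_return for r in pod) / len(pod)


actual_return = 0.15  # hypothetical manager result
rank = percentile_rank(actual_return)
value_added = actual_return - pod_median
is_significant_success = rank >= 0.90  # top decile
is_significant_failure = rank <= 0.10  # bottom decile
print(f"median {pod_median:.2%}, value added {value_added:.2%}, rank {rank:.0%}")
```

A realistic version would replace the single rule with the manager's own universe and construction rules, such as position limits and sector constraints.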
Evaluation against a POD universe tells the investor whether the observed performance result was good or bad relative to the unique opportunities available. As discussed earlier, no index, benchmark, or peer group universe can provide this insight. Further investigations into the reasons for success or failure -- such as attribution analyses and manager interviews -- can reveal the manager's level of skill.
POD universes replace the need for peer groups and custom benchmarks. Valid inferences of success or failure are made immediately, without bias -- providing fairness, accuracy, and timeliness far superior to current approaches. Furthermore, the science has been extended beyond U.S. borders so that superior evaluations can be achieved for international investment programs. This is particularly important because international investing is on the rise, creating a growing need for accurate performance evaluations that cannot be met by other current approaches.
The Methodology and Results of PODs
PODs are available through several service providers, including Effron, Ibbotson, Möbius, Thomson, and Zephyr. Generally, POD universes are very similar to large managed universes when such universes exist. Because of survivorship biases, this similarity diminishes over longer time periods, with managed medians tending to exceed POD medians. When managed universes are small, as is the case with non-U.S. markets, PODs are materially different and substantially more accurate. This is especially true for international specialties such as Pacific Basin ex-Japan and Large Value. Also, after-tax PODs can be tailored to fit an investor's unique situation.
Ideally, PODs are constructed by carefully defining the manager's investment universe, decision rules, and portfolio construction processes. These inputs are then used to computer-generate all possible portfolios. As a practical matter, these inputs are seldom available, so certain building blocks have been established to facilitate POD construction. Manager styles are defined as a mix of more than 200 market segments spanning the globe, as shown in Exhibit 4. Market segments are first defined by geography, such as U.S., Europe, or Japan. Then within each geographic region, nine style groups are based on size and orientation, such as large growth, small value, etc.* Industry groups are also created within regions, as shown in the bottom panel of Exhibit 4. Once a manager's style is defined in terms of these building blocks, a corresponding POD universe is generated. For example, we can create a universe unique to all of the opportunities available to a manager with the following style: 26% mid cap/value/U.S., 36% large cap/growth/Japan, and 38% World ex-Japan and U.S.
The middle, or median, of a POD universe is the manager's custom benchmark; it captures the manager's essence. The difference between the actual return and this median is the value added or subtracted by security selection and style rotation. The ranking within a POD universe is the significance of this value added or subtracted.
A real-life example is presented in Exhibit 5, which shows a POD-based evaluation for a manager with a European mandate. The exhibit shows the EAFE index as the risk/reward origin, along with all the opportunities available given the manager's mandate over the 4.5 years ending June 30, 1996. The manager's performance exceeds the EAFE return, while offering lower risk exposure than EAFE. Exhibit 5 puts this outperformance into perspective by contrasting it with the opportunities available to the manager's mandate. The dots shown in the exhibit represent a thousand opportunities available to the manager's European mandate, spanning the whole range of potential portfolio returns - a feat that cannot be accomplished with existing peer groups. As can be seen in the exhibit, the manager has delivered returns that are relatively average in both risk and reward. In other words, the manager has performed in line with expectations relative to the European mandate, but has not hit the home run suggested by comparison with the EAFE index.
Exhibit 6 shows U.S. and international POD universes for the year ending September 30, 1996. Using the data presented here, you can rank your own manager's performance within the appropriate universe. For example, let's say your international return is 8%. This puts you near the median for the total opportunity set available outside the U.S. (middle panel). But did your manager succeed or fail during the year? Inspection of the table shows that with a return of 8%, large growth managers would be considered a success, whereas large value managers would have failed. The bottom panel of Exhibit 6 can be used to further evaluate the currency component of your manager's international return.
Asset Manager Attribution
Once the evaluator has made a valid determination of success or failure, attribution of the reasons for either is in order. Given the current awareness of the importance of style, a good attribution analysis should proceed along the following lines:
*See the Appendix for the guidelines used to create style groups.
Exhibit 7: Performance Attribution Using Styles
This table offers a sharp contrast to approaches that use families of complex factors or economic sectors to perform attributions. It's much easier to understand, and more powerful at explaining successes and failures.
All the pieces of the performance puzzle have been put on the table for proper measurement and evaluation. If you arrange them correctly, your reward is superior control of your investment program, which should lead to accelerated achievement of your objectives. New tools have been developed that make this task much easier than ever before. Exhibit 8 shows how the measurements can be integrated into a comprehensive overview. The left side of the table shows the impact of the performance pieces on asset growth. The right side keeps track of the objective, which in this case is funding liabilities.
Exhibit 8: Pension Plan Asset & Liability Growth
As this chapter shows, evaluations can - and should - be customized to all three pieces of the performance puzzle. Policy is evaluated in light of the needs and risk tolerance of the investor. Policy management is evaluated relative to the opportunities afforded by the latitudes of the investment policy. Asset management is evaluated against the unique Portfolio Opportunity Distribution available to each individual manager.
Today's technology gives investors superior control over their investment programs. All they need to do is avail themselves of these powerful new tools.
APPENDIX: STYLE GROUPINGS
Style groupings are based on data provided by Compustat. Two security databases are used. The U.S. database covers more than 7500 firms, with total capitalization exceeding $7 trillion. The non-U.S. database coverage exceeds 5500 firms, 20 countries, and $9 trillion - substantially broader than EAFE.
To construct style groupings, we first break the Compustat database for the region into size groups based on market capitalization, calculated by multiplying shares outstanding by price per share. Beginning with the largest capitalization company, we add companies until 60% of the entire capitalization of the region is covered. This group of stocks is then categorized as "large cap" (capitalization). For the U.S. region, this group currently comprises 240 stocks, all with capitalizations in excess of $6 billion. The second size group represents the next 35% of market capitalization and is called "mid cap". Finally, the bottom 5% is called "small cap", or "mini cap".
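The size breakout can be sketched as a walk down the list of companies ranked by capitalization. The company names and capitalizations are hypothetical, and the treatment of the company that crosses a cutoff (here it is placed in the larger group) is an assumption about the boundary rule.

```python
def size_groups(market_caps, large_cut=0.60, mid_cut=0.95):
    """Label companies large/mid/small cap by cumulative share of total cap.

    market_caps maps company name to capitalization
    (shares outstanding times price per share).
    """
    total = sum(market_caps.values())
    groups = {}
    covered = 0  # capitalization already covered by larger companies
    ranked = sorted(market_caps.items(), key=lambda item: item[1], reverse=True)
    for name, cap in ranked:
        if covered < large_cut * total:
            groups[name] = "large cap"  # still building up the first 60%
        elif covered < mid_cut * total:
            groups[name] = "mid cap"    # the next 35%
        else:
            groups[name] = "small cap"  # the bottom 5%
        covered += cap
    return groups


# Hypothetical capitalizations, in billions of dollars.
caps = {"A": 400, "B": 250, "C": 150, "D": 100, "E": 60, "F": 30, "G": 10}
groups = size_groups(caps)
print(groups)
```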
Then, within each size group, a further breakout is made on the basis of orientation. Value, core, and growth stock groupings within each size category are defined by establishing an aggressiveness measure. Aggressiveness is a proprietary measure that combines dividend yield and price/earnings ratio. The top 40% (by count) of stocks in aggressiveness are designated as "growth," while the bottom 40% are called "value," with the 20% in the middle falling into "core."
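The orientation breakout within a size group can be sketched in the same spirit. Because aggressiveness is a proprietary measure, the scores below are hypothetical placeholders; only the 40/20/40 split by count comes from the text.

```python
def orientation_groups(aggressiveness, tail_share=0.40):
    """Label the top 40% by count as growth, the bottom 40% as value, rest core."""
    ranked = sorted(aggressiveness, key=aggressiveness.get, reverse=True)
    n_tail = int(len(ranked) * tail_share)
    groups = {}
    for position, name in enumerate(ranked):
        if position < n_tail:
            groups[name] = "growth"
        elif position >= len(ranked) - n_tail:
            groups[name] = "value"
        else:
            groups[name] = "core"
    return groups


# Hypothetical aggressiveness scores for ten stocks in one size group.
scores = {f"S{i}": score for i, score in enumerate(
    [9.1, 8.4, 7.7, 6.9, 5.5, 5.0, 4.2, 3.3, 2.1, 1.0])}
orientation = orientation_groups(scores)
print(orientation)
```

Applying this within each of the three size groups yields the nine style groups used as building blocks.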