Monday, December 15, 2008

Elo Rating System

The Elo rating system is a method for calculating the relative skill levels of players in two-player games such as chess and Go.
"Elo" is often written in capital letters (ELO), but it is not an acronym. It is the family name of the system's creator, Arpad Elo (1903–1992, born as Élő Árpád), a Hungarian-born American physics professor.
The Elo system was originally devised as an improved chess rating system, although it is now used in many other games. It is also used as a rating system for competitive multi-player play in a number of computer games, and has been adapted to team sports including association football, American college football and basketball, and Major League Baseball.

Arpad Elo was a master-level chess player and an active participant in the United States Chess Federation (USCF) from its founding in 1939.[1] The USCF used a numerical ratings system, devised by Kenneth Harkness, to allow members to track their individual progress in terms other than tournament wins and losses. The Harkness system was reasonably fair, but in some circumstances gave rise to ratings which many observers considered inaccurate. On behalf of the USCF, Elo devised a new system with a more statistical basis.

Elo's system replaced earlier systems of competitive rewards with a system based on statistical estimation. Rating systems for many sports award points in accordance with subjective evaluations of the 'greatness' of certain achievements. For example, winning an important golf tournament might be worth an arbitrarily chosen five times as many points as winning a lesser tournament.
A statistical endeavor, by contrast, uses a model that relates the game results to underlying variables representing the ability of each player.
Elo's central assumption was that the chess performance of each player in each game is a normally distributed random variable. Although a player might perform significantly better or worse from one game to the next, Elo assumed that the mean value of the performances of any given player changes only slowly over time. Elo thought of a player's true skill as the mean of that player's performance random variable.
A further assumption is necessary, because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and say, "That performance is 2039." Performance can only be inferred from wins, draws and losses. Therefore, if a player wins a game, he is assumed to have performed at a higher level than his opponent for that game. Conversely if he loses, he is assumed to have performed at a lower level. If the game is a draw, the two players are assumed to have performed at nearly the same level.
Elo did not specify exactly how close two performances ought to be to result in a draw as opposed to a win or loss. And while he thought it likely that each player might have a different standard deviation to his performance, he made a simplifying assumption to the contrary.
To simplify computation even further, Elo proposed a straightforward method of estimating the variables in his model (i.e., the true skill of each player). One could calculate relatively easily, from tables, how many games a player is expected to win based on a comparison of his rating to the ratings of his opponents. If a player won more games than he was expected to win, his rating would be adjusted upward, while if he won fewer games than expected his rating would be adjusted downward. Moreover, that adjustment was to be in exact linear proportion to the number of wins by which the player had exceeded or fallen short of his expected number of wins.
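The linear adjustment described above can be sketched in a few lines. This is only an illustration: the function name `adjust_rating` and the default multiplier `k=32` are assumptions made here, and the expected number of wins is passed in directly rather than looked up in Elo's tables.

```python
def adjust_rating(old_rating, wins, expected_wins, k=32):
    """Move a rating in linear proportion to the surplus or shortfall of wins.

    k is a hypothetical proportionality constant chosen for illustration.
    """
    return old_rating + k * (wins - expected_wins)


# A player expected to win 5 games who actually wins 6 moves up by k points.
print(adjust_rating(1500, wins=6, expected_wins=5))  # 1532
```

A player who wins exactly the expected number of games keeps the same rating under this rule.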
From a modern perspective, Elo's simplifying assumptions are not necessary because computing power is inexpensive and widely available. Moreover, even within the simplified model, more efficient estimation techniques are well known. Several people, most notably Mark Glickman, have proposed using more sophisticated statistical machinery to estimate the same variables. On the other hand, the computational simplicity of the Elo system has proven to be one of its greatest assets. With the aid of a pocket calculator, an informed chess competitor can calculate to within one point what his next officially published rating will be, which helps promote a perception that the ratings are fair.

Implementing Elo's scheme

The USCF implemented Elo's suggestions in 1960,[2] and the system quickly gained recognition as being both fairer and more accurate than the Harkness system. Elo's system was adopted by FIDE in 1970. Elo described his work in some detail in the book The Rating of Chessplayers, Past and Present, published in 1978.
Subsequent statistical tests have shown that chess performance is almost certainly not normally distributed. Weaker players have significantly greater winning chances than Elo's model predicts. Therefore, both the USCF and FIDE have switched to formulas based on the logistic distribution. However, in deference to Elo's contribution, both organizations are still commonly said to use "the Elo system".
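The text does not give the logistic formula itself. A commonly quoted form, used here as an assumption rather than as either federation's exact rule, sets the expected score of player A against player B to 1 / (1 + 10^((Rb - Ra) / 400)). The function name and the `scale` parameter below are illustrative.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under a logistic curve (assumed form)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


print(expected_score(1500, 1500))            # 0.5
print(round(expected_score(1700, 1500), 2))  # 0.76
```

Under this assumed curve a 200-point advantage corresponds to an expected score of roughly three points out of four.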

Different ratings systems

The phrase "Elo rating" is often used to mean a player's chess rating as calculated by FIDE. However, this usage is confusing and often misleading, because Elo's general ideas have been adopted by many different organizations, including the USCF (before FIDE), the Internet Chess Club (ICC), Yahoo! Games, and the now defunct Professional Chess Association (PCA). Each organization has a unique implementation, and none of them precisely follows Elo's original suggestions. It would be more accurate to refer to all of the above ratings as Elo ratings, and none of them as the Elo rating.
Instead one may refer to the organization granting the rating, e.g. "As of August 2002, Gregory Kaidanov had a FIDE rating of 2638 and a USCF rating of 2742." It should be noted that the Elo ratings of these various organizations are not always directly comparable. For example, someone with a FIDE rating of 2500 will generally have a USCF rating near 2600 and an ICC rating in the range of 2500 to 3100.

FIDE ratings

For top players, the most important rating is their FIDE rating. FIDE issues a ratings list four times a year.
The following analysis of the January 2006 FIDE rating list gives a rough impression of what a given FIDE rating means:
19743 players have a rating above 2200, and are usually associated with the Candidate Master title.
1868 players have a rating between 2400 and 2499, most of whom have either the IM or the GM title.
563 players have a rating between 2500 and 2599, most of whom have the GM title
123 players have a rating between 2600 and 2699, all (but one) of whom have the GM title
18 players have a rating between 2700 and 2799
Only 4 players (Garry Kasparov, Vladimir Kramnik, Veselin Topalov and Viswanathan Anand) have ever exceeded a rating of 2800, and none do in the latest (October 2008) list.
The highest ever FIDE rating was 2851, which Garry Kasparov had on the July 1999 and January 2000 lists.
In the whole history of the FIDE rating system, only 48 players (to October 2007), sometimes called "Super-grandmasters", have achieved a peak rating of 2700 or more.

Performance rating

A "performance rating" is a hypothetical rating that would result from the games of a single event only. A performance rating for an event is calculated by taking (1) the rating of each player beaten and adding 400, (2) the rating of each player lost to and subtracting 400, (3) the rating of each player drawn, and (4) summing these figures and dividing by the number of games played.
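The four steps above translate directly into code. This is a minimal sketch: the function name `performance_rating` and the choice to encode results as 1, 0.5 and 0 are assumptions made for illustration.

```python
def performance_rating(results):
    """results: list of (opponent_rating, score), score being 1, 0.5 or 0."""
    if not results:
        raise ValueError("at least one game is required")
    total = 0.0
    for opponent_rating, score in results:
        if score == 1:
            total += opponent_rating + 400   # beaten opponent: rating + 400
        elif score == 0:
            total += opponent_rating - 400   # lost to opponent: rating - 400
        elif score == 0.5:
            total += opponent_rating         # drawn opponent: rating as is
        else:
            raise ValueError(f"unexpected score: {score!r}")
    return total / len(results)


# One win against 2400, one loss to 2600, one draw with 2500.
print(performance_rating([(2400, 1), (2600, 0), (2500, 0.5)]))  # 2500.0
```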

FIDE tournament categories

FIDE classifies tournaments into categories according to the average rating of the players. Each category is 25 rating points wide. Category 1 is for an average rating of 2251 to 2275, category 2 is 2276 to 2300, etc.[3] The highest rated tournaments have been Category 21, with an average from 2751 to 2775. The top categories are as follows:
Category Meaning
15 Average rating is in range 2601 to 2625
16 Average rating is in range 2626 to 2650
17 Average rating is in range 2651 to 2675
18 Average rating is in range 2676 to 2700
19 Average rating is in range 2701 to 2725
20 Average rating is in range 2726 to 2750
21 Average rating is in range 2751 to 2775
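Because every category is 25 points wide and Category 1 starts at 2251, the category can be computed rather than looked up. The sketch below assumes the average rating has already been rounded to a whole number. The function name is illustrative, and the code simply extrapolates the 25-point pattern above Category 21.

```python
def tournament_category(average_rating):
    """Category for a whole-number average rating, or None below 2251."""
    if average_rating < 2251:
        return None
    return (average_rating - 2251) // 25 + 1


print(tournament_category(2251))  # 1
print(tournament_category(2650))  # 16
print(tournament_category(2775))  # 21
```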

Live ratings

FIDE updates its ratings list every three months. In contrast, the unofficial "Live ratings" calculate the change in players' ratings after every game. These Live ratings are based on the previously published FIDE ratings, so a player's Live rating is intended to correspond to what the FIDE rating would be if FIDE were to issue a new list that day.
Although Live ratings are unofficial, interest arose in Live ratings in August/September 2008 when five different players took the "Live" #1 ranking.[4]
The unofficial live ratings are published and maintained by Hans Arild Runde at http://chess.liverating.org . Only players over 2700 are covered.

United States Chess Federation ratings

The United States Chess Federation (USCF) uses its own classification of players: [5]
2400 and above: Senior Master
2200 - 2399: Master
2000 - 2199: Expert
1800 - 1999: Class A
1600 - 1799: Class B
1400 - 1599: Class C
1200 - 1399: Class D
1000 - 1199: Class E
In general, 1000 is considered a bright beginner. A regular competitive chess player might be rated at approximately 1750.
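The classification above amounts to a lookup by rating floor. The following sketch assumes whole-number ratings. The names `USCF_CLASSES` and `uscf_class` are illustrative, and ratings below 1000 return `None` because the list above does not cover them.

```python
# (rating floor, class name), highest floor first, as listed above.
USCF_CLASSES = [
    (2400, "Senior Master"),
    (2200, "Master"),
    (2000, "Expert"),
    (1800, "Class A"),
    (1600, "Class B"),
    (1400, "Class C"),
    (1200, "Class D"),
    (1000, "Class E"),
]


def uscf_class(rating):
    """Return the class name for a rating, or None below 1000."""
    for floor, name in USCF_CLASSES:
        if rating >= floor:
            return name
    return None


print(uscf_class(1750))  # Class B
```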
In the USCF rating system, the K factor can be estimated by dividing 800 by the sum of the effective number of games a player's rating is based on (Ne) and the number of games the player completed in a tournament (m).[6]
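As a quick illustration of this estimate, the sketch below takes the sum Ne + m as the divisor, which is the reading assumed here. The function and parameter names are hypothetical, and this is not a full implementation of the USCF rules.

```python
def estimated_k(effective_games, event_games):
    """Estimate K as 800 / (Ne + m)."""
    total = effective_games + event_games
    if total <= 0:
        raise ValueError("Ne + m must be positive")
    return 800.0 / total


# Ne = 20 effective games, m = 5 games in the current tournament.
print(estimated_k(effective_games=20, event_games=5))  # 32.0
```

Under this estimate, K shrinks as a player accumulates games, so established ratings move more slowly than new ones.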

Ratings of computers

Since 2005–2006, human-computer chess matches have demonstrated that chess computers are stronger than the strongest human players. However ratings of computers are difficult to quantify. There have been too few games under tournament conditions to give computers or software engines an accurate rating.[7] Also, for chess engines, the rating is dependent on the machine a program runs on.
For some ratings estimates, see Chess Engines rating lists.

Game activity versus protecting one's rating

In general, the Elo system has increased the competitive climate for chess and inspired players to study further and improve their game. However, in some cases ratings can discourage game activity among players who wish to "protect their rating".
Examples:
They may choose their events or opponents more carefully where possible.
If a player is in a Swiss tournament, and loses a couple of games in a row, they may feel the need to abandon the tournament in order to avoid any further rating "damage".
Junior players, who may have high provisional ratings and who should really be practicing as much as possible, might play less than they otherwise would because of rating concerns.
In these examples, the rating "agenda" can sometimes conflict with the agenda of promoting chess activity and rated games.[11]
The same clash between game activity and rating concerns is seen on many online servers that have implemented the Elo system. For example, higher-rated players, being much more selective about whom they play, often end up lurking, waiting for "overvalued" opponents to challenge. Such players may also feel discouraged from playing significantly lower-rated opponents, again because of rating concerns. This is one possible anti-activity, anti-social aspect of the Elo rating system that needs to be understood: the agenda of scoring points can interfere with playing with abandon, just for fun.
Interesting from the perspective of preserving high Elo ratings versus promoting rated game activity is a recent proposal by British Grandmaster John Nunn regarding qualifiers based on Elo rating for a World championship model.[12] Nunn highlights in the section on "Selection of players", that players not only be selected by high Elo ratings, but also their rated game activity. Nunn clearly separates the "activity bonus" from the Elo rating, and only implies using it as a tie-breaking mechanism.
When applied to casual online servers, the Elo system has at least two other major practical issues that need tackling: engine abuse and selective pairing.

Chess engines

The first and most significant issue is players making use of chess engines to inflate their ratings. This is particularly an issue for correspondence chess style servers and organizations, where making use of a wide variety of engines within the same game is entirely possible. This would make any attempts to conclusively prove that someone is cheating quite futile. Blitz servers such as the Free Internet Chess Server or the Internet Chess Club attempt to minimize engine bias by clear indications that engine use is not allowed when logging on to their server.

Selective pairing

A more subtle issue is related to pairing. When players can choose their own opponents, they can choose opponents with minimal risk of losing and maximum reward for winning. This luxury of hand-picking opponents is not available in over-the-board play, and it may largely account for the Elo-based ICC ratings that are well over 2800.
Particular examples of 2800+ rated players choosing opponents with minimal risk and maximum possibility of rating gain include: choosing computers that they know they can beat with a certain strategy; choosing opponents that they think are over-rated; or avoiding playing strong players who are rated several hundred points below them, but may hold chess titles such as IM or GM. In the category of choosing over-rated opponents, new-entrants to the rating system who have played less than 50 games are in theory a convenient target as they may be overrated in their provisional rating. The ICC compensates for this issue by assigning a lower K-factor to the established player if they do win against a new rating entrant. The K-factor is actually a function of the number of rated games played by the new entrant.
Elo therefore must be treated as a bit of fun when applied in the context of online server ratings. Indeed, the ability to choose one's own opponents can also have great fun value for spectators watching the very highest rated players. For example, they can watch very strong GMs challenge other very strong GMs who are also rated over 3100. The opposition that the highest-level players face online in order to maintain their ratings is often much stronger than what they would meet in an Open tournament run with Swiss pairings. Additionally, it helps ensure that the game histories of those with very high ratings will often be with opponents of similarly high ratings.
Therefore, Elo ratings online still provide a useful mechanism for providing a rating based on the opponent's rating. Their overall credibility, however, needs to be seen in the context of at least the two major issues described above: engine abuse and selective pairing of opponents.
The ICC has also recently introduced "auto-pairing" ratings which are based on random pairings, but with each win in a row ensuring a statistically much harder opponent who has also won x games in a row. With potentially hundreds of players involved, this creates some of the challenges of a major large Swiss event which is being fiercely contested, with round winners meeting round winners. This approach to pairing certainly maximizes the rating risk of the higher-rated participants, who may face very stiff opposition from players below 3000 for example. This is a separate rating in itself, and is under "1-minute" and "5-minute" rating categories. Maximum ratings achieved over 2500 are exceptionally rare.

Ratings inflation and deflation

The primary goal of Elo ratings is to accurately predict game results between contemporary competitors, and FIDE ratings perform this task relatively well. A secondary, more ambitious goal is to use ratings to compare players between different eras. (See also Greatest chess player of all time.) It would be convenient if a FIDE rating of 2500 meant the same thing in 2005 that it meant in 1975. If the ratings suffer from inflation, then a modern rating of 2500 means less than a historical rating of 2500, while if the ratings suffer from deflation, the reverse will be true. Unfortunately, even among people who would like ratings from different eras to "mean the same thing", intuitions differ sharply as to whether a given rating should represent a fixed absolute skill or a fixed relative performance.
Those who believe in absolute skill (including FIDE[13]) would prefer modern ratings to be higher on average than historical ratings, if grandmasters nowadays are in fact playing better chess. By this standard, the rating system is functioning perfectly if a modern 2500-rated player and a 2500-rated player of another era would have equal chances of winning, were it possible for them to play. The advent of strong chess computers allows a somewhat objective evaluation of the absolute playing skill of past chess masters, based on their recorded games.
Those who believe in relative performance would prefer the median rating (or some other benchmark rank) of all eras to be the same. By one relative performance standard, the rating system is functioning perfectly if a player in the twentieth percentile of world rankings has the same rating as a player in the twentieth percentile used to have. Ratings should indicate approximately where a player stands in the chess hierarchy of his own era.
The average FIDE rating of top players has been steadily climbing for the past twenty years, which is inflation (and therefore undesirable) from the perspective of relative performance. However, it is at least plausible that FIDE ratings are not inflating in terms of absolute skill. Perhaps modern players are better than their predecessors due to a greater knowledge of openings and due to computer-assisted tactical training.
In any event, both camps can agree that it would be undesirable for the average rating of players to decline at all, or to rise faster than can be reasonably attributed to generally increasing skill. Both camps would call the former deflation and the latter inflation. Not only do rapid inflation and deflation make comparison between different eras impossible, they tend to introduce inaccuracies between more-active and less-active contemporaries.
The most straightforward attempt to avoid rating inflation/deflation is to have each game end in an equal transaction of rating points. If the winner gains N rating points, the loser should drop by N rating points. The intent is to keep the average rating constant, by preventing points from entering or leaving the system.
Unfortunately, this simple approach typically results in rating deflation, as the USCF was quick to discover.
A common misconception is that rating points enter the system every time a previously unrated player gets an initial rating, and that likewise rating points leave the system every time someone retires from play. The reasoning goes that, since most players are significantly better at the end of their careers than at the beginning, they tend to take more points away from the system than they brought in, and the system deflates as a result. This is a fallacy and is easily shown. If a system is deflated, players will have strengths higher than their ratings. But if players take points out of the system equal to their strength when they leave, no inflation or deflation will result.
Rather, in the "basic form" of the Elo system, the cause of deflation is the fact that players improve. The cause of inflation is that their strength relative to their rating will tend to decline over time with age. Since most players improve early in their career, the system tends to deflate at that time. Inflation doesn't occur until much later in a player's career. Many players will quit before this natural process occurs, which would return points to the system. The net result over time is deflation.

Example

The following example shows why both of these misconceptions are incorrect. It is a little contrived to keep it simple, but the principles remain the same in other pools irrespective of their complexity.
Suppose there are four 1500-rated players: A, B, C, D. They are all established players, and their ratings are stable. To make the calculations easy, we will calculate rating changes under the old Elo formula with K=32. Suppose that A, B, C, and D are the only four players in the rating pool. The average rating of the pool is, therefore, 1500.
Elo recognized that simply having an improving player causes deflation.
Let's suppose that A decides to study for a while. As a result, his strength increases to a degree that on average he scores 3 out of 4 against B, C, and D. These odds represent roughly a 200 rating point spread.
What we would want the pool to do is this: Since B, C, and D are the same strength as before, their ratings should stay at 1500. A should see his rating go toward 1700.
That is, their performances indicate a strength relative to where they started of 1500 for B, C, and D and 1700 for A.
Let's suppose that they play 10 rated games against each opponent (30 total.) B, C, and D score 50% against each other, but only 25% against A, exactly as outlined above. That means B, C, and D win 12.5 games each, and A wins 22.5 games. For example, on average, B wins 5 of the 10 games against C, 5 of the 10 games against D, and 2.5 of the games against A.
What are their ratings (assuming for ease that we rate this as 1 event) at the end of these encounters? (This example is simplified, but illustrates the point, and the principles hold true even if we treat it as several events.)
The rating formula is:
(W-We) x 32 + Rating old = Rating new,
where W equals the number of wins and We equals the expected number of wins.
Since all players started at 1500, we expect them all to score 50%. The winning expectancy, We, therefore = 15 for all the players.
For B, C, D:
(12.5-15.0) x 32 +1500 = 1420
12.5 points, for B for example, is 5 points against C, 5 against D, and 2.5 against A.
For A:
(22.5 - 15.0) x 32 + 1500 = 1740
What is the average rating of the pool? (1420 + 1420 + 1420 +1740)/4 = 1500.
Hmm...exactly the same as before.
Yet, B, C, D are all rated lower than their actual skill level of 1500. And even if A loses his "40 extra" points back to the pool fairly evenly in another series of games, we would see:
A: 1700
B: 1433
C: 1433
D: 1433
That is, 75% of the players in the pool would be deflated, by 67 points each, even though the average rating of the entire pool is unchanged.
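The arithmetic of this example can be checked with a short script. It is a sketch of the worked example only: the variable names are illustrative, the whole series is rated as one event with K=32, and the final step assumes A's 40 surplus points are returned evenly to B, C and D.

```python
K = 32
GAMES_EACH = 30  # 10 games against each of the 3 opponents

start = {"A": 1500, "B": 1500, "C": 1500, "D": 1500}
wins = {"A": 22.5, "B": 12.5, "C": 12.5, "D": 12.5}

# Equal starting ratings mean everyone is expected to score 50%.
expected_wins = 0.5 * GAMES_EACH

after_event = {p: start[p] + K * (wins[p] - expected_wins) for p in start}
print(after_event)  # {'A': 1740.0, 'B': 1420.0, 'C': 1420.0, 'D': 1420.0}
print(sum(after_event.values()) / len(after_event))  # 1500.0

# A then hands the 40 surplus points back, split evenly among B, C and D.
surplus = after_event["A"] - 1700
settled = {p: (1700 if p == "A" else r + surplus / 3)
           for p, r in after_event.items()}
print({p: round(r) for p, r in settled.items()})
# {'A': 1700, 'B': 1433, 'C': 1433, 'D': 1433}
```

The pool average stays at 1500 throughout, which is the point of the example: a constant average does not prevent individual ratings from drifting below true strength.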

Practical approaches

Because of the significant difference in timing of when inflation and deflation occur, and in order to combat deflation, most implementations of Elo ratings have a mechanism for injecting points into the system in order to maintain relative ratings over time. FIDE has two inflationary mechanisms. First, performances below a "ratings floor" are not tracked, so a player with true skill below the floor can only be unrated or overrated, never correctly rated. Second, established and higher-rated players have a lower K-factor.[9] There is no theoretical reason why these should provide a proper balance to an otherwise deflationary scheme; perhaps they over-correct and result in net inflation beyond the playing population's increase in absolute skill. On the other hand, there is no obviously superior alternative. In particular, on-line game rating systems have seemed to suffer at least as many inflation/deflation headaches as FIDE, despite alternative stabilization mechanisms.

Other chess rating systems

Ingo system, designed by Anton Hoesslinger and published in 1948. Starting in West Germany in 1948, it was used as the official rating system of the German Chess Federation until 1992, when it was replaced by an Elo-based rating system. It influenced some other rating systems.
Harkness System, invented by Kenneth Harkness, who published it in 1956. It was used by the USCF from 1950 to 1960 and by some other organizations.
British Chess Federation Rating System, published in 1958.
Correspondence Chess League of America Rating System.
Glicko rating system
Chessmetrics
In November 2005, the Xbox Live online gaming service proposed the TrueSkill ranking system that is an extension of Glickman's system to multi-player and multi-team games.
From Wikipedia, the free encyclopedia
