Sunday, March 16, 2008

Don't override Equals

A colleague had a problem the other day which turned out to be due to an overridden Equals operator. In this case it was a straightforward bug in the implementation, but after he saw my horror struck face I had to introduce him to the whole 'don't override Equals' philosophy[1]. On the pretext that you've not come across it, here's the argument in full:

  • You have two objects that came from different places, and need to know if they represent essentially the same data.
  • You can't override Equals unless you also override GetHashCode. If two objects are equal, they must have the same hashcode, or collections are screwed.
  • GetHashCode must return the same value for an instance throughout it's lifetime, or Hashtable's are screwed
  • Your object isn't readonly, so you need an immutable field in the instance to base the hashcode on.
  • But if you modify one instance's data to equal another, that field can't change, so the hashcodes are still different.
  • You're screwed
And that's without getting into the problems associated with a correct implementation of Equals in the first place (getting the reflexive, symmetric and transitive bit right). Generally speaking some kind of IsEquivilent method is a whole heap less trouble, but it depends what you're up to. You might think about loading your objects through some kind of registry, so references to the 'same' data actually end up pointing to the same instance. Then everything just works...

More reading:

UPDATE 10/04/08: Some clarifications: I'm talking about not overriding Equals/GetHashCode for reference types here. It's not such a problem for value types [as IDisposable points out in the comments]. And I've futher clarified some of my assertions about GetHashTable in the comments.

[1] PS: Like all advice, this has exceptions. But the chances are they don't apply in your case. No, really.

8 comments:

Shrike said...

this statement:
"GetHashCode must return the same value for an instance throughout it's lifetime, or Hashtable's are screwed"

Where does it arise from?

See msdn article for Object.GetHashCode() method

There is no such requirement there.

So all your following conclusions seem to be wrong.

IDisposable said...

I've written extensively about Equals and GetHashCode here and corrected some issues here.

Three thoughts:

1) A value-type IS immutable since you are aways working with a copy; thus GetHashCode() and HashTable will get along fine.

2) A reference-type, you have to insure immutability. Tread carefully.

3) Even ninjas make mistakes in GetHashCode, so treat very carefully.

Shrike said...

2IDisposable:
Could you, blog author or anybody else show a code which demonstrates that changing object's hashcode during its staying in Hashtable/Dictionary leads to any issues?
Without that any talks about necessarity of computation hashcode based only on immutable object's data have no sense (imho).

piers7 said...

Ok, I didn't go into too much depth into GetHashCode in the original article, because it was meant to be a high-level summary of the issues, rather than a definative analysis. There's lots of material out there, of which I've linked only a few articles, which is probably a bit slack of me.

To be really exact, the requirement is that an object's doesn't change it's hashcode whilst it's used as a key in a Hashtable. See the article I linked in the original post: The Rules for GetHashCode [http://www.interact-sw.co.uk/iangblog/2004/06/21/gethashcode ] and the doco for Hashtable [http://msdn2.microsoft.com/en-us/library/system.collections.hashtable.aspx ] (remembering the doco is not always right). You'll note this requirement is a function of the hashtable, and not of GetHashCode as such.

Why the immutability? Because it's the hashcode that the hashtable uses to work out which bucket to store the value into, and the hashcode that's used to locate the bucket to determine if a value has already been stored in the hashtable. If the hashvalue for the key changes, this all breaks, and the key can end up being in the hashtable more than once and/or irretrievable.

I've seen this happen: a system that I worked on had an object context implementation a bit like NHibernate, but due to a faulty Equals / GetHashCode implementation ended up leaking objects [they weren't in the hashtable under their original key, so didn't get cleaned up properly as I recall]

Plus you only have to look at some of the descriptions of how to write a decent GetHashCode implementation (and IDisposable has linked some great examples in this post) to realise that - frankly - you just don't want to go there.

Part of the problem here is that they're overridable, so people assume that they are meant to be overridden. In reality, there's only really any point overriding them for value types: for reference types the defaults are just fine thanks.

[Ok, maybe the hashcodes are sequential, which is less than ideal, but leave that as Microsoft's problem unless you've got lots of spare time on your hands. Professional developers: haven't you got some paid work to do?]

Fundamentally my point is that if all you want to do is know if two objects are 'samey', why fight to make the system thing they're the same object if you don't have to.

Shrike said...

It's obvious that GetHashCode's problem connected with staying in Hashtable.
I asked for code which would demonstrate the problem. Not for another long story ;)
So here it is:

using System;
using System.Collections;
using System.Collections.Generic;

class Item
{
public int Value;

public Item(int v)
{
Value = v;
}
public override bool Equals(Object o)
{
if (o == null)
return false;
if (o.GetType() != GetType())
return false;
Item o2 = (Item)o;
return o2.Value == Value;
}

public override int GetHashCode()
{
return Value.GetHashCode();
}
}

public class Tester
{
public static void Main()
{
testWithHashTable();
testWithDictionary();
}

private static void testWithHashTable()
{
Console.WriteLine("Hashtable test");
Hashtable t = new Hashtable();
Item i1 = new Item(1);
t[i1] = i1;

showItem(t, i1);
i1.Value = 2;

Console.WriteLine("After changing item's value");
showItem(t, i1);

Item i2 = new Item(1);
showItem(t, i2);

i2 = new Item(2);
showItem(t, i2);
}

private static void testWithDictionary()
{
Console.WriteLine("\nDictionary test");
Dictionary<Item,Item> d = new Dictionary<Item,Item>();

Item i1 = new Item(1);
d[i1] = i1;

showItem(d, i1);
i1.Value = 2;

Console.WriteLine("After changing item's value");
showItem(d, i1);

Item i2 = new Item(1);
showItem(d, i2);

i2 = new Item(2);
showItem(d, i2);
}

private static void showItem(IDictionary d, Item i)
{
if (d.Contains(i))
Console.WriteLine("found: " + (d[i] as Item).Value);
else
Console.WriteLine("not found");
}
}

Output is:
Hashtable test
found: 1
After changing item's value
not found
not found
not found

Dictionary test
found: 1
After changing item's value
not found
not found
not found

So we see that the object with changed value has lost!

If we comment Item's methods Equals & GetHashCode, then output changes to:
Hashtable test
found: 1
After changing item's value
found: 2
not found
not found

Dictionary test
found: 1
After changing item's value
found: 2
not found
not found

One possible method to solve the problem (in the case we really need to treat two different objects as "the same") is to reset object in the dictionary after changing hashcode depending values:
Console.WriteLine("Hashtable test");
Hashtable t = new Hashtable();
Item i1 = new Item(1);
t[i1] = i1;

showItem(t, i1);
i1.Value = 2;
t[i1] = i1;

Console.WriteLine("After changing item's value");
showItem(t, i1);

Item i2 = new Item(1);
showItem(t, i2);

i2 = new Item(2);
showItem(t, i2);

so output will be:
Hashtable test
found: 1
After changing item's value
found: 2
not found
found: 2

it's correct and expected behavior.

Shrike said...

btw. this link http://www.interact-sw.co.uk/iangblog/2004/06/21/gethashcode
is a bit obsolete because MSDN now contains deffirent requirements for GetHashCode (see http://msdn2.microsoft.com/en-us/library/system.object.gethashcode.aspx):
"The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method"

not as one mentioned in linked article:

"The hash function must return exactly the same value regardless of any changes that are made to the object".

piers7 said...

When you say one solution "is to reset object in the dictionary after changing hashcode depending values", what you've actually done is duplicated the reference in the hashtable, once in the bucket determined by the previous hash, and once in the bucket determined by the new hash. I seem to recall if you loop on the hashtable you'll see this.

This has the potential to cause some unexpected behaviour (to put it mildly) for example: objects never garbage collected (as 'lost' reference still accessible from hashtable); unexpected duplicates etc...

Besides which, asking people who mutate your object to re-add it to any and all hashtable's it's been stored in seems a bit of a fragile arrangement.

Shrike said...

Ok, I agree that "reset" should be "re-add". Dublicating is evil.
Clients will be dissapointed by such requirements for sure. But if you write some internal code where client is only you, why not.

Popular Posts