Wednesday, June 10, 2015

Memory profiling to show how 'ref' keyword works

I was reviewing some C# with a signature equivalent to:

public void DeDuplicate(ref Set set)
{
   // Do something
}
And I have to admit, I'm not a big fan of the 'ref' keyword here. The argument for using it was 'to save memory'. That triggered me to dig into the code, find out why it was used, and prove that it wasn't necessary.

So the issue we were working on is as follows: we have a large collection of items, deserialized from JSON. Some of the items in the list are duplicates, and we're looking for a way to compress the list. Let's simplify things and assume we're just de-duplicating.

So let's say we're dealing with people:

    [DebuggerDisplay("Person with name = {Name}")]
    internal class Person : IEquatable<Person>
    {
        public string Name { get; private set; }

        // The person is a heavy object ( 1 mb )
        private byte[] data = new byte[1024 * 1024];

        public Person(string name)
        {
            this.Name = name;
        }

        public bool Equals(Person other)
        {
            return other.Name == this.Name;
        }

        public override int GetHashCode()
        {
            return this.Name.GetHashCode();
        }
    }
And here's our collection:

    internal class PersonList
    {
        public List<Person> People { get; private set; }

        public PersonList(IEnumerable<Person> people)
        {
            this.People = people.ToList();
        }
    }

And we need to de-duplicate this:
        internal static class CompressionExtensions
        {
            public static PersonList DeDuplicate(this PersonList list)
            {
                var uniquePeople = list.People.Distinct();
                var newlist = new PersonList(uniquePeople);
                return newlist;
            }

            public static void DeDuplicate(ref PersonList list)
            {
                var uniquePeople = list.People.Distinct();
                var newlist = new PersonList(uniquePeople);
                list = newlist;
            }
        }

At the top is the method signature I proposed; at the bottom is the current implementation.

The developer felt like he needed to pass the reference of the object in order to 'save memory'. That, I think, shows that he doesn't fully understand what's happening in the code. When we're passing the PersonList parameter into the first method, we're not copying the list; we're only passing in a copy of the reference to that list. The list itself contains references to all the items it contains.

In the second method, we're passing in a reference to the reference of the list. He did understand the implications of doing this and leveraged it, by swapping out the reference to the newly created list. However, the memory difference between the two is limited to the allocation of the newlist object, which contains a list of references. Compared to the 1 MB per person, this is negligible.
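The difference between the two parameter-passing styles can be shown with a small self-contained sketch (using List<int> instead of PersonList to keep it short; the method names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

internal class RefDemo
{
    // By value: 'list' is a copy of the caller's reference.
    // Mutating the list would be visible to the caller; reassigning 'list' is not.
    private static void ReassignByValue(List<int> list)
    {
        list = list.Distinct().ToList();   // only the local copy now points elsewhere
    }

    // By ref: 'list' is an alias for the caller's variable,
    // so reassigning it swaps out the caller's reference too.
    private static void ReassignByRef(ref List<int> list)
    {
        list = list.Distinct().ToList();
    }

    private static void Main()
    {
        var numbers = new List<int> { 1, 1, 2 };

        ReassignByValue(numbers);
        Console.WriteLine(numbers.Count);   // prints 3: the caller's variable is untouched

        ReassignByRef(ref numbers);
        Console.WriteLine(numbers.Count);   // prints 2: the duplicate is gone
    }
}
```

Either way, the 1 MB payload of each Person is never copied; only (a reference to) a reference changes hands.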

Can I prove that? Sure I can - let's do a memory profiling:

I've made a person take up 10 MB of RAM in this case, to make the deltas more apparent. But hey - there's no memory being released at all! How is that possible?

Well - that's because I'm running in Debug mode. In Debug mode, the CLR holds on to all the objects I could possibly hover over while in scope of the Main method - so also the list and all of its contents. So let's switch to Release mode:

As expected, we now see a reduction in the memory footprint from 30 MB to 10 MB, when the duplicate John person is removed and later garbage collected.

So now let's see the other implementation - the one that returns a list and doesn't pass in a ref:

Hey - that one jumps down to its original footprint after garbage collection! How's that possible? The answer is that the CLR knows that newlist is never accessed. Therefore it can be collected as soon as it has been assigned.
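This eager collection can also be observed without a profiler, using a WeakReference to check whether an object is still alive. A sketch, under the assumption that no strong reference to the allocation survives into Main (the allocation is kept in a separate, non-inlined method to make sure of that, mirroring the Debug-mode caveat above):

```csharp
using System;
using System.Runtime.CompilerServices;

internal class EagerCollectionDemo
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static WeakReference Allocate()
    {
        var data = new byte[10 * 1024 * 1024];   // stand-in for the 10 MB Person
        return new WeakReference(data);          // tracks the array without keeping it alive
    }

    private static void Main()
    {
        var weak = Allocate();            // no strong reference survives the call
        GC.Collect();
        GC.WaitForPendingFinalizers();
        Console.WriteLine(weak.IsAlive);  // False: the unreferenced array was collected
    }
}
```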

So a slight modification to the code - simply accessing the list after de-duplication - will show the 10 MB drop:

So now, as expected, the GC removes the duplicate item from memory, since newlist no longer references it, and finally it removes the whole object.

So memory usage is not an argument here; it's just the result of not fully understanding how objects are passed around in C#. I can recommend Jon Skeet's post about it for everyone who's struggling.
Actually, returning the new list seems to make it easier for the GC to collect the whole thing, as you can see.

My other argument is that it's way easier to test the code if you still have access to the 'original' list before de-duplication. For instance:

            // Act
            var newList = CompressionExtensions.DeDuplicate(list);

            // Assert
            var numberOfJohnsBefore = list.People.Count(person => person.Name == "John");
            var numberOfJohnsAfter = newList.People.Count(person => person.Name == "John");
            Debug.Assert(numberOfJohnsAfter == numberOfJohnsBefore - 1);

If I pass in a ref to the collection, which is then replaced by the new collection, I don't have access to the original anymore.
I very much like to have both the before and after situations, so I can compare the two in my assertions.
