A fast alternative to Linq's Intersect.Any() expression result
A fast alternative to Linq’s Intersect.Any() expression result
While looking at the profiling data of one of our production systems, I stumbled across this interesting block in the CPU flame graph.
Let’s call the obscured method GetData
. Over 80 % of the time in GetData
is spent in the LINQ expression where the Any
extension method is called on the result of Intersect
extension method. This GetData
method was obviously in hot path of the application, getting executed multiple times per incoming request.
The goal of using this LINQ expression in our GetData
method was to find whether there was a common element present between two string collections. Something like below.
The profiler also tells us that this method took ~5 % of total CPU of our service process.
Looking at the source code of Intersect
method, we can see it does two things.
- Creates a Hash set and add items from first collection to it.
- Tries to remove each items of second collection from this hash set as we iterate through the result of this method. The Remove method returns
True
when it successfully removes an item (a match has been found) and return those removed items(the common items).
The Intersect
method returns an Enumerable collection(deferred execution). So If the Any
method returns True
, it won’t execute the code for remaining elements in second collection.
In my case, All I care about is finding out whether we have a common element present between these 2 collections. We could do this just with a nested for loop without allocating the HashSet. We could also eliminate the hash generation during the Remove
method call. I wrote an extension method like below which uses 2 simple for loops.
This may look like a brute force solution. Would it do better compared to the LINQ expression? Only way to tell that is by measuring both implementations. So I wrote benchmarking code for these 2 implementations with 2 types of source collections, Arrays and Lists. Our extension method uses a for loop on an array. So If the collection passed in is not an array, it will call the AsArray
extension method to convert that collection to an array.
The benchmark results
The benchmark results clearly shows:
- When both the source collections are Arrays, the custom implementation is 7 times faster than the LINQ expression. Also it allocates 0 bytes (no hash set allocation in managed heap. The for loop runs on the stack)
- When both the source collections are Lists, the custom implementation is 4.5 times faster than the LINQ expression. There is allocation where our
AsArray
method has to convert the collection to an Array. But still a lot smaller than what the LINQ expression allocated.
And finally, after deploying the fix to prod, I went back and checked the prod profiler data again.
The GetData
method now consumes 1.6 % CPU where it used to consume around 5 %.
Key takeaways are,
- LINQ expression is concise, but may not be optimal if that is being used in a hot path. Many times, you can write a faster implementation which handles your specific use case
- Measure, Fix, Measure again, Repeat.
Cheers