2024-04-28
Use more model!
Almost all of my optimization work in Power BI and Tabular seems to come down to this one idea.
I cannot remember the last time I spent significant time on a problem that did not end up with much less DAX and more tables, columns, relationships, or a combination of the three.
I hope to soon write more about modeling and some of the more interesting patterns I have been working on lately.
For now, a fun little diversion on string searching.
In this post, I explore a model to optimize substring search in a text column with a lookup table and an N:N bridge.
It turns out that this is significantly faster than filtering on a substring search (e.g., with CONTAINSSTRING
).
Why: two reasons
The inspiration was simple. There was a link to The Italians' excellent article on string searching in a conversation about optimizing substring search. That article is worth a read or two; go check it out. But I found myself wondering if we could improve performance by building our own index that could take advantage of the ridiculous optimization of VertiPaq relationships. I call this an index, because it is a data structure that represents data in a field we care about, and we build it specifically to optimize a lookup.
Why I spent time on something that I believe I will never recommend is slightly more involved.
- It is just plain fun.
- I believed I could build a self-contained solution pretty quickly.
- I find a lot of value in working through problems and ideas. I think people should be more willing to mock things up like this just to figure something out.
The situation
Filter a column with many unique strings to only those strings that contain a specific substring. There are two naive approaches that work well and are probably more than sufficient for anyone's use cases:
Use a 'Contains' filter in the filter pane
DAX measure with
CONTAINSSTRING
or another related function. These are covered thoroughly in the linked post above from The Italians, but a simple version might be:CALCULATE ( [Measure], CONTAINSSTRING ( 'Table'[String], "substring" ) )
As The Italians describe (and as you can test easily for yourself (and as we will see in the sample solution)), most of the time is spent in the Formula Engine. Formula Engine queries do not cache, and tend to be much slower than Storage Engine queries. The simplest heuristic for keeping work in the storage engine is to stick to basic aggregations on physical columns, tables, and relationships.
The solution
I sketched out a model as described below. (Side note, this is a specification template I have been working on to talk about models in a way that is friendly to source control, easy to read, rapid to modify, and does not require a working data model. Let me know if you have any thoughts or questions on this shorthand.)
String
Field | Type | Note |
---|---|---|
String | string | A unique string (standing in for a name or categorical description in some large dimension) |
StringKey | int | PK |
This table is a minimal stand-in for some table that would have a lot of strings, likely a large dimension. I used the system word list from Ubuntu, with ~100K words in it as a source.
I welcome any readers to try this on a different set of strings; the sample model has some easily reusable components.
If you want to try this on an existing model, you should break out a dimension similar to this one, and you can hang that off the table with strings.
It will be easier to have a table of unique strings that you then tie to your existing table via 'String'[StringKey]
.
You could also use the existing PK in a dimension and bridge from 'Substring'
directly to your 'Dimension'
.
Substring
Field | Type | Note |
---|---|---|
Substring | string | A unique substring that appears in one of the strings in 'String' |
SubstringKey | int | PK |
This is generated in M by the function TextSubstrings
, a function I wrote to extract substrings from a string.
BridgeStringSub
Field | Type | Note |
---|---|---|
StringKey | FK | |
SubstringKey | FK |
Standard N:N bridge.
ERD
Relationships set up as below
'Substring'[SubstringKey] -1:N-> 'BridgeStringSub'[SubstringKey]
'BridgeStringSub'[StringKey] <-N:1-> 'String'[StringKey]
Additional notes
TextSubstrings
is a good example of lower level string processing in M.
We're also following a pretty inefficient pattern of deriving surrogate keys in M and then joining on natural keys to build a table with the correct FKs.
This key-building pattern works, but can fall over with larger tables.
M performance and ETL optimization were not the point of this exercise.
Performance and conclusion
There is a sample PBIX that has all source code and some examples set up. You can play with some slicers to see the crossfiltering. These are clunky, but they perform very well. There are also some sample queries saved in the Queries pane. I encourage you to run those queries (and variations on them) in DAX Studio to check out the timings.
Here's what I saw.
For relationships only:
EVALUATE
CALCULATETABLE ( 'String', 'Substring'[Substring] = "caterp" )
//Timings
// Cold Cache: Total: 10ms; SE CPU 16ms; FE: 2ms; SE: 8ms
// Warm Cache: Total: 4ms; SE CPU 0ms; FE: 1ms; SE: 3ms
Using string search;
EVALUATE
CALCULATETABLE ( 'String', CONTAINSSTRING ( 'String'[String], "caterp" ) )
//Timings
// Cold Cache: Total: 80ms; SE CPU 0ms; FE: 74ms; SE: 6ms
// Warm Cache: Total: 67ms; SE CPU 0ms; FE: 66ms; SE: 1ms
Note the caching is technically better with the substring search, but there is so much more FE that it doesn't matter. The important thing is that we shifted our logic to take advantage of the hyper-optimized relationship code path in VertiPaq. This approach trades up-front computation and storage space for better query-time performance. This is a tradeoff that may not always make sense.
It seems like almost everything I have had to spend time on this year had a component of adding columns and relationships to solve a problem. This is one problem that seemed amenable to it.
If you can frame your problem in terms of tables and relationships, you will have an easy time in Tabular and Power BI. Framing it this way in the first place can be difficult.
As always, feel free to get in touch.