2024-04-28

Use more model!

Almost all of my optimization work in Power BI and Tabular seems to come down to this one idea. I cannot remember the last time I spent significant time on a problem that did not end up with much less DAX and more tables, columns, relationships, or a combination of the three. I hope to soon write more about modeling and some of the more interesting patterns I have been working on lately. For now, a fun little diversion on string searching. In this post, I explore a model to optimize substring search in a text column with a lookup table and an N:N bridge. It turns out that this is significantly faster than filtering on a substring search (e.g., with CONTAINSSTRING).

Why: two reasons

The inspiration was simple. There was a link to The Italians' excellent article on string searching in a conversation about optimizing substring search. That article is worth a read or two; go check it out. But I found myself wondering if we could improve performance by building our own index that could take advantage of the ridiculous optimization of VertiPaq relationships. I call this an index, because it is a data structure that represents data in a field we care about, and we build it specifically to optimize a lookup.

Why I spent time on something that I believe I will never recommend is slightly more involved.

It is just plain fun.
I believed I could build a self-contained solution pretty quickly.
I find a lot of value in working through problems and ideas. I think people should be more willing to mock things up like this just to figure something out.

The situation

Filter a column with many unique strings to only those strings that contain a specific substring. There are two naive approaches that work well and are probably more than sufficient for anyone's use cases:

Use a 'Contains' filter in the filter pane
DAX measure with CONTAINSSTRING or another related function. These are covered thoroughly in the linked post above from The Italians, but a simple version might be:
```
 CALCULATE ( [Measure], CONTAINSSTRING ( 'Table'[String], "substring" ) )
```

As The Italians describe (and as you can test easily for yourself (and as we will see in the sample solution)), most of the time is spent in the Formula Engine. Formula Engine queries do not cache, and tend to be much slower than Storage Engine queries. The simplest heuristic for keeping work in the storage engine is to stick to basic aggregations on physical columns, tables, and relationships.

The solution

I sketched out a model as described below. (Side note, this is a specification template I have been working on to talk about models in a way that is friendly to source control, easy to read, rapid to modify, and does not require a working data model. Let me know if you have any thoughts or questions on this shorthand.)

String

Field	Type	Note
String	string	A unique string (standing in for a name or categorical description in some large dimension)
StringKey	int	PK

This table is a minimal stand-in for some table that would have a lot of strings, likely a large dimension. I used the system word list from Ubuntu, with ~100K words in it as a source.

I welcome any readers to try this on a different set of strings; the sample model has some easily reusable components. If you want to try this on an existing model, you should break out a dimension similar to this one, and you can hang that off the table with strings. It will be easier to have a table of unique strings that you then tie to your existing table via 'String'[StringKey]. You could also use the existing PK in a dimension and bridge from 'Substring' directly to your 'Dimension'.

Substring

Field	Type	Note
Substring	string	A unique substring that appears in one of the strings in 'String'
SubstringKey	int	PK

This is generated in M by the function TextSubstrings, a function I wrote to extract substrings from a string.

BridgeStringSub

Field	Type	Note
StringKey	FK
SubstringKey	FK

Standard N:N bridge.

ERD

Relationships set up as below

'Substring'[SubstringKey] -1:N-> 'BridgeStringSub'[SubstringKey]
'BridgeStringSub'[StringKey] <-N:1-> 'String'[StringKey]

Additional notes

TextSubstrings is a good example of lower level string processing in M. We're also following a pretty inefficient pattern of deriving surrogate keys in M and then joining on natural keys to build a table with the correct FKs. This key-building pattern works, but can fall over with larger tables. M performance and ETL optimization were not the point of this exercise.

Performance and conclusion

There is a sample PBIX that has all source code and some examples set up. You can play with some slicers to see the crossfiltering. These are clunky, but they perform very well. There are also some sample queries saved in the Queries pane. I encourage you to run those queries (and variations on them) in DAX Studio to check out the timings.

Here's what I saw.

For relationships only:

EVALUATE
CALCULATETABLE ( 'String', 'Substring'[Substring] = "caterp" )

//Timings
// Cold Cache: Total: 10ms; SE CPU 16ms; FE:  2ms; SE: 8ms
// Warm Cache: Total:  4ms; SE CPU  0ms; FE:  1ms; SE: 3ms

Using string search;

EVALUATE
CALCULATETABLE ( 'String', CONTAINSSTRING ( 'String'[String], "caterp" ) )

//Timings
// Cold Cache: Total: 80ms; SE CPU  0ms; FE: 74ms; SE: 6ms
// Warm Cache: Total: 67ms; SE CPU  0ms; FE: 66ms; SE: 1ms

Note the caching is technically better with the substring search, but there is so much more FE that it doesn't matter. The important thing is that we shifted our logic to take advantage of the hyper-optimized relationship code path in VertiPaq. This approach trades up-front computation and storage space for better query-time performance. This is a tradeoff that may not always make sense.

It seems like almost everything I have had to spend time on this year had a component of adding columns and relationships to solve a problem. This is one problem that seemed amenable to it.

If you can frame your problem in terms of tables and relationships, you will have an easy time in Tabular and Power BI. Framing it this way in the first place can be difficult.

As always, feel free to get in touch.