
 

Quick performance checklist

Outperforming SQL queries

Using hierarchies effectively

Using partitions for query performance

Using partitions for processing performance

Rigid vs Flexible aggregations, query performance

Avoid Many-to-Many dimensions, query performance

Reference dimensions and their performance

Parent-child dimensions and their performance

MDX Performance Hints

Engine components, single or multi-threaded?

Star vs Snowflake schema for UDM/Cube

Links

 

Quick performance checklist

1-Have a Key Attribute

Each dimension should have a key attribute. Try to use an integer column as the key attribute; it helps query performance and needs less storage. If the key attribute is an integer, make sure the fact table also carries an integer column to join to the dimension table, which improves query performance. Also, avoid composite keys for the key attribute: if the natural key spans multiple columns, add a surrogate key and define it as the key attribute.
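
As an illustration, here is a minimal AMO (Analysis Services Management Objects, C#) sketch of binding an integer surrogate key as the key attribute. The server, database, dimension, table, and column names are placeholders, not from this article.

```csharp
using System.Data.OleDb;            // OleDbType for the key column
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder server
var dim = server.Databases.FindByName("SalesDW")    // placeholder database
                .Dimensions.FindByName("Customer"); // placeholder dimension

// Bind the key attribute to a single integer surrogate key column.
var key = dim.Attributes.FindByName("Customer Key");
key.Usage = AttributeUsage.Key;
key.KeyColumns.Clear();
key.KeyColumns.Add(new DataItem("DimCustomer", "CustomerKey", OleDbType.Integer));

dim.Update();                                       // send the change to the server
server.Disconnect();
```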

 

2-Remove unnecessary attributes

The more dimension attributes you create, the bigger the cube becomes, which means more aggregations and indexes to build and store. Remove attributes that are not used.

 

3-Consider property “AttributeHierarchyOptimizedState”

When this property is set to FullyOptimized, indexes are built for the attribute. For attributes that are queried infrequently, set it to NotOptimized instead; this saves processing time and resources such as disk space (see the sketch after item 4).

 

4-Consider property “AttributeHierarchyEnabled”

You may have attributes that are not needed for analysis. For example, in a customer dimension, attributes such as address and birth date are used as information rather than for analysis. For those attributes you can turn the attribute hierarchy off. Note that if AttributeHierarchyEnabled is set to False, the attribute cannot be used in a user-defined hierarchy.
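
A minimal AMO sketch covering items 3 and 4, assuming the same placeholder server, database, dimension, and attribute names as above:

```csharp
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var dim = server.Databases.FindByName("SalesDW")
                .Dimensions.FindByName("Customer");

// Item 4: an informational attribute nobody slices by.
dim.Attributes.FindByName("Address").AttributeHierarchyEnabled = false;

// Item 3: a rarely queried attribute that does not need indexes.
dim.Attributes.FindByName("Birth Date")
   .AttributeHierarchyOptimizedState = OptimizationType.NotOptimized;

dim.Update();
server.Disconnect();
```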

 

5-Define Hierarchy Relationships

If an attribute is related to another attribute by a one-to-many relationship, you should create a user hierarchy, for example a Geography hierarchy (Country → State/Province → City), as in the sketch below.
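
A minimal AMO sketch of building such a hierarchy; the attribute names are placeholders, and their IDs are assumed to match the names:

```csharp
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var dim = server.Databases.FindByName("SalesDW")
                .Dimensions.FindByName("Customer");

// A user hierarchy with one level per attribute, top-down.
var geo = dim.Hierarchies.Add("Geography");
geo.Levels.Add("Country").SourceAttributeID        = "Country";
geo.Levels.Add("State-Province").SourceAttributeID = "State-Province";
geo.Levels.Add("City").SourceAttributeID           = "City";

dim.Update();
server.Disconnect();
```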

 

6-Selecting the correct Process Type

ProcessFull—Erases all the data and rebuilds the dimension data and structure.

ProcessUpdate—Applies inserts, updates, and deletes from the source without rebuilding the structure.

ProcessAdd—Handles only data insertions (adds new rows).

Note: By selecting the correct process type you can reduce the time to process a cube.
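
For example, a minimal AMO sketch of an incremental dimension refresh, with placeholder server and object names:

```csharp
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var db = server.Databases.FindByName("SalesDW");

// Incremental dimension maintenance: picks up inserts, updates, and
// deletes without rebuilding the whole dimension.
db.Dimensions.FindByName("Customer").Process(ProcessType.ProcessUpdate);

// Full rebuild, only when the structure changed:
// db.Process(ProcessType.ProcessFull);

server.Disconnect();
```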

 

7-Partitioning

Partitions contain the cube data. A small cube may need only one partition; medium and large cubes can be split into groups of data, which is called partitioning.

Unnecessary partitioning can degrade your cube's performance instead of improving it, and too many partitions are difficult to manage.

 

8-Configuring SSAS

Ideally you should have SSAS and the data source on two separate servers; when only one server is available, they compete for resources (memory and CPU). With two servers, you need to make sure there is good network connectivity between them.

 

9-Warming Cache

After processing the cube, and whenever the SSAS service starts up, run typical user queries to warm the cache.
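
A minimal cache-warming sketch using the ADOMD.NET client; the connection string, cube name, and query are placeholders for the queries your users actually run most often:

```csharp
using Microsoft.AnalysisServices.AdomdClient;   // ADOMD.NET

// Placeholder warm-up set; substitute your most common user queries.
string[] warmupQueries =
{
    "SELECT [Measures].[Sales Amount] ON COLUMNS, " +
    "[Date].[Calendar].[Month].Members ON ROWS FROM [Sales]"
};

using (var conn = new AdomdConnection("Data Source=localhost;Catalog=SalesDW"))
{
    conn.Open();
    foreach (string mdx in warmupQueries)
    {
        using var cmd = new AdomdCommand(mdx, conn);
        cmd.ExecuteCellSet();   // running the query populates the caches
    }
}
```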

 

Outperforming SQL queries

On average, MDX queries do outperform equivalent SQL queries.

The cube shines when the MDX queries are not very complex and the result sets are of average size.

SQL shines when queries return huge result sets.

So, keep MDX queries relatively simple, with small result sets, to consistently outperform the equivalent SQL queries.

 

Using hierarchies effectively

Basically, in SSAS, we have two types of hierarchies, attribute hierarchies and user hierarchies.

Attribute hierarchies are the default hierarchies that are created for each dimension attribute to support analysis. For non-parent-child hierarchies, each attribute hierarchy consists of two levels: the attribute itself and the All level.

From a performance perspective, attributes that are exposed only in attribute hierarchies are not automatically considered for aggregation, which means that no aggregations include these attributes.

 

SSAS lets us build two types of user hierarchies: natural and unnatural.

A natural hierarchy follows the chain of many-to-one relationships that “naturally” exists in our data: each attribute is directly related to the attribute on the next level of the hierarchy.

 

In an unnatural hierarchy, at least two consecutive levels have no attribute relationship between them. Typically these hierarchies are used to create drill-down paths of commonly viewed attributes that do not follow any natural hierarchy.

In terms of query performance, natural hierarchies behave very differently from unnatural hierarchies. In natural hierarchies, the hierarchy tree is materialized, and all attributes participating in natural hierarchies are automatically considered as aggregation candidates. So, we should create natural hierarchies wherever possible. To take advantage of them, you must correctly set up cascading attribute relationships for all attributes participating in the hierarchy.
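
A minimal AMO sketch of the cascading attribute relationships for the placeholder Geography attributes used earlier:

```csharp
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var dim = server.Databases.FindByName("SalesDW")
                .Dimensions.FindByName("Customer");

// Cascading many-to-one chain: City -> State-Province -> Country.
// Add() takes the ID of the related attribute one step up the chain.
dim.Attributes.FindByName("City")
   .AttributeRelationships.Add("State-Province");
dim.Attributes.FindByName("State-Province")
   .AttributeRelationships.Add("Country");

dim.Update();
server.Disconnect();
```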

 

Unnatural hierarchies are not materialized, and the attributes participating in them are not automatically considered as aggregation candidates. Rather, they simply provide users with easy-to-use drill-down paths for commonly viewed attributes that do not have natural relationships. At query time, an unnatural hierarchy performs the same as the equivalent cross-join; it simply adds the benefit of reusability and central management.

Parent-child hierarchies are a third case: hierarchies with a variable number of levels built on a recursive relationship between a child attribute and a parent attribute. They have their own performance characteristics and are covered in the “Parent-child dimensions and their performance” section below.

 

Using partitions for query performance

You can use multiple partitions to break up your measure group into separate physical components. Each partition can be queried separately, and the Storage Engine can scan only the partition(s) that hold the relevant data.

You can partition data to relate to common queries. A typical example for partitions is by “Time” such as day, month, quarter, year.

Many queries have a time element, so partitioning by time often helps query performance.

In general, the number of records per partition should not be more than 20 million, and the size of a partition should not be more than 250 MB.

While having multiple partitions is generally beneficial, having too many partitions, e.g., greater than a few hundred, can also affect performance negatively.

If you have partitions that are less than 50 MB or 2 million records, consider consolidating them into one partition. Also, it is generally not good practice to create a partition with fewer than 4,096 records.

 

Please note that each partition can use a different aggregation design. By taking advantage of this flexibility, you can identify the data sets that need a heavier aggregation design.

While this flexibility can help performance, too many aggregation designs add processing overhead and extra I/O.
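
A minimal AMO sketch of adding a yearly partition bound to a source query and reusing an existing aggregation design; all server, database, cube, measure group, table, and column names are placeholders:

```csharp
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var db = server.Databases.FindByName("SalesDW");
var mg = db.Cubes.FindByName("Sales").MeasureGroups.FindByName("Internet Sales");

var p = mg.Partitions.Add("Internet Sales 2012");
p.StorageMode = StorageMode.Molap;
p.Source = new QueryBinding(db.DataSources[0].ID,
    "SELECT * FROM FactInternetSales " +
    "WHERE OrderDateKey BETWEEN 20120101 AND 20121231");   // one year per partition

// Optionally reuse an existing aggregation design for this partition.
if (mg.AggregationDesigns.Count > 0)
    p.AggregationDesignID = mg.AggregationDesigns[0].ID;

mg.Update(UpdateOptions.ExpandFull);                // deploy the new partition
server.Disconnect();
```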

 

Using partitions for processing performance

Basically, using multiple partitions helps processing performance by letting you process smaller data components of a measure group in parallel.
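
A sketch of one common AMO pattern for this: capture the process commands and execute them as one transactional, parallel batch (placeholder names again):

```csharp
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var mg = server.Databases.FindByName("SalesDW")
               .Cubes.FindByName("Sales")
               .MeasureGroups.FindByName("Internet Sales");

server.CaptureXml = true;           // queue the XMLA instead of executing it
foreach (Partition p in mg.Partitions)
    p.Process(ProcessType.ProcessFull);
server.CaptureXml = false;

// Run the queued commands as one transactional, parallel batch.
server.ExecuteCaptureLog(true, true);
server.Disconnect();
```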

 

Rigid vs Flexible aggregations, query performance

An aggregation is rigid when all of the attributes participating in it have rigid direct or indirect relationships to the granularity attribute of the measure group.

An aggregation is flexible when one or more of the participating attributes have flexible direct or indirect relationships to the granularity attribute. Note that if any participating relationship is flexible, the whole aggregation is flexible.

Please note that if you perform a ProcessUpdate on a dimension that has flexible aggregations, then whenever deletions or updates are detected for a given attribute, the aggregations for that attribute, and for any related attributes in the attribute chain, are automatically dropped (a query performance impact).

 

The aggregations are not automatically recreated unless you perform one of the following tasks:

1-ProcessFull on the cube, measure group, or partition

2-ProcessIndexes on the cube, measure group, or partition

3-Configure Lazy Processing for the cube, measure group, or partition. Note: If you configure Lazy Processing, the dropped aggregations are recalculated as a background task.

4-Process affected objects.

If you do not follow one of the above techniques, queries will run without those aggregations and query performance will suffer.

 

This is very important because, by default, every aggregation is flexible, since every attribute relationship type defaults to Flexible.
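
For relationships that never change (for example, a month always belongs to the same year), marking the relationship Rigid keeps its aggregations intact through a ProcessUpdate. A minimal AMO sketch with placeholder names:

```csharp
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var dim = server.Databases.FindByName("SalesDW")
                .Dimensions.FindByName("Date");

// A month never moves to a different year, so the relationship is
// safe to mark Rigid; its aggregations survive a ProcessUpdate.
var rel = dim.Attributes.FindByName("Month").AttributeRelationships.Add("Year");
rel.RelationshipType = RelationshipType.Rigid;

dim.Update();
server.Disconnect();
```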

 

Avoid Many-to-Many dimensions, query performance

A many-to-many dimension can cause serious query performance issues with a large intermediate fact table, because many-to-many relationships are not pre-aggregated across dimensions (not materialized).

With many-to-many dimensions, during processing, the data and intermediate measure groups are processed independently. Fact data and aggregations for the data measure group do not include any attributes from the many-to-many dimension. When you query the data measure group by the many-to-many dimension, a run-time “join” is performed between the two measure groups using the granularity attributes of each dimension that the measure groups have in common.

From a performance perspective, this run-time join has the greatest impact on query performance. Specifically, if the intermediate measure group is larger than the data measure group, or if the many-to-many dimension is large in general (roughly a few hundred thousand members), you can experience query performance issues due to the amount of data that needs to be joined at run time. So, while many-to-many relationships are very powerful, to avoid query performance issues you should generally limit their use to smaller intermediate measure groups and dimensions.

 

There are some workarounds, for example:

http://blogs.solidq.com/EN/Erik/Lists/Posts/Post.aspx?ID=1

http://blogs.solidq.com/EN/Erik/Lists/Posts/Post.aspx?ID=3

 

Reference dimensions and their performance

During processing, the reference dimension is processed independently (same as regular dimensions).

With a materialized reference relationship, if any row in the measure group does not join to the reference dimension, the record is removed from the partition.

Please note that this is different behavior from an unmaterialized reference relationship, where missing members are assigned to the unknown member.

To improve the query performance of reference relationships, you can choose to materialize them; note that by default, reference relationships are not materialized. When a reference relationship is materialized, the join across dimensions is performed during processing rather than at query time. Also, the attributes in materialized reference dimensions follow the aggregation rules of standard dimensions.

A reference dimension can participate in aggregations only if it is materialized, and those aggregations will not take into account any hierarchies spanning dimensions, since the reference dimension is analyzed separately from the other dimensions.
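
A minimal AMO sketch of materializing a reference relationship. The names are placeholders, and this assumes the measure group dimension indexer takes the cube dimension ID and that Regular corresponds to the materialized setting:

```csharp
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var mg = server.Databases.FindByName("SalesDW")
               .Cubes.FindByName("Sales")
               .MeasureGroups.FindByName("Internet Sales");

// The indexer takes the cube dimension ID of the reference dimension.
var geo = (ReferenceMeasureGroupDimension)mg.Dimensions["Geography"];
geo.Materialization = ReferenceDimensionMaterialization.Regular; // Regular = materialized

mg.Update();
server.Disconnect();
```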

 

So, consider the following guidelines:

If your dimension is frequently queried and can benefit from natural hierarchies and aggregations, you should combine the attributes from the related dimension tables into your normal dimension.

If the dimension is not frequently queried and is only used for one-off analysis, you can use unmaterialized reference relationships to expose the dimension for browsing without the overhead of creating aggregations for attributes that are not commonly queried. If the intermediate dimensions are large, you can materialize the reference relationship to optimize query performance.

 

Parent-child dimensions and their performance

Parent-child hierarchies are hierarchies with a variable number of levels.

Basically they have a recursive relationship between a child attribute and a parent attribute. In parent-child hierarchies, aggregations are created only for the key attribute and the top attribute.

So, avoid using parent-child hierarchies that contain large numbers of members at intermediate levels of the hierarchy.

A good example of a parent-child hierarchy is an Employee hierarchy (managers and employees). In this example, the dimension table/view has two columns that describe the level relationship: the child key and the parent key. This is done with a primary key and foreign key relationship within a single table/view. Parent-child dimensions do not have attribute relationships.

The number of levels in such a hierarchy depends on the data you have. The hierarchy of a parent-child dimension (for example, Employee) does not show up in AMO: it is not created by the user but generated by the server at processing time.
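
What you do define in AMO is the parent attribute itself; a minimal sketch with placeholder names:

```csharp
using System.Data.OleDb;            // OleDbType
using Microsoft.AnalysisServices;   // AMO

var server = new Server();
server.Connect("Data Source=localhost");            // placeholder names throughout
var dim = server.Databases.FindByName("SalesDW")
                .Dimensions.FindByName("Employee");

// Usage = Parent is what makes the dimension recursive: the parent
// attribute's key column points at the parent key in the same table.
var parent = dim.Attributes.FindByName("Manager");
parent.Usage = AttributeUsage.Parent;
parent.KeyColumns.Clear();
parent.KeyColumns.Add(new DataItem("DimEmployee", "ParentEmployeeKey", OleDbType.Integer));

dim.Update();
server.Disconnect();
```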

 

MDX Performance Hints

I found this very useful.

 

Engine components, single or multi-threaded?

“Formula Engine is single-threaded but the Storage Engine is multi-threaded.” 

So this means:

If calculations are waiting on storage engine queries, we will see lots of CPU usage across cores (a multi-processor machine helps here).

If a calculation is evaluating formula engine expressions, we will see just one CPU working and pegged (a multi-processor machine does NOT help here).

Unfortunately, in complex cubes that have heavy and complex calculations, this is a big issue.

 

Star vs Snowflake schema for UDM/Cube

This topic is very controversial, and I have done it both ways. A little background first: a star schema is the simplest style of DW schema; it contains fact tables and the dimension tables they reference. In a snowflake schema, dimension tables are normalized into multiple related tables. In my testing, in most cases, when the dimension tables are huge, a snowflake schema performs faster for processing the dimensions.

This better dimension-processing performance comes from joining smaller tables. During dimension processing, each attribute runs a SELECT DISTINCT against its source dimension table. With a large star schema, the small attributes have to pay for the lowest level of their related hierarchy; with a snowflake schema, each attribute pays only for its own related table.

For smaller dimensions this doesn't matter much; however, if you have large dimensions that are updated quite frequently, a snowflake schema can make a huge difference.

There is a more detailed discussion of this here.

Also, Ralph Kimball has a good article here that endorses (“not only acceptable, but are the keys to a successful design”) using a snowflake design for large dimensions like “Customer” and “Product”.

For the UDM/Cube guys, a snowflake schema is easier to maintain and increases flexibility. Also, the Dimension Wizard can detect natural hierarchies from a snowflake schema, whereas it cannot from a star schema.

However, snowflake schemas are harder for the ETL guys to populate.

Regardless of star or snowflake schema, we need to make sure the tables are indexed correctly. Here is a good read to follow.

 

Links

Analysis Services Performance Guide, 2008 white paper

MDX Query Performance Bottlenecks, Best Practices

Distinct Count Optimization