Is your data dirty?

A credible cloud billing solution must be able to provide insightful breakdowns of charges. In addition to per-customer reports it should be possible to ‘slice and dice’ data at deeper levels such that the total spend can be dynamically visualised using arbitrary combinations of criteria such as business unit, region, project and resource group to name only a few.

Ultimately this process is completely dependent on the initial data imported into the system. The phrase “garbage in, garbage out” is particularly apposite when generating financial reports. However with the right tools, knowledge and techniques it may be surprising what can be accomplished with data that would otherwise be difficult or impossible to report against.

Here at Exivity we use the term dirty data to describe information that is not well-formed, which is inconsistent or has holes in it. Typically, dirty data is interleaved within well-formed data and as such needs to be identified and dealt with as it occurs.

Usage, service, rate and organisational data

Useful and accurate reports require answers to the following questions for each record in the usage data. Each depends on correlating against a different category of data as shown below:

  1. How much of each Cloud resource was consumed? – usage data
  2. What services are associated with that consumption? – service catalogue data
  3. How should the charges be calculated? – rate card data
  4. Against which billable entity should the charge be levered? – organisational data

There is no rule that states that all the data must be obtained from one source. Multiple APIs may be involved and a single report may encompass public and private clouds from different vendors. It is not uncommon to retrieve usage data from the cloud API and to retrieve organisational and rate data from alternative endpoints within the same API, via an internal CMDB or a combination of the two for example.

Additionally it may be a requirement that the rates obtained from a public cloud API are overridden such that downstream customers can be provided with different offerings. This process may also involve the creation of custom services which supplement or altogether replace those that come from the public cloud.

Missing and incomplete data

Usage and rate data obtained from a cloud API contains provider-generated information that is usually consistent and includes fields such as subscription IDs, service IDs and metadata pertaining to locations and service groups. However it is possible that some records do not contain values for all of these fields.

When preparing information for reporting purposes it is therefore important to be able to detect missing values and populate them with defaults or more meaningful content derived via the process of using a reliable field (such as a subscription ID) as a key to retrieve the appropriate value from another data source such as a CMDB.

This process produces a dataset with values in all the key fields. Although that value may be a default value in the worst case, it is still possible to report against it by simply grouping all records with similar defaults together and applying a rate-plan to them.

This technique results in a report containing information and costs associated with records that would otherwise have been useless. Despite not mapping these charges to a specific customer, applying a default rate plan to them at least identifies areas of remedy whilst providing a good overview of the costs and charges associated with the incomplete data.

If the report breaks down charges by multiple criteria then by placing default or lookup-result values in more granular positions such as a business unit or project, the correct billable entity will still be invoiced at a higher level albeit with charge buckets for otherwise unknown subdivisions of that entity.


Missing and incomplete data: a worked example

Consider the following hypothetical cloud spend:

In the above diagram the charges are broken down first by Region and then by Department. Note that for a significant subset of the total spend the region is indeterminable due to missing or incomplete information.

Although the department is also unknown in places this is less of an issue insofar as related charges can be associated with the correct region.

For the purposes of illustration let us assume that there is a column in the data identifying the owner of any particular resource. This column contains an e-mail address, provided when the service was originally instantiated. The e-mail address belongs to a domain based on the region to which the service belongs as follows:

  • UK:
  • NL:
  • US:

Armed with this information we can now attempt to resolve unknown regions by analysing the email addresses and updating the data accordingly.

A business rule is therefore inserted into the billing system which enacts the following logic:

  • If the region is blank and the owner is not blank:
    • Extract the domain from the value in the owner column
    • Based on the that domain, update the region

The owner column may not be populated in all cases but for those where it is we can now assign the service to whichever region is associated with the owner:

In this example we have assigned 50% of the unknown spend at the region level to a department, using a simple business rule and without having to update any of the original source data.

With further analysis it is common for further relationships in the data to be identified through which the unknown departments can be similarly resolved. Alternatively a lookup table may be manually generated as a short term measure so that that all unknowns can be handled and a completely accurate report generated. This lookup table would then be phased as the source data is corrected over time.

Inconsistent data

Many organisations use naming conventions to encode metadata into fields such as hostnames, datastore names and resource IDs. This metadata will be highly specific to the organisation and may include information about storage classes, custom tags, VM purposes, service owners, downstream customer identifiers and so on.

Over time these conventions can change and this, combined with good old-fashioned human error, means that some of this encoded metadata will be incorrect. There may also be typing mistakes, inconsistent use of capitalisation or omissions of data altogether.

Many organisations are unaware of the true levels of inaccuracy in their metadata and once identified it may be difficult to resolve all the issues in the short term due to operational dependencies on the incorrect fields.

The quality of this metadata can be improved through using lookups to map the incorrect field names to corrected versions within the cloud billing solution itself. Thus the reports may show more coherent information without requiring that all inconsistencies in the original data are immediately fixed at the source level.

As well as lookups a cloud billing solution should offer generic data manipulation functions such as case normalisation, the extraction of identifier fields within longer strings, range checking and the ability to create new fields based on combining existing fields to form compound identifiers. These are particularly useful when selectively applied based on filters that determine what to modify and what to leave alone.

In short, the ability to intelligently process the data using business rules agreed with, and specific to, your organisation enables a wide range of inconsistencies to be catered for.

Embedded data

The process of extracting data consists of querying an API or other data source, receiving responses in JSON, XML or other formats and then parsing this returned data into tabular format for subsequent transformation.

Each field derived through this process is represented as a cell value in the tabular output and such values are typically self-contained in nature. That said, some APIs may include fields that require further examination.

The JSON fragment below shows part of the response from a Microsoft Azure Stack API query. Note that the value of the instanceData field is actually an entire JSON document in its own right.

This raises the following two considerations:

  1. The embedded data is likely to contain sequences of characters or separators that if treated literally would violate the constraints of the formal tabular format being constructed during the extraction process
  2. The embedded data may require subsequent parsing in order to extract meaningful information from it

An effective ETL process should handle both of these scenarios gracefully (as well as those where embedded data is nested to arbitrary depths), recognising when data is embedded in this manner and implementing the ability to interrogate it if required.

By leveraging our USE (Unified Scriptable Extractor) technology, Exivity is able to handle the above embedded data by first extracting the literal value of the instanceData field and then re-parsing that extracted value such that the resourceUri1, location, tags and additionalInfo values within it can be extracted and added to the tabular output from the collection process. Our out-of-the-box Azure Stack USE template does exactly this.

About Exivity

Exivity is an innovative cloud billing solution that covers all of the above situations and more. Our powerful extraction, parsing, processing and reporting capabilities offer huge flexibility when handling data extracted from any number of public, private or hybrid sources even if some of that data is dirty.

We continually draw on our experiences in the field, implementing support for additional capabilities as different challenges arise. Our team collectively have decades of experience in cloud billing and have successfully delivered against complex requirements for a number of leading businesses throughout Europe and the US.

Tags: ,

Delivering Billing & Reporting Solutions for Hybrid IT Environments. On-Premise and in the Cloud.


Contact Us

Rigakade 10
1013 BC
The Netherlands