Data Enrichment
Data enrichment is one of the key processes by which you can add more value to your data. It refines, improves and enhances your data set with the addition of new attributes. For example, using an address post code/ZIP field, you can take simple address data and enrich it by adding socio-economic demographic data, such as average income, household size and population attributes. By enriching data in this way, you gain a better understanding of your customer base and potential target customers.
Enrichment techniques
There are 6 common tasks involved in data enrichment:
- Appending Data
- Segmentation
- Derived Attributes
- Imputation
- Entity Extraction
- Categorization
1. Appending Data
By appending data to your data set, you bring multiple data sources together to create a more holistic, accurate and consistent data set than any one source could produce on its own. For example, extracting customer data from your CRM, financial and marketing systems and bringing it together will give you a better overall picture of your customer than any single system.
Appending data as an enrichment technique also includes sourcing third-party data, such as demographic or geometry data by post code/ZIP, and merging it into your data set. Other useful examples include exchange rates, weather data, date/time hierarchies, and traffic. Enriching location data is one of the most common techniques, as this data is readily available for most countries.
Read more: Yellowfin geography packs
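The sketch below shows the basic join behind appending data. It assumes a hypothetical customer table and a hypothetical third-party demographics lookup keyed by post code; the column names and values are illustrative only.

```python
# A minimal sketch of appending third-party data, assuming a customer table
# and a demographics lookup keyed by post code (all names are hypothetical).
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "post_code": ["3000", "3141", "2000"],
    "total_spend": [1200.0, 450.0, 980.0],
})

# Third-party demographic data keyed by post code
demographics = pd.DataFrame({
    "post_code": ["3000", "3141", "2000"],
    "avg_income": [72000, 95000, 88000],
    "avg_household_size": [1.8, 2.4, 2.1],
})

# A left join keeps every customer, even if a post code has no demographic match
enriched = customers.merge(demographics, on="post_code", how="left")
print(enriched)
```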
2. Data Segmentation
Data segmentation is a process by which you divide a data object (such as a customer, product, or location) into groups based on a common set of pre-defined variables (such as age, gender, or income for customers). This segmentation is then used as a way to better categorize and describe the entity.
Common segmentation examples for customers include:
- Demographic Segmentation – based on gender, age, occupation, marital status, income, etc.
- Geographic segmentation – based on country, state, or city of residence. Local businesses may even segment by specific towns or counties.
- Technographic segmentation – based on preferred technologies, software, and mobile devices.
- Psychographic segmentation – based on personal attitudes, values, interests, or personality traits.
- Behavioral Segmentation – based on actions or inactions, spending/consumption habits, feature use, session frequency, browsing history, average order value, etc.
These can lead to groups of customers such as 'Trend Setters' or 'Tree Changers'.
By creating calculated fields, either in an ETL process or within a metadata layer, you can create your own segmentation based on the data attributes you have.
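As a rough illustration of such a calculated segmentation field, the sketch below assumes hypothetical 'age' and 'annual_income' columns; the band edges and segment names are illustrative assumptions, and your own rules would come from your business definitions.

```python
# A minimal sketch of a rules-driven segmentation field (hypothetical columns,
# illustrative band edges and segment names).
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [24, 37, 52, 68],
    "annual_income": [38000, 72000, 95000, 41000],
})

# Demographic segmentation: age bands as a calculated field
customers["age_band"] = pd.cut(
    customers["age"],
    bins=[0, 25, 40, 60, 120],
    labels=["<26", "26-40", "41-60", "60+"],
)

# A simple rules-driven segment label combining age and income
def segment(row):
    if row["age"] < 40 and row["annual_income"] > 70000:
        return "Young Professional"
    if row["age"] >= 60:
        return "Retiree"
    return "Mainstream"

customers["segment"] = customers.apply(segment, axis=1)
print(customers)
```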
3. Derived Attributes
Derived attributes are fields that are not stored in the original data set but can be derived from one or more fields. For example, 'Age' is rarely stored, but you can derive it from a 'date of birth' field. Derived attributes are very useful because they often contain logic that is used repeatedly for analysis. By creating them within an ETL process or at the metadata layer, you reduce the time it takes to create new analysis and ensure consistency and accuracy in the measures being used.
Common examples of derived attributes include:
- Counter Field – based on a unique id within the data set. This allows for easy aggregations.
- Date Time Conversions – using a date field to extract the day of week, month of year, quarter, etc.
- Time Between – by using two date-time fields you can calculate the period elapsed, such as response times for tickets.
- Dimensional Counts – by counting values within a field you can create new counter fields for a specific area, such as counts of narcotics offences, weapons offences, or petty crime. This allows for easier comparative analysis at the report level.
- Higher-order Classifications – product category from product, age band from age.
Advanced derived attributes can be the output of data science models run against your data. For example, customer churn risk or propensity to spend can be modelled and scored for each customer.
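The sketch below shows a few of the derived attributes listed above, assuming hypothetical 'date_of_birth', 'order_date' and 'ship_date' columns; in practice the same logic would live in your ETL process or metadata layer.

```python
# A minimal sketch of common derived attributes (hypothetical column names).
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102],
    "date_of_birth": pd.to_datetime(["1985-06-01", "1992-11-23"]),
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-04"]),
    "ship_date": pd.to_datetime(["2024-03-03", "2024-03-09"]),
})

today = pd.Timestamp("2024-06-30")

# Age derived from date of birth
orders["age"] = ((today - orders["date_of_birth"]).dt.days // 365).astype(int)

# Date/time conversions: day of week and quarter from the order date
orders["order_day_of_week"] = orders["order_date"].dt.day_name()
orders["order_quarter"] = orders["order_date"].dt.quarter

# Time between: days elapsed between order and shipment
orders["days_to_ship"] = (orders["ship_date"] - orders["order_date"]).dt.days

# Counter field: makes aggregations such as "count of orders" trivial
orders["order_counter"] = 1
print(orders)
```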
4. Data Imputation
Data imputation is the process of replacing values for missing or inconsistent data within fields.
Rather than treating the missing value as a zero, which would skew aggregations, the estimated value helps to facilitate a more accurate analysis of your data.
For example: If the value for an order was missing, you could estimate the value based on previous orders by that customer, or for that bundle of goods.
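A minimal sketch of this kind of imputation is shown below, assuming a hypothetical 'order_value' column where missing values are estimated from that customer's previous orders.

```python
# A minimal imputation sketch: missing order values are replaced with the
# customer's mean order value rather than being treated as zero.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_value": [100.0, 120.0, None, 80.0, None],
})

orders["order_value"] = orders.groupby("customer_id")["order_value"] \
    .transform(lambda s: s.fillna(s.mean()))
print(orders)
```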
5. Entity extraction
Entity extraction is the process of taking unstructured data or semi-structured data and extracting meaningful structured data from that element.
When entity extraction is applied, you are able to identify entities (people, places, organizations and concepts), numerical expressions (currency amounts, phone numbers) and temporal expressions (dates, times, durations, frequencies).
As a simple example, by parsing the data you could extract a person's name from an email address, or identify the web domain of the organization they belong to. Similarly, you could split an envelope-style address into discrete data elements: building name, unit, house number, street, postal code, city, state/province and country.
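As a rough sketch of this kind of parsing-based extraction, the example below assumes email addresses follow a 'first.last@company.com' pattern, which of course will not hold for every address.

```python
# A minimal entity-extraction sketch: parse a name and organisation domain
# out of an email address (assumes a 'first.last@company.com' pattern).
import re

def extract_from_email(email: str) -> dict:
    local, _, domain = email.partition("@")
    parts = re.split(r"[._-]", local)
    return {
        "first_name": parts[0].title() if parts else "",
        "last_name": parts[-1].title() if len(parts) > 1 else "",
        "organisation_domain": domain,
    }

print(extract_from_email("jane.doe@examplecorp.com"))
# {'first_name': 'Jane', 'last_name': 'Doe', 'organisation_domain': 'examplecorp.com'}
```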
6. Data Categorization
Data categorization is the process of labelling unstructured data so that it becomes structured and able to be analyzed. This falls into two distinct categories:
- Sentiment analysis – the extraction of feelings and emotions from text. For example, was the customer feedback frustrated or delighted, positive or neutral?
- Topication – determining the 'topic' of the text. Was it about politics, sport, or house prices?
Both of these techniques enable you to analyze unstructured text to get a better understanding of that data.
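A deliberately simple sketch of sentiment categorization using a tiny keyword lexicon is shown below; real deployments would use a proper NLP library or service, but the principle of labelling unstructured text is the same. The word lists are illustrative assumptions.

```python
# A minimal, lexicon-based sentiment categorization sketch (illustrative only).
POSITIVE = {"great", "delighted", "love", "excellent", "happy"}
NEGATIVE = {"frustrated", "poor", "terrible", "slow", "broken"}

def label_sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label_sentiment("I am delighted with the new dashboard"))  # positive
print(label_sentiment("Support was slow and I am frustrated"))   # negative
```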
Data Enrichment Best Practices
Data enrichment is rarely a once-off process. In an analytics environment where new data is fed into your system continuously, your enrichment steps will need to be repeated on an ongoing basis. A number of best practices are required to ensure your desired outcomes are met, and that your data remains of a high quality. These include:
Reproducibility and consistency
Each data enrichment task must be reproducible and generate the same expected results on a consistent basis. Any process you create needs to be rules driven, so that you can run it repeatedly with the confidence that you will always have the same outcome.
Clear Evaluation Criterion
Each data enrichment task must have a clear evaluation criterion. You must be able to assess that the process has run and has been successful. For example, after running a process, you are able to compare the recent outcomes with prior jobs, and see that the results are as expected.
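One way to express such a criterion in code is sketched below; the row-change and null-rate thresholds are illustrative assumptions, not prescribed values.

```python
# A minimal sketch of an evaluation check after an enrichment run, comparing
# the current output with the previous run (thresholds are illustrative).
def evaluate_run(current_rows: int, previous_rows: int,
                 current_null_rate: float, max_null_rate: float = 0.05) -> bool:
    """Return True if the run looks healthy compared with the prior job."""
    row_change = abs(current_rows - previous_rows) / max(previous_rows, 1)
    if row_change > 0.20:                  # row count moved more than 20%
        return False
    if current_null_rate > max_null_rate:  # too many unenriched records
        return False
    return True

print(evaluate_run(current_rows=10_400, previous_rows=10_000,
                   current_null_rate=0.01))  # True
```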
Scalability
Each data enrichment task should be scalable in terms of resources, timeliness and cost. Assuming that your data will grow over time, any process you create should remain maintainable as your data grows, or as you add other processes to your transformation workloads. For example, if your process is entirely manual, you will very quickly be constrained by the amount you can process within the required time, and the process will be cost intensive. Automate as much as possible, using infrastructure that can easily grow with your needs.
Completeness
Each data enrichment task must be complete with respect to the data that is input into the system, producing results with the same characteristics. This means that, for any intended output, you have anticipated all possible results, including cases where the result is 'unknown'. By being complete, you can be assured that when new input data is added to the system you will always have a valid outcome from the enrichment process.
Generality
The data enrichment task should be applicable to different data sets. Ideally, the processes that you create will be transferable to different datasets so that you can reuse logic across multiple jobs. For example, day-of-week extraction should be applied in exactly the same way to any date field. This ensures consistency of outcome, and helps to maintain the business rules associated with your data across all your subject domains.
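A sketch of a general, reusable enrichment step is shown below: the same day-of-week logic, written once and applied to any date column in any dataset. The function, column and dataset names are hypothetical.

```python
# A minimal sketch of a general, reusable enrichment step: one day-of-week
# derivation applied consistently to any date column in any dataset.
import pandas as pd

def add_day_of_week(df: pd.DataFrame, date_column: str) -> pd.DataFrame:
    """Derive '<date_column>_day_of_week' in the same way for every dataset."""
    out = df.copy()
    out[f"{date_column}_day_of_week"] = pd.to_datetime(out[date_column]).dt.day_name()
    return out

orders = pd.DataFrame({"order_date": ["2024-03-01", "2024-03-04"]})
tickets = pd.DataFrame({"created_at": ["2024-05-10", "2024-05-11"]})

print(add_day_of_week(orders, "order_date"))
print(add_day_of_week(tickets, "created_at"))
```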