Standardising our data mining processes using CRISP-DM

In recent months, we have been adopting the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model. CRISP-DM is a framework that provides a structured approach to planning and delivering data mining and analysis projects. It is a proven, robust methodology that is widely used in both industry and academia. Figure 1 illustrates the six phases that make up the CRISP-DM process:


Figure 1: The CRISP-DM process
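
For readers who prefer code to diagrams, the six phases and the movements between them can be sketched as below. This is a minimal illustration rather than part of our tooling: the names `Phase`, `NEXT_PHASES` and `can_move` are hypothetical, and the transitions follow the arrows in the standard CRISP-DM reference diagram.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases."""
    BUSINESS_UNDERSTANDING = "Business understanding"
    DATA_UNDERSTANDING = "Data understanding"
    DATA_PREPARATION = "Data preparation"
    MODELLING = "Modelling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


# Transitions as drawn in the standard CRISP-DM diagram (hypothetical encoding).
# Several arrows point backwards, which is why the process is not linear.
NEXT_PHASES = {
    Phase.BUSINESS_UNDERSTANDING: [Phase.DATA_UNDERSTANDING],
    Phase.DATA_UNDERSTANDING: [Phase.DATA_PREPARATION, Phase.BUSINESS_UNDERSTANDING],
    Phase.DATA_PREPARATION: [Phase.MODELLING],
    Phase.MODELLING: [Phase.EVALUATION, Phase.DATA_PREPARATION],
    Phase.EVALUATION: [Phase.DEPLOYMENT, Phase.BUSINESS_UNDERSTANDING],
    Phase.DEPLOYMENT: [],
}


def can_move(current: Phase, target: Phase) -> bool:
    """Return True if the diagram has an arrow from current to target."""
    return target in NEXT_PHASES[current]
```

For example, `can_move(Phase.MODELLING, Phase.DATA_PREPARATION)` is true, reflecting the common need to revisit data preparation once modelling begins.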

Importance of the business understanding phase

One of the key features that sets CRISP-DM apart from similar processes and methodologies is the importance it places on the business understanding phase. Although CRISP-DM is not designed to be a linear process, it is recommended that a project starts at this phase. The business understanding phase consists of four essential tasks:

  • Defining business aims and objectives
  • Assessment of the current situation
  • Proposing data mining goals and success criteria
  • Creating a project plan for delivery

We have found that clearly defining and documenting business aims and objectives at the start of a project is highly beneficial. Starting here has allowed our data analysts to take a step back, think of the bigger picture, and understand where the project sits within our broader business strategy. At this stage, the data analyst is required to pause, deconstruct the problems that need to be solved, and make sure they are properly understood. If these problems are not appropriately addressed, there is a risk that the project outcomes will be of limited use, or of no use at all. While many teams hurry through this phase, our experience is that establishing a solid business understanding and defining SMART (i.e. specific, measurable, attainable, relevant and time-bound) objectives creates a secure foundation for the project. A clear set of predefined objectives also protects the data analyst when project supervisors or clients try to add work after the project has begun. This helps to prevent scope creep and keeps the project focused on delivering value to our clients.

Efficient project planning

Another significant benefit of CRISP-DM has been the requirement to define data mining goals and outcomes and to create a finite project plan for delivering them. This has enabled our data analysts to plan and manage their projects much more effectively. Although CRISP-DM is not a project management process, we have found that using it alongside agile principles and practices provides the added benefits of:

  • fast and continuous delivery from multiple projects;
  • stakeholders being able to provide ongoing meaningful feedback; and
  • flexibility to adjust projects as business strategies and priorities change.

Projects often begin with many unknowns. By adopting an agile CRISP-DM approach, our data analysts have been able to gain a deeper understanding of the data and the problem with each cycle. The knowledge gained from previous cycles can then go on to support future projects.

Effective data management

Data preparation is one of the most critical and time-consuming phases of CRISP-DM. Some data analysts claim that approximately 50-70% of a project’s time and effort is spent in this phase, and this has also been our experience. However, we tend to use the same data sets for most of our projects (e.g. DMA attribute data, flow/pressure data, etc.). By devoting some extra time and effort to setting up secure data pipelines and efficiently managed data warehouses, we can significantly reduce future data preparation overheads: prepared data sets are provisioned once and can then be used across the business wherever they are required.
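
As a rough illustration of what "prepare once, reuse everywhere" can look like, the sketch below cleans a batch of flow readings in a single shared function. It is a minimal example under stated assumptions, not a description of our actual pipelines: the function name `prepare_flow_readings`, the (timestamp, flow) row layout, and the cleaning rules (drop missing or non-numeric values, keep the last reading for a duplicated timestamp, sort by time) are all hypothetical.

```python
from datetime import datetime


def prepare_flow_readings(raw_rows):
    """Clean raw (ISO timestamp, flow) text rows into sorted (datetime, float) pairs.

    Hypothetical rules for this sketch:
    - rows with a missing or non-numeric flow value are dropped;
    - if a timestamp appears more than once, the last reading wins;
    - the result is sorted by timestamp.
    """
    cleaned = {}
    for timestamp_text, flow_text in raw_rows:
        if flow_text is None or flow_text == "":
            continue  # missing reading
        try:
            flow = float(flow_text)
        except ValueError:
            continue  # non-numeric reading, e.g. a logger error code
        timestamp = datetime.fromisoformat(timestamp_text)
        cleaned[timestamp] = flow  # a later duplicate overwrites an earlier one
    return sorted(cleaned.items())


# Example usage with made-up readings
raw = [
    ("2024-01-01T00:15:00", "3.2"),
    ("2024-01-01T00:00:00", "3.0"),
    ("2024-01-01T00:15:00", "3.4"),  # duplicate timestamp
    ("2024-01-01T00:30:00", ""),     # missing value
    ("2024-01-01T00:45:00", "ERR"),  # non-numeric value
]
prepared = prepare_flow_readings(raw)
```

Keeping rules like these in one shared step, rather than re-implementing them in each project, is what lets later projects skip most of the preparation effort.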

Conclusion

These are just some of the main benefits we have gained by adopting the CRISP-DM approach. Going forward, we plan to review, refine, and customise these processes so that we continue to become more efficient in our delivery of data mining projects.

If you are interested in finding out more about using CRISP-DM for water network data analysis, please reach out to one of our team to hear more about our experiences.
