Which supplier is the best for storing your data? MarkLogic, MongoDB, Hadoop or another one? In this blog I present my own scoring system to rate each of these suppliers and to answer that crucial question: which supplier scores best?
I combined several guidelines for performing a database/tool selection. Here is the result, from a developer's perspective.
Just as when purchasing a car, you take a number of factors into account when selecting a tool or database: the weighting factors. I have removed from the list the factors on which all parties score the same, because those values make no difference to the result. These include security, Active Directory support (LDAP) and virtualization: all suppliers are compliant on these points, so they are not included in the weighting list.
Below I describe the weighting factors that in my opinion you should take into account when selecting a database and tooling:
Completeness
By completeness I mean: what does the organization get after the installation, and can you get started right away? I looked at the following aspects:
- a full development environment;
- monitoring of processes, incoming and outgoing data, and response times;
- support for multiple query languages and applications.
Costs
This score covers not only the costs you pay to the supplier (licences), but also the indirect costs, such as the implementation and application costs needed to get up and running.
Brand awareness
For brand awareness I use Gartner's Magic Quadrant. Is it easy to hire expertise? And what does the supplier do to get its product into the community (free developer edition, forums, etc.)?
Integration
Integration is not only the ease with which different applications can be linked together, but also the number of tools needed to do your job.
Scalability
How easily can you scale up and down to influence speed? The following factors are included in the weighting:
- Scale-out support: linking multiple machines together instead of making one server heavier. This technique is often less expensive than upgrading a single server.
- MapReduce support: in addition to scale-out, MapReduce is a technique with which data can be collected and aggregated quickly across a cluster of machines.
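The MapReduce idea above can be sketched in a few lines. This is a minimal, single-process illustration (a word count, the classic example); in a real cluster such as Hadoop, the map and reduce phases run in parallel on many machines:

```python
from collections import defaultdict

def map_phase(records):
    """Emit a (key, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data", "big databases store big data"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts["big"])  # 3
```

Because each map call and each per-key reduce is independent, the framework can distribute them across the cluster, which is exactly what makes the technique fast on large data sets.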
I look not only at the investment of energy, time and money needed to master the technology, but also at the range of documentation, courses and support offered by the suppliers.
Performance
This refers to the response time for storing data and the response time for retrieving it.
Applications
This is the diversity of BI, analytics and data science applications you can run after installing the database.
Different file types
To what extent can the tool ingest and index different file types (JSON, XML, CSV, video, audio and photos)? Indexing the data is important to be able to search the files based on their characteristics.
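To illustrate what searching by characteristics means, here is a minimal, hypothetical inverted-index sketch (the file names and tags are made-up examples; a database such as MarkLogic or MongoDB maintains far richer indexes automatically):

```python
from collections import defaultdict

# Made-up catalogue of files of different types, each with some metadata.
files = [
    {"name": "report.json", "type": "json",  "tags": ["sales", "2023"]},
    {"name": "intro.mp4",   "type": "video", "tags": ["marketing"]},
    {"name": "orders.csv",  "type": "csv",   "tags": ["sales"]},
]

# Inverted index: characteristic -> set of file names having it.
index = defaultdict(set)
for f in files:
    index[f["type"]].add(f["name"])
    for tag in f["tags"]:
        index[tag].add(f["name"])

# Lookup is now a single dictionary access instead of a scan of all files.
print(sorted(index["sales"]))  # ['orders.csv', 'report.json']
```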
It is also important to take into account the environmental factors that play a role within the organization.
Knowledge in the organization
The tooling and database must match the knowledge within the organization. Organizations that build up a lot of technology and knowledge in-house can handle a Big Data database that consists of more layers (and therefore more configuration).
Learning curve
The learning curve is the time it takes the organization to master the database and tooling. This in turn depends on the knowledge and skills of the developers and administrators.
Depending on the architecture chosen, integration of various sources may also be a must. In that case it is important that the Big Data database can communicate with SQL databases and message queuing (MQ) service buses.
This concerns the ease, and therefore the speed, with which software and processes can be transferred between environments (the DTAP pipeline: Development, Test, Acceptance, Production), and how different versions are managed.
For each weighting factor I set a maximum score, because I do not consider all weighting factors equally important (more on this later). As a result, not all factors can achieve an equally high score.
To score the weighting factors mentioned above, I use a scale from 1 to the factor's maximum score, where 1 is the worst score and the maximum is the best.
Distribution | Max. score
Table 1: Maximum score for the weighting factors
When this score is plotted in a matrix, you get the result shown in figure 1. In this matrix I have plotted only the score based on technical (im)possibilities. The business part depends on the situation and application, and must be entered in the second column.
Figure 1: Score Matrix for Big Data Database
Scores of the suppliers
To score the different suppliers on these factors I use the formula: package factor score = ((technical weight × package score) + (business weight × package score)) / 10. This formula ensures that the score can never be higher than the maximum score set for the factor.
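A minimal sketch of how that calculation could be implemented, assuming package scores on a 1–10 scale and weights that together form the factor's maximum score (the numbers below are hypothetical examples, not values from the matrix):

```python
def factor_score(technical_weight, business_weight, rating):
    """Score a package on one weighting factor.

    `rating` is the package's score for this factor on a 1-10 scale.
    Because the technical and business weights together form the
    factor's maximum score, dividing by 10 caps the result at that
    maximum: a perfect rating of 10 yields exactly the maximum.
    """
    return (technical_weight * rating + business_weight * rating) / 10

# Hypothetical factor with maximum score 8 (technical weight 5,
# business weight 3) and a perfect package rating of 10.
print(factor_score(5, 3, 10))  # 8.0
```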
In the matrix you can see exactly which supplier scores best. As you can see, Azure SQL Server ranks highest. However, the choice also depends on the purpose for which you want to use the database, and that can lead to a different outcome. My advice is to choose MarkLogic if you are going for a database based on a Data Lake. In the next blog, 'This is the best data storage provider', I explain why.