8 Reasons to Add Databricks to Your Stack Today
The Bulletproof ETL, AI & Analytics Platform Engineers Need to Try
Databricks, the lakehouse company, is on a mission to help data teams solve the world’s toughest problems. Founded by the original creators of Apache Spark™, Delta Lake, and MLflow, Databricks is relied upon by thousands of organisations worldwide (including Comcast, Condé Nast, Nationwide, and H&M) to provide an open and unified platform for data engineering, machine learning, and analytics.
With several alternatives out there to Databricks’ zero-management unified analytics cloud platform — including AWS EMR, Cloudera, Snowflake, and Treasure Data — you might wonder, why Databricks?
Our engineers found Databricks to be the ‘best in class’ option when trialling several big data and ETL solutions for a project in rail. We fell in love with Databricks’ toolkit, pricing model, and welcoming community. Although it might not be the preferred solution for every team, a plethora of clear use cases meant we made the switch, even becoming official Databricks Partners – and our engineers haven’t looked back!
So in answer to the key questions – ‘is Databricks good’ and ‘should I use Databricks’ – we would say, yes! Here are the eight reasons why:
1: Slots into Any Stack
Databricks is cloud-native, meaning it sits right on top of Azure, AWS, and Google Cloud environments. By offering this wrapper around core data storage, governance/management, and data science tools, Databricks combats the age-old problem of fragmentation – ensuring accessibility and cohesion of data architecture for the enterprise.
Engineers no longer need to fiddle around creating, testing, and deploying MapReduce or Spark jobs using a Hadoop distribution – and instead can focus on developing right inside Spark’s native interfaces for Python, R, Scala, or SQL as preferred.
2: Never Get Lost in the Source
The list of Spark data sources usable in Databricks is huge. Most are directly supported in Databricks Runtime, although a few do require some (very simple) shell commands to enable read/write access. Old faithfuls like MongoDB, Couchbase, JSON, and Parquet files connect like a dream – minimising the work needed to ensure compatibility with your existing data source structure.
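As a rough illustration of how little ceremony file-based sources need, the sketch below picks a Spark reader format from a file extension. It assumes it runs in a Databricks notebook, where a SparkSession is already available as `spark`; the helper names (`infer_format`, `load_dataset`), the extension mapping, and any paths are hypothetical and not part of any Databricks API.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a built-in Spark reader format.
FORMATS = {".json": "json", ".parquet": "parquet", ".csv": "csv"}


def infer_format(path: str) -> str:
    """Return the Spark format name for a path, based on its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"No reader configured for {path!r}")
    return FORMATS[suffix]


def load_dataset(spark, path: str):
    """Read a file into a DataFrame using the inferred format.

    `spark` is the SparkSession that Databricks notebooks provide by default.
    """
    return spark.read.format(infer_format(path)).load(path)
```

In a notebook, a call such as `load_dataset(spark, "/mnt/raw/events.parquet")` (an illustrative path) would return a DataFrame. Sources such as MongoDB or Couchbase go through their own connectors and options, which are not shown here.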
3: Next Generation Notebooks
Whether you’re a notebook convert or refuse to acknowledge Jupyter refers to anything except a planet, Databricks’ take on the modular coding tool will have something to offer you. Unlike the standard Jupyter notebook, Databricks Notebooks allow for cells to be written in different languages – Python, R, Scala, and SQL. This allows specific operations to be performed in the most suitable language regardless of the contents of the rest of the notebook, offering flexibility to developers that isn’t found elsewhere.
Of course, this language mixing can introduce some complexity, and with different specialisms across development and engineering teams, collaboration and versioning are going to be essential in keeping Databricks Notebooks coherent. Thankfully, collaborative working comparable to Google Docs and Colab is enabled by default, letting teams come together to write code in the same place at the same time. Strong version control also comes as standard, and integration with Git is simple to enable – allowing teams to control their product and utilise their repo of choice, be that Bitbucket or GitHub.
In-line visualisations are a welcome, albeit standard, notebook feature. Where Databricks stands apart here is that its cloud integrations mean notebooks can pull from big data sources in the cloud, offering quick graphical outputs for your huge, aggregated datasets in a way not matched by other options. A subtle benefit of the notebook structure is that the hidden data sharing between cells permits much faster calculation speeds than is standard – transforming one of notebooks’ greatest weaknesses into a surprising strength!
4: Macro- to Micro-Scalability
No job is too big for Databricks, and no job is too small either! Smooth cloud integrations that can reach across platforms like Azure Data Lake or Amazon Redshift can level up your data engineering in terms of dataset size, pipeline performance, and analytics fidelity. Granular scalability and the smooth Databricks Notebook also allow small-scale development, provisioning, and testing tasks to be performed in the same environment as your big data tasks.
Databricks means the age of compulsory multi-environment cloud solutions is over – developers are no longer forced to work across separate environments/VMs, which avoids inter-platform compatibility issues and cuts costs.
5: Go with the MLflow
If you’ve ever tried to develop a machine learning (ML) model, you know how complicated things can quickly become. There are hundreds of tools for every phase of the ML lifecycle, and a thorough engineering job is probably going to require you to try every algorithm available to see if it improves results. This means productionising multiple libraries, which can really slow down and complicate your workflow! Actually keeping track of which code, data, and parameters have been implemented in which test-case models can be a further nightmare. Combine all this with the broad spectrum of environments your model could need to work in and you have a recipe for time wasting and risky deployment.
Databricks solves this with Managed MLflow, a hosted version of the open-source, open-interface MLflow platform that helps engineers manage the ML lifecycle. Managed MLflow offers a tracking tool which logs and compares run outputs, a projects component for organising data science code into Git repos, and a models convention which packages ML models in multiple flavour formats for quick and flexible deployment – all sharable on a collaborative repository. Managed MLflow is a game changer, and essential for any modern ML implementation.
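For a flavour of the tracking component, here is a minimal sketch using the open-source `mlflow` Python package (assumed to be installed wherever the code runs). The function names `flatten_params` and `log_run` are hypothetical helpers written for this example; only `mlflow.start_run`, `mlflow.log_params`, and `mlflow.log_metrics` come from the library itself.

```python
def flatten_params(config: dict, prefix: str = "") -> dict:
    """Flatten a nested config into dotted keys, one value per parameter."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def log_run(run_name: str, config: dict, metrics: dict) -> None:
    """Record one training run's parameters and metrics with MLflow tracking."""
    import mlflow  # imported lazily; assumes the mlflow package is available

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(flatten_params(config))
        mlflow.log_metrics(metrics)
```

Calling something like `log_run("baseline", {"model": {"max_depth": 3}}, {"rmse": 0.42})` (values purely illustrative) would record the run so it can be compared against later experiments, which is the bookkeeping problem described above.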
6: Azure Sky Thinking
A platform-optimised version of Databricks is available to users of Microsoft’s Azure cloud service. The benefits are numerous: for example, by using the Azure Active Directory security framework, Azure Databricks easily integrates with your entire data ecosystem – pulling from Data Warehouse, Blob, Data Lake, and Event Hub using your existing credentials for authorisation. Reducing the number of separate Azure services required for data jobs is another major benefit, and users will be pleased to find Databricks constitutes a unified and effective alternative to the often costly options of Data Lake Analytics and HDInsight.
If you’re not using Microsoft’s platform right now, it is worth noting that Azure Databricks can supercharge the entire Azure stack – so it might be worth exploring if, with Azure Databricks, Microsoft’s platform could work for you. That said, Azure’s industry-leading ETL solution (Data Factory) now has a legitimate alternative in Databricks, which outcompetes it in terms of pricing, performance, and usability – so if entering the Microsoft ecosystem isn’t right for you, Databricks may well still be.
7: Detailed Documentation
We’ve all been there, scrabbling around in bad documentation trying to find the answer to our highly specific question. CTRL+F has uncovered nothing, and so we turn to Stack Overflow for respite, but alas the only time a similar question has been asked was nine years ago – the response from SpiffingBritProgrammer: “Check the documentation.”
You won’t encounter this problem with Databricks. Extensive documentation is available for both the platform and its supported languages. Microsoft even offers an expansive breakdown of their specialist Azure Databricks platform, as well as live-chat and forum support. The Databricks Community is also strong and positive – with a number of helpful experts waiting in specialised groups or open forum discussion boards to help you with your queries. Our engineers have reached out to the Databricks Community for help and have always received useful, friendly responses, much faster than from generic crowdsourced support options.
8: Cost Efficiency
For the business decision makers in the room, this is the big one! All of the above features are available commitment-free, under an incredibly competitive pricing model – with a 14-day free trial that gives you a no-cost taste of what Databricks can do.
The platform is by default pay-as-you-go, meaning you just pay for the resources you use on a per-second basis – eliminating overspend whilst providing unparalleled scalability and flexibility. But, if it works for your team, Databricks also offers committed-use discounts. This means that if your workload is predefined or relatively predictable, you can commit to specified usage levels and receive substantial discounts. Something to note here is that many of you are already going to be thoroughly forecasting usage to create in-depth pitches or project budgets, meaning work that’s being done anyway can be repurposed to help make some real savings.
A fair cost structure combines with this blended pay-as-you-go and committed-use billing model to make Databricks one of the most affordable ETL and analytics platforms out there for a huge range of use cases. When you layer in its incredible next-gen functionality, strong community, and robust flexibility, it becomes clear that Databricks is one of the most cost-efficient platforms on the market today. Although some other providers do work out cheaper per processing unit for some implementations, nowhere else can engineers extract such utility at a flexible and competitive cost.
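The pay-as-you-go versus committed-use trade-off can be sketched with some simple arithmetic. All rates, discounts, and usage figures below are made-up assumptions for illustration, not Databricks prices, and the model itself (a flat discount, with overage billed at the undiscounted rate) is a simplification.

```python
def pay_as_you_go_cost(seconds_used: float, rate_per_hour: float) -> float:
    """Cost when billed per second at an hourly on-demand rate."""
    return seconds_used / 3600 * rate_per_hour


def committed_cost(seconds_used, committed_hours, rate_per_hour, discount):
    """Cost when a block of hours is pre-purchased at a discount.

    Committed hours are paid for whether used or not; any usage beyond the
    commitment is billed at the undiscounted rate (a simplifying assumption).
    """
    hours_used = seconds_used / 3600
    commitment = committed_hours * rate_per_hour * (1 - discount)
    overage = max(0.0, hours_used - committed_hours) * rate_per_hour
    return commitment + overage


# Hypothetical month: 500 hours used, 400 hours committed at a 20% discount.
usage_seconds = 500 * 3600
on_demand = pay_as_you_go_cost(usage_seconds, rate_per_hour=2.0)
committed = committed_cost(usage_seconds, 400, rate_per_hour=2.0, discount=0.2)
print(f"on-demand: {on_demand:.2f}, committed: {committed:.2f}")
```

With these invented numbers the commitment comes out cheaper, but if usage fell to 300 hours the same commitment would cost more than paying on demand. That is why the usage forecasting mentioned above matters before committing.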
Closing Thoughts
The word ‘overengineering’ is enough to frighten any tech specialist, so as engineers who take pride in efficient, use-case-appropriate solutions, we at Distributed Analytics had always been nervous about experimenting with specialised big data platforms. Our engineers were especially worried that notebook interfaces could hamper control, and that automatic distribution of processing would lead to unforeseen and expensive pipeline complexities.
With Databricks, these concerns were completely unfounded. The platform tessellates perfectly with existing architecture, offers intuitive collaboration, and makes version control easy, all at a competitive price. The cost efficiency is incredible in comparison to competitor platforms. This keeps our team’s code quality high, and their stress level low.
Databricks is clearly a next-gen platform, with multiplatform interoperability facilitating a resilient stack, fluid environments, and a cohesive data fabric. This means implemented solutions remain flexible (and therefore viable) long term. Engineers and clients no longer need to construct everything inside one cloud platform using its proprietary toolkit, and are saved from investing millions into static stacks. Our team would encourage every organisation to take a look at Databricks – especially if they use Azure.