Azure Databricks

What is Azure Databricks?

Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks offers three environments for developing data-intensive applications: Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine Learning.

Databricks SQL provides an easy-to-use platform for analysts who want to run SQL queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.

Databricks Data Science & Engineering provides an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers. In a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Apache Kafka, Azure Event Hubs, or Azure IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources and turn it into breakthrough insights using Spark.
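
As a sketch of that last step, here is a minimal PySpark example of reading data that has already landed in the lake. The storage account, container, paths, and column names are hypothetical placeholders; in a Databricks notebook the `spark` session already exists.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` is predefined; getOrCreate() is a no-op there.
spark = SparkSession.builder.getOrCreate()

# Raw JSON events ingested in batches (for example, by Azure Data Factory).
events = spark.read.json("abfss://raw@<storage-account>.dfs.core.windows.net/events/")

# A curated Parquet dataset from the same data lake.
customers = spark.read.parquet("abfss://curated@<storage-account>.dfs.core.windows.net/customers/")

# Combine the two sources and derive a simple insight: event counts per segment.
summary = (events.join(customers, "customer_id")
                 .groupBy("segment")
                 .count())
summary.show()
```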

Databricks Machine Learning is an integrated end-to-end machine learning environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.
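
Experiment tracking in Databricks Machine Learning is built on managed MLflow, so a tracked run looks like ordinary MLflow code. A minimal sketch (the parameter and metric values are purely illustrative):

```python
import mlflow

# Each run records its parameters, metrics, and artifacts to the tracking service.
with mlflow.start_run(run_name="example-run"):
    mlflow.log_param("alpha", 0.5)    # a hyperparameter for this run
    mlflow.log_metric("rmse", 0.87)   # an evaluation metric
```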

Why Databricks:

Speed:

Anyone familiar with Apache Spark knows that it is fast. It can run up to 100x faster than Hadoop MapReduce when running in memory, or up to 10x faster when running on disk. Azure Databricks is even faster!

Security:

Azure Databricks integrates directly with Azure Active Directory (AAD) out of the box, with no custom configuration. This differs greatly from Apache Spark on Azure HDInsight, where AAD integration is a premium feature requiring considerable configuration using Apache Ranger.

After creating the Azure Databricks service and initializing the Databricks workspace, users with access can simply go to the workspace URL and log in using their AAD credentials.

Collaboration:

Collaboration is the third reason to choose Azure Databricks for data science and data engineering workloads. Azure Databricks provides a platform where data scientists and data engineers can easily share workspaces, clusters, and jobs through a single interface. They can also commit their code and artifacts to popular source control tools such as GitHub.

Within Azure Databricks, users can spin up clusters, create interactive notebooks and schedule jobs to run those notebooks. Using the Azure Databricks portal, users can then easily share these artifacts with other users. This allows users to create and build models together in the same notebook in real time, to re-use data assets, libraries and compute resources across the same cluster, or to re-use and monitor scheduled jobs.
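
As an illustration of scheduling, a notebook can also be turned into a recurring job programmatically. Below is a hedged sketch using the Jobs REST API (version 2.1); the workspace URL, token, notebook path, and cluster ID are placeholders.

```python
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

job_spec = {
    "name": "nightly-notebook-run",
    "tasks": [{
        "task_key": "run_notebook",
        "notebook_task": {"notebook_path": "/Users/someone@example.com/my_notebook"},
        "existing_cluster_id": "<cluster-id>",
    }],
    # Run every night at 02:00 UTC (Quartz cron syntax).
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(f"{workspace_url}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())  # returns the new job_id on success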

Databricks Concepts:

Some concepts are general to Databricks, and others are specific to the persona-based Databricks environment you are using:

  • Databricks Data Science & Engineering
  • Databricks Machine Learning
  • Databricks SQL

General Concepts:

1. Accounts and workspaces:

In Databricks, the term workspace has two meanings:

  1. A Databricks deployment in the cloud that functions as the unified environment that your team uses for accessing all of their Databricks assets. Your organization can choose to have multiple workspaces or just one: it depends on your needs.

  2. The UI for the Databricks Data Science & Engineering and Databricks Machine Learning persona-based environments. This is as opposed to the Databricks SQL environment.

    When we talk about the “workspace browser,” for example, we are talking about the UI that lets you browse notebooks, libraries, and other files in the Data Science & Engineering and Databricks Machine Learning environments—a UI that isn’t part of the Databricks SQL environment. But Data Science & Engineering, Databricks Machine Learning, and Databricks SQL are all included in your deployed Databricks workspace.

A Databricks account represents a single entity for purposes of billing and support; it can include multiple workspaces.

2. Authentication and Authorization:

User

A unique individual who has access to the system.

Group

A collection of users.

Access control list (ACL)

A list of permissions attached to the workspace, cluster, job, table, or experiment. An ACL specifies which users or system processes are granted access to the objects, as well as what operations are allowed on the assets. Each entry in a typical ACL specifies a subject and an operation.
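
To make the subject/operation structure concrete, here is a purely illustrative sketch. The permission names resemble Databricks permission levels, but this is not the actual Permissions API.

```python
# Hypothetical ACL: each entry pairs a subject with an allowed operation.
acl = [
    {"subject": "group:data-engineers", "operation": "CAN_MANAGE"},
    {"subject": "user:analyst@example.com", "operation": "CAN_VIEW"},
]

def is_allowed(acl, subject, operation):
    """Return True if the ACL grants the subject the requested operation."""
    return any(entry["subject"] == subject and entry["operation"] == operation
               for entry in acl)

print(is_allowed(acl, "user:analyst@example.com", "CAN_VIEW"))    # True
print(is_allowed(acl, "user:analyst@example.com", "CAN_MANAGE"))  # False
```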

Databricks Data Science & Engineering:

Databricks Data Science & Engineering is the classic Databricks environment for collaboration among data scientists, data engineers, and data analysts. This section describes the fundamental concepts you need to understand in order to work effectively in the Databricks Data Science & Engineering environment.

Workspace:

A workspace is an environment for accessing all of your Databricks assets. A workspace organizes objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.

This section describes the objects contained in the Databricks workspace folders.

Notebook

A web-based interface to documents that contain runnable commands, visualizations, and narrative text.

Dashboard

An interface that provides organized access to visualizations.

Library

A package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries, and you can add your own.
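
For example, preinstalled libraries can be imported directly, and your own notebook-scoped libraries can be added with a %pip cell in a Databricks notebook (the package name below is just an example):

```python
# Databricks runtimes ship with many common libraries preinstalled.
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]})
print(df.describe())

# Adding your own notebook-scoped library is a one-line notebook cell:
# %pip install great-expectations
```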

Repo

A folder whose contents are co-versioned together by syncing them to a remote Git repository.

Data Science & Engineering Interfaces:

UI

The Databricks UI provides an easy-to-use graphical interface to workspace folders and their contained objects, data objects, and computational resources.

REST API

There are three versions of the REST API: 2.1, 2.0, and 1.2. Versions 2.1 and 2.0 support most of the functionality of version 1.2 plus additional functionality, and they are the preferred versions.
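
A short sketch of calling version 2.0 of the REST API to list a workspace's clusters; the workspace URL and personal access token are placeholders.

```python
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

# GET /api/2.0/clusters/list returns the clusters visible to the caller.
resp = requests.get(f"{workspace_url}/api/2.0/clusters/list",
                    headers={"Authorization": f"Bearer {token}"})
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["state"])
```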

CLI

An open-source project hosted on GitHub. The CLI is built on top of the REST API.

Data Management in Data Science & Engineering:

Databricks File System (DBFS)

A filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files, libraries, and images), and other directories. DBFS is automatically populated with some datasets that you can use to learn Databricks.
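
From a notebook, DBFS can be browsed with dbutils (available in Databricks notebooks) and read directly with Spark; /databricks-datasets is where the bundled sample datasets live.

```python
# List the sample datasets that come with every workspace.
for f in dbutils.fs.ls("/databricks-datasets"):
    print(f.path)

# DBFS paths can be read by Spark just like any other filesystem path.
df = spark.read.text("/databricks-datasets/README.md")
df.show(5, truncate=False)
```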

Database

A collection of information that is organized so that it can be easily accessed, managed, and updated.

Table

A representation of structured data. You query tables with Apache Spark SQL and Apache Spark APIs.
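
Both query styles look like this in practice; the table name "sales" and its columns are hypothetical.

```python
# Spark SQL: query the table by name.
via_sql = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# DataFrame API: the same aggregation expressed programmatically.
via_api = (spark.table("sales")
                .groupBy("region")
                .sum("amount"))

via_sql.show()
```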

Metastore

The component that stores all the structural information of the various tables and partitions in the data warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored. Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. You also have the option to use an existing external Hive metastore.
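
A brief sketch of how table metadata reaches the metastore: saveAsTable registers the table so that any cluster sharing the metastore can query it by name. The database and table names below are illustrative.

```python
# Create a small DataFrame and persist it as a managed table.
df = spark.range(10).withColumnRenamed("id", "value")
spark.sql("CREATE DATABASE IF NOT EXISTS demo")
df.write.mode("overwrite").saveAsTable("demo.numbers")

# The metastore now holds the table's schema and storage location.
spark.sql("DESCRIBE TABLE EXTENDED demo.numbers").show(truncate=False)
```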

