Announcing Ray support on Databricks and Apache Spark Clusters

March 4, 2023


Ray is a prominent compute framework for running scalable AI and Python workloads, offering a variety of distributed machine learning tools, large-scale hyperparameter tuning capabilities, reinforcement learning algorithms, model serving, and more. Similarly, Apache Spark™ provides a wide variety of high-performance algorithms for distributed machine learning through Spark MLlib and deep integrations with machine learning frameworks including XGBoost, TensorFlow, and PyTorch. In order to build the best models, machine learning practitioners frequently need to explore multiple algorithms, often requiring the use of multiple platforms including both Ray and Spark. Today, with the release of Ray version 2.3.0, we are excited to announce that Ray workloads are now supported on Databricks and Spark standalone clusters, dramatically simplifying model development across both platforms.

Create a Ray cluster on Databricks or Spark
To start Ray on your Databricks or Spark cluster, simply install the latest version of Ray and call the ray.util.spark.setup_ray_cluster() function, specifying the number of Ray workers and the compute resource allocation. Any Databricks cluster with Databricks Runtime version 12.0 or above is supported, as well as any Spark cluster running version 3.3 or above. For example, the following code installs Ray in a Databricks notebook and initializes a Ray cluster with two worker nodes:


# Install Ray with the 'default', 'rllib', and 'tune' extensions for
# Ray dashboard, reinforcement learning, and tuning support
%pip install "ray[default,rllib,tune]>=2.3.0"

from ray.util.spark import setup_ray_cluster

setup_ray_cluster(num_worker_nodes=2)
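The setup_ray_cluster() function also accepts arguments for sizing each Ray node, which is how the compute resource allocation mentioned above is controlled. The following is a brief sketch assuming the num_cpus_per_node argument and the shutdown_ray_cluster() helper from the Ray on Spark API; check the API reference for the options available in your Ray version:

from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# Give each of the two Ray worker nodes four CPU cores
setup_ray_cluster(num_worker_nodes=2, num_cpus_per_node=4)

# When you are finished, tear the Ray cluster down to return
# its resources to the Spark cluster
shutdown_ray_cluster()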

With just a few lines of code, you have created a Ray cluster and are ready to start training models.
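Before moving on to model training, you can confirm that the cluster is reachable by connecting to it and running a simple remote task. This is a minimal sketch that assumes, per the Ray on Spark documentation, that setup_ray_cluster() has already recorded the cluster address so that a bare ray.init() call attaches to the new cluster:

import ray

# Connect to the Ray cluster created by setup_ray_cluster()
ray.init()

@ray.remote
def square(x):
    return x * x

# Fan four small tasks out to the Ray workers and gather the results
print(ray.get([square.remote(i) for i in range(4)]))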

Train models with Ray Train and Ray RLlib
Now that you’ve started a Ray cluster, it’s time to harness the power of distributed machine learning to build a model. All Ray applications and Ray-integrated machine learning algorithms are supported on Databricks and Spark clusters without any modifications. For example, you can use the Ray Train API in your Databricks notebook to distribute your XGBoost model training across the cluster, reducing training time:


# Install xgboost-ray for distributed XGBoost training on Ray
%pip install xgboost-ray

import pandas as pd
import ray.data
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer
from sklearn.datasets import fetch_california_housing

housing_dataset = fetch_california_housing(as_frame=True)
housing_df = pd.concat(
    [housing_dataset.data, housing_dataset.target], axis=1
)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="MedHouseVal",
    num_boost_round=20,
    params={
        "objective": "reg:squarederror",
        # Use a regression metric that matches the squared-error objective
        "eval_metric": ["rmse"],
    },
    datasets={"train": ray.data.from_pandas(housing_df)}
)
training_result = trainer.fit()
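The fit() call returns a Ray AIR Result object. As a quick follow-up (this uses the generic Ray AIR result API, nothing Databricks-specific), you can inspect the final reported metrics and the checkpoint of the trained model:

# Final metrics reported during training (e.g. train-rmse)
print(training_result.metrics)

# Checkpoint holding the trained XGBoost model
print(training_result.checkpoint)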

Ray also provides native support for reinforcement learning. For example, you can run the following Ray RLlib code in your Databricks notebook to train a PPO (Proximal Policy Optimization) agent in the Taxi Gymnasium environment:


from ray.rllib.algorithms.ppo import PPOConfig

config = (  # 1. Configure the algorithm,
    PPOConfig()
    .environment("Taxi-v3")
    .rollouts(num_rollout_workers=2)
    .framework("tf2")
    .training(model={"fcnet_hiddens": [64, 64]})
    .evaluation(evaluation_num_workers=1)
)

algo = config.build()  # 2. build the algorithm,

for _ in range(3):
    print(algo.train())  # 3. train it,

algo.evaluate()  # 4. and evaluate it.
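Once training finishes, you will typically want to persist the learned policy. The following sketch uses RLlib's generic checkpointing API, which is assumed here rather than taken from the announcement itself:

from ray.rllib.algorithms.algorithm import Algorithm

# Save a checkpoint of the trained policy
checkpoint_path = algo.save()

# Restore the algorithm from the checkpoint, e.g. in a later session
restored_algo = Algorithm.from_checkpoint(checkpoint_path)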

For additional model training information and examples, check out the Ray Train documentation and the Ray RLlib documentation.

Find optimal models with Ray Tune
To improve the quality of your models, you can also leverage Ray Tune to explore thousands of hyperparameter configurations in parallel at scale. For example, the following code uses Ray Tune to optimize a scikit-learn classification model:


# Install the scikit-learn integration for Ray Tune
%pip install tune-sklearn

from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from ray.tune.sklearn import TuneGridSearchCV

X, y = load_iris(return_X_y=True)
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}
tune_search = TuneGridSearchCV(
    SGDClassifier(), parameter_grid, max_iters=10
)
tune_search.fit(X, y)
best_model = tune_search.best_estimator_

More information and examples about model tuning on Ray, including the use of Ray with MLflow, are available in the Ray Tune documentation.
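For instance, trial results can be logged to MLflow through Ray's integration callback. The sketch below is an illustration only: it assumes the ray.air.integrations.mlflow module shipped with Ray 2.3, and the experiment name is a made-up placeholder:

from ray import air, tune
from ray.air import session
from ray.air.integrations.mlflow import MLflowLoggerCallback

def objective(config):
    # Toy objective: report one score per trial
    session.report({"score": config["x"] ** 2})

tuner = tune.Tuner(
    objective,
    param_space={"x": tune.grid_search([1, 2, 3])},
    run_config=air.RunConfig(
        # "ray_tune_demo" is a placeholder experiment name
        callbacks=[MLflowLoggerCallback(experiment_name="ray_tune_demo")]
    ),
)
results = tuner.fit()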

View the Ray dashboard

After starting Ray on a Databricks cluster, a link to the Ray dashboard is displayed.

Throughout model development, you can monitor the progress of your Ray machine learning tasks and the health of your Ray nodes using the Ray dashboard. When you create your Ray cluster, the ray.util.spark.setup_ray_cluster() function displays a link to the Ray dashboard.

The Ray dashboard provides a detailed view of your cluster’s nodes, actors, logs, and more.

The Ray dashboard provides a comprehensive view of your Ray cluster’s nodes, actors, metrics, and event logs. You can easily view resource utilization metrics for individual nodes, as well as aggregate metrics across all nodes. For more information, visit the Ray dashboard documentation.

Get started with Ray on Databricks or Spark today
With the availability of Ray 2.3.0, you can start running Ray applications on your Databricks or Spark clusters today. If you’re a Databricks customer, simply create a cluster with Databricks Runtime 12.0 or above and check out the Ray on Databricks documentation to get started. If you’re running open source Spark, instructions for launching Ray on a standalone Spark cluster are provided in the Ray on Spark documentation. You can also visit https://docs.ray.io/en/latest/ to learn more about machine learning on Ray.

We are very excited about this step forward in interoperability for distributed machine learning and look forward to powering your Ray applications on Apache Spark™ and Databricks!


