Black Cat Security

Next Generation SageMaker Notebooks – Now with Built-in Data Preparation, Real-Time Collaboration, and Notebook Automation

In 2019, we introduced Amazon SageMaker Studio, the first fully integrated development environment (IDE) for data science and machine learning (ML). SageMaker Studio gives you access to fully managed Jupyter Notebooks that integrate with purpose-built tools to perform all ML steps, from preparing data to training and debugging models, tracking experiments, deploying and monitoring models, and managing pipelines.

Today, I’m excited to announce the next generation of Amazon SageMaker Notebooks to increase efficiency across the ML development workflow. You can now improve data quality in minutes with the built-in data preparation capability, edit the same notebooks with your teams in real time, and automatically convert notebook code to production-ready jobs.

Let me show you what’s new!

New Notebook Capability for Simplified Data Preparation
The new built-in data preparation capability is powered by Amazon SageMaker Data Wrangler and is available in SageMaker Studio notebooks. SageMaker Studio notebooks automatically generate key visualizations on top of pandas data frames to help you understand data distribution and identify data quality issues, such as missing values, invalid data, and outliers. You can also select the target column for ML models and generate ML-specific insights, such as class imbalance or highly correlated columns. You then receive recommendations for data transformations to resolve the issues. You can apply the data transformations right in the UI, and SageMaker Studio notebooks automatically generate the corresponding transformation code in notebook cells, which you can use to replay your data preparation pipeline.

Using the Built-in Data Preparation Capability
To get started, pip install and import sagemaker_datawrangler along with the pandas Python package. Then, download the dataset you want to analyze to the notebook working directory, and read the dataset with pandas.

import pandas as pd 
import sagemaker_datawrangler 

!aws s3 cp s3://<YOUR_S3_BUCKET>/data.csv . 

df = pd.read_csv("data.csv")

Now, when you display the data frame, it automatically shows key data visualizations at the top of each column, surfaces data insights, detects data quality issues, and suggests solutions to improve data quality. When you select a column as the target column for ML predictions, you get target-specific insights and warnings, such as mixed data types in target (for regression use cases) or too few instances per class (for classification use cases).

In this example, I’m using the Women’s E-Commerce Clothing Reviews dataset that contains customer reviews and ratings for women’s clothing. This dataset was obtained from Kaggle and has been modified by Amazon to add synthetic data quality issues.

Amazon SageMaker Studio notebooks with built-in data preparation

You can review the suggested data transformations to improve the data quality and apply them right in the UI. For a list of all supported data transformations, have a look at the documentation. Once you apply a data transformation, SageMaker Studio notebooks automatically generate the code to reproduce those data preparation steps in another notebook cell.

For my example, I select Rating as my target column. The target column insights show a high-priority warning that this column has too few instances per class and a medium-priority warning that the classes are too imbalanced. Let’s follow the suggestions to drop rare target values and drop missing values. I will also follow the suggestions for some of the feature columns: drop missing values in the Review Text column and drop the Division Name column.
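To give a feel for what warnings like these check, here is a minimal sketch in plain Python. It is my own illustration, not the logic that SageMaker Data Wrangler runs: the function name target_insights and both thresholds are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical thresholds chosen for illustration; they are NOT the rules
# that SageMaker Data Wrangler applies internally.
MIN_INSTANCES_PER_CLASS = 3
MAX_IMBALANCE_RATIO = 5.0


def target_insights(labels):
    """Return a list of (priority, message) warnings for a classification target."""
    counts = Counter(label for label in labels if label is not None)
    warnings = []

    # Classes with fewer rows than the threshold are reported as rare.
    rare = sorted(label for label, n in counts.items() if n < MIN_INSTANCES_PER_CLASS)
    if rare:
        warnings.append(("high", f"Too few instances per class: {rare}"))

    missing = sum(1 for label in labels if label is None)
    if missing:
        warnings.append(("medium", f"Missing values: {missing}"))

    # Compare the largest and smallest of the remaining (non-rare) classes.
    frequent = [n for n in counts.values() if n >= MIN_INSTANCES_PER_CLASS]
    if len(frequent) > 1 and max(frequent) / min(frequent) > MAX_IMBALANCE_RATIO:
        warnings.append(("medium", "Classes too imbalanced"))

    return warnings


# Toy target column with two rare values and one missing value.
ratings = [5] * 20 + [4] * 3 + [-100, 100, None]
for priority, message in target_insights(ratings):
    print(priority, message)
```

The toy column triggers all three checks; a balanced column with enough rows per class returns an empty list.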

Once I apply the transformations, the notebook generates this code for me:

# Pandas code generated by sagemaker_datawrangler
output_df = df.copy(deep=True)


# Code to Drop rare target values for column: Rating to resolve warning: Too few instances per class 
rare_target_labels_to_drop = ['-100', '100']
output_df = output_df[~output_df['Rating'].isin(rare_target_labels_to_drop)]


# Code to Drop missing for column: Rating to resolve warning: Missing values 
output_df = output_df[output_df['Rating'].notnull()]


# Code to Drop missing for column: Review Text to resolve warning: Missing values 
output_df = output_df[output_df['Review Text'].notnull()]


# Code to Drop column for column: Division Name to resolve warning: Missing values 
output_df=output_df.drop(columns=['Division Name'])

I can now review and modify the code if needed or start integrating the data transformations as part of my ML development workflow.
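One way to integrate the generated steps is to wrap them in a function so they can be replayed on new data. The sketch below is my own illustration rather than output from sagemaker_datawrangler; the function name prepare_reviews and the sample rows are hypothetical, and the Rating values are assumed to be stored as strings.

```python
import pandas as pd


def prepare_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Replay the generated data preparation steps on a copy of df."""
    output_df = df.copy(deep=True)

    # Drop rare target values in the Rating column.
    rare_target_labels_to_drop = ["-100", "100"]
    output_df = output_df[~output_df["Rating"].isin(rare_target_labels_to_drop)]

    # Drop rows with a missing Rating or a missing Review Text.
    output_df = output_df[output_df["Rating"].notnull()]
    output_df = output_df[output_df["Review Text"].notnull()]

    # Drop the Division Name column.
    return output_df.drop(columns=["Division Name"])


# Hypothetical sample rows, used only to exercise the function.
sample = pd.DataFrame(
    {
        "Rating": ["5", "4", "-100", None, "5"],
        "Review Text": ["Love it", None, "Odd", "Fine", "Great fit"],
        "Division Name": ["General", "General", None, "Intimates", "General"],
    }
)
clean = prepare_reviews(sample)
print(clean)
```

Because the function works on a deep copy, the original data frame stays untouched and the same steps can be applied to a fresh extract of the dataset later.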

Introducing Shared Spaces for Team-Based Sharing and Real-Time Collaboration
SageMaker Studio now offers shared spaces that give data science and ML teams a workspace where they can read, edit, and run notebooks together in real time to streamline collaboration and communication during the development process. Shared spaces provide a shared Amazon EFS directory that you can utilize to share files within a shared space. All taggable SageMaker resources that you create in a shared space are automatically tagged to help you organize and have a filtered view of your ML resources, such as training jobs, experiments, and models, that are relevant to the business problem you work on in the space. This also helps you monitor costs and plan budgets using tools such as AWS Budgets and AWS Cost Explorer.

And that’s not all. You can now also create multiple SageMaker domains within the same AWS account to scope access and isolate resources to different teams or business units in your organization. Now, let me show you how to create a shared space for users within a SageMaker domain.

Using Shared Spaces
You can use the SageMaker console or the AWS CLI to create shared spaces for a SageMaker domain. To get started in the SageMaker console, go to Domains, select or create a new domain, and select Space management on the Domain details page. Then, select Create and give the shared space a name.
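If you prefer to script this step, the following is a minimal sketch using the AWS SDK for Python (boto3) instead of the console or the AWS CLI. The helper names, the domain ID, and the space name are placeholders I made up for illustration, and the call itself needs boto3 installed and AWS credentials with permission to create spaces.

```python
def build_create_space_request(domain_id: str, space_name: str) -> dict:
    """Assemble keyword arguments for the SageMaker CreateSpace API call."""
    if not domain_id or not space_name:
        raise ValueError("domain_id and space_name are both required")
    return {"DomainId": domain_id, "SpaceName": space_name}


def create_shared_space(domain_id: str, space_name: str) -> str:
    """Create the space and return its ARN (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without the SDK

    client = boto3.client("sagemaker")
    response = client.create_space(**build_create_space_request(domain_id, space_name))
    return response["SpaceArn"]


# Placeholder identifiers; replace them with your own domain ID and space name.
request = build_create_space_request("d-xxxxxxxxxxxx", "my-team-space")
print(request)
```

Only build_create_space_request runs here; calling create_shared_space sends the request to your account.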

Amazon SageMaker Spaces - Create Space

Users in this SageMaker domain can now launch and join the shared space through their SageMaker domain user profiles.

Amazon SageMaker Spaces - Launch Spaces

In a shared space, select the new Collaborators icon in the left navigation menu. You can now see who else is currently active in this space. The following screenshot shows user tom on the left, editing a notebook file. On the right, user antje sees the edits in real time, together with an annotation showing the name of the user who is currently editing that notebook cell.

Amazon SageMaker Spaces

New Notebook Capability to Automatically Convert Notebook Code to Production-Ready Jobs
You can now select a notebook and automate it as a job that can run in a production environment without the need to manage the underlying infrastructure. When you create a SageMaker Notebook Job, SageMaker Studio takes a snapshot of the entire notebook, packages its dependencies in a container, builds the infrastructure, runs the notebook as an automated job on a schedule you define, and deprovisions the infrastructure upon job completion. This notebook capability is now also available in SageMaker Studio Lab, our free ML development environment that provides the compute, storage, and security to learn and experiment with ML.

Using the Notebook Capability to Automate Notebooks
To get started, open a notebook file in SageMaker Studio. Then, right-click your notebook file and select Create Notebook Job or select the Create Notebook Job icon, as highlighted in the following screenshot.

Amazon SageMaker Studio - Automate your notebooks

Define a name for the Notebook Job, review the input file location, specify the compute type to use, and choose whether to run the job immediately or on a schedule. Then, select Create.

Amazon SageMaker Studio - Create Notebook Job

The Notebook Job has been created, and you can review all Notebook Job Definitions in the UI.

Amazon SageMaker Studio - Notebook Job Definitions

Now Available
The new Amazon SageMaker Studio notebook capabilities are now available in all AWS Regions where Amazon SageMaker Studio is available except for the AWS China Regions.

At launch, the built-in data preparation capability powered by SageMaker Data Wrangler is supported for SageMaker Studio notebooks and the following notebook kernel images:

  • Python 3 (Data Science) with Python 3.7
  • Python 3 (Data Science 2.0) with Python 3.8
  • Python 3 (Data Science 3.0) with Python 3.10
  • Spark Analytics 1.0 and 2.0

For more information, visit Amazon SageMaker Notebooks.

Start building your ML projects with the next generation of Amazon SageMaker Notebooks today!

— Antje

New – Share ML Models and Notebooks More Easily Within Your Organization with Amazon SageMaker JumpStart

Amazon SageMaker JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. SageMaker JumpStart gives you access to built-in algorithms with pre-trained models from popular model hubs, pre-trained foundation models to help you perform tasks such as article summarization and image generation, and end-to-end solutions to solve common use cases.

Today, I’m happy to announce that you can now share ML artifacts, such as models and notebooks, more easily with other users that share your AWS account using SageMaker JumpStart.

Using SageMaker JumpStart to Share ML Artifacts
Machine learning is a team sport. You might want to share your models and notebooks with other data scientists in your team to collaborate and increase productivity. Or, you might want to share your models with operations teams to put your models into production. Let me show you how to share ML artifacts using SageMaker JumpStart.

In SageMaker Studio, select Models in the left navigation menu. Then, select Shared models and Shared by my organization. You can now discover and search ML artifacts that other users shared within your AWS account. Note that you can add and share ML artifacts developed with SageMaker as well as those developed outside of SageMaker.

To share a model or notebook, select Add. For models, provide basic information, such as title, description, data type, ML task, framework, and any additional metadata. This information helps other users to find the right models for their use cases. You can also enable training and deployment for your model. This allows users to fine-tune your shared model and deploy the model in just a few clicks through SageMaker JumpStart.

Amazon SageMaker Jumpstart - Add model to private ML hub

To enable model training, you can select an existing SageMaker training job that will autopopulate all relevant information. This information includes the container framework, training script location, model artifact location, instance type, default training and validation datasets, and target column. You can also provide custom model training information by selecting a prebuilt SageMaker Deep Learning Container or selecting a custom Docker container in Amazon ECR. You can also specify default hyperparameters and metrics for model training.

To enable model deployment, you also need to define the container image to use, the inference script and model artifact location, and the default instance type. Have a look at the SageMaker Developer Guide to learn more about model training and model deployment options.

Sharing a notebook works similarly. You need to provide basic information about your notebook and the Amazon S3 location of the notebook file.

Amazon SageMaker JumpStart - Add a notebook to private ML hub

Users that share your AWS account can now browse and select shared models to fine-tune, deploy endpoints, or run notebooks directly in SageMaker JumpStart.

In SageMaker Studio, select Quick start solutions in the left navigation menu, then select Solutions, models, example notebooks to access all shared ML artifacts, together with pre-trained models from popular model hubs and end-to-end solutions.

Amazon SageMaker JumpStart

Now Available
The new ML artifact-sharing capability within Amazon SageMaker JumpStart is available today in all AWS Regions where Amazon SageMaker JumpStart is available. To learn more, visit Amazon SageMaker JumpStart and the SageMaker JumpStart documentation.

Start sharing your models and notebooks with Amazon SageMaker JumpStart today!

— Antje

AWS Machine Learning University New Educator Enablement Program to Build Diverse Talent for ML/AI Jobs

AWS Machine Learning University is now providing a free educator enablement program. This program provides faculty at community colleges, minority-serving institutions (MSIs), and historically Black colleges and universities (HBCUs) with the skills and resources to teach data analytics, artificial intelligence (AI), and machine learning (ML) concepts to build a diverse pipeline for in-demand jobs of today and tomorrow.

According to the National Science Foundation, Black and Hispanic or Latino students earn bachelor’s degrees in Computer Science—the dominant pathway to AI/ML—at a much lower rate than their white peers, earning less than 11 percent of computer science degrees awarded. However, research shows that having diverse perspectives among skilled practitioners and across the AI/ML lifecycle contributes to the development of AI/ML systems that are safe, trustworthy, and less biased.

In 2018, we announced the Machine Learning University (MLU) to share with all developers the same courses that we used to train engineers at Amazon and AWS. This platform offers self-service, self-paced, AI/ML digital courses.

Machine Learning University home page

And today, we add this new program to our AI/ML training offering. Although anyone can access the MLU self-paced learning, it places the burden on the learner to source prerequisite work and solutions. This educator enablement program takes the concepts and lessons developed by MLU and makes them more accessible to educators. It offers a year-round educator enablement program with lesson planning, course playbooks, and access to free compute resources.

Program Details
Educators are onboarded in small-group cohorts into bootcamps where they will learn the material and deep dive into how to teach it via instructor-led lectures and hands-on projects. Educators who complete the bootcamp can take part in different year-round development opportunities, such as a dedicated Slack channel to share teaching best practices, education topic series and virtual study sessions moderated by MLU instructors, and regional events for continued professional development. Also, they will receive continuing education credits and AWS-provided stipends.

Faculty and students get access to instructional material through Amazon SageMaker Studio Lab. SageMaker Studio Lab was announced last year and is AWS’s free (no credit card required) ML development environment. It provides computing and storage for anybody that wants to learn and experiment with ML. Institutions can unlock additional resources to support their ML programs by registering for AWS Academy. AWS Academy unlocks all the AWS services for a complete AI/ML program.

Community colleges and universities can integrate this educator enablement program into their computer science, information technology, and business curricula to create an AI/ML course, certificate, or degree. We have worked with educators and education boards such as Houston Community College to create content that is vetted for credit-worthy and degree-earning curricula.

In August 2022, we launched our first educator bootcamp in partnership with The Coding School. The bootcamp was delivered over two weeks, offering lectures, case studies, and hands-on projects. Twenty-five educators, representing 22 US community colleges and universities, completed the Educator Machine Learning Bootcamp.

Learn More and Join The Program
During 2023, AWS Machine Learning University will run six educator-enablement cohorts starting in January. The program will give priority consideration to educators at community colleges, MSIs, and HBCUs, in alignment with the program’s mission to increase access to AI/ML technology for historically underserved and underrepresented students.

If you are a computer science educator or part of a board of educators interested in fostering more depth in your computer science coursework, you should sign up for the educator enablement program.

Marcia

New for Amazon Redshift – Simplify Data Ingestion and Make Your Data Warehouse More Secure and Reliable

When we talk with customers, we hear that they want to be able to harness insights from data in order to make timely, impactful, and actionable business decisions. A common pattern with data-driven organizations is that they have many different data sources they need to ingest into their analytics systems. This requires them to build manual data pipelines spanning across their operational databases, data lakes, streaming data, and data within their warehouse. As a consequence of this complex setup, it can take data engineers weeks or even months to build data ingestion pipelines. These data pipelines are costly, and the delays can lead to missed business opportunities. Additionally, data warehouses are increasingly becoming mission critical systems that require high availability, reliability, and security.

Amazon Redshift is a fully managed petabyte-scale data warehouse used by tens of thousands of customers to easily, quickly, securely, and cost-effectively analyze all their data at any scale. This year at re:Invent, Amazon Redshift has announced a number of features to help you simplify data ingestion and get to insights easily and quickly, within a secure, reliable environment.

In this blog, I introduce some of these new features that fit into two main categories:

  • Simplify data ingestion
    • Amazon Redshift now supports auto-copy from Amazon S3 (available in preview). With this new capability, Amazon Redshift automatically loads the files that arrive in an Amazon Simple Storage Service (Amazon S3) location that you specify into your data warehouse. The files can use any of the formats supported by the Amazon Redshift copy command, such as CSV, JSON, Parquet, and Avro. In this way, you don’t need to manually or repeatedly run copy procedures. Amazon Redshift automates file ingestion and takes care of data-loading steps under the hood.
    • With Amazon Aurora zero-ETL integration with Amazon Redshift, you can use Amazon Redshift for near real-time analytics and machine learning on petabytes of transactional data stored on Amazon Aurora MySQL databases (available in limited preview). With this capability, you can choose the Amazon Aurora databases containing the data you want to analyze with Amazon Redshift. Data is then replicated into your data warehouse within seconds after transactional data is written into Amazon Aurora, eliminating the need to build and maintain complex data pipelines. You can replicate data from multiple Amazon Aurora databases into the same Amazon Redshift instance to run analytics across multiple applications. With near real-time access to transactional data, you can leverage Amazon Redshift’s analytics and capabilities, such as built-in machine learning (ML), materialized views, data sharing, and federated access to multiple data stores and data lakes, to derive insights from transactional and other data.
    • With the general availability of Amazon Redshift Streaming Ingestion, you can now natively ingest hundreds of megabytes of data per second from Amazon Kinesis Data Streams and Amazon MSK into an Amazon Redshift materialized view and query it in seconds. Learn more in this post.
  • Make your data warehouse more secure and reliable
    • You can now improve the availability of your data warehouse by choosing multiple Availability Zone (AZ) deployments. Multi-AZ deployments for your Amazon Redshift clusters are available in preview and reduce recovery times to seconds through automatic recovery. In this way, you can build solutions that are more compliant with the recommendations of the Reliability Pillar of the AWS Well-Architected Framework.
    • With dynamic data masking (available in preview), you can protect sensitive information stored in your data warehouse and ensure that only the relevant data is accessible by users based on their roles. You can limit how much identifiable data is visible to users by using multiple levels of policies, so different users and groups can have different levels of data access without having to create multiple copies of data. Dynamic data masking complements other granular access control capabilities in Amazon Redshift, including row-level and column-level security and role-based access controls. In this way, dynamic data masking helps you meet requirements for GDPR, CCPA, and other privacy regulations.
    • Amazon Redshift now supports central access controls for data sharing with AWS Lake Formation (available in public preview). You can now use Lake Formation to simplify governance of data shared from Amazon Redshift and centrally manage granular access across all data-sharing consumers.
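To make the idea behind role-based dynamic data masking more concrete, here is a small conceptual sketch in Python. It is not how Amazon Redshift implements the feature and it does not use Redshift syntax: the role names, the priorities, and the masking functions are all hypothetical. The point it illustrates is that a single stored value can be shown differently depending on the reader's roles, without keeping multiple copies of the data.

```python
def mask_full(value: str) -> str:
    """Hide every character."""
    return "X" * len(value)


def mask_partial(value: str) -> str:
    """Hide everything except the last four characters."""
    return "X" * max(len(value) - 4, 0) + value[-4:]


def no_mask(value: str) -> str:
    """Return the value unchanged."""
    return value


# Hypothetical policies: role -> (priority, masking function).
# When a user holds several roles, the highest priority wins (an assumption
# made for this sketch).
POLICIES = {
    "analyst": (10, mask_full),
    "support": (20, mask_partial),
    "auditor": (30, no_mask),
}


def read_column(value: str, roles: list) -> str:
    """Return the value as a user with the given roles would see it."""
    applicable = [POLICIES[role] for role in roles if role in POLICIES]
    if not applicable:
        return mask_full(value)  # default: hide everything
    _, policy = max(applicable, key=lambda item: item[0])
    return policy(value)


card = "4111111111111111"
for roles in (["analyst"], ["support"], ["analyst", "auditor"], []):
    print(roles, read_column(card, roles))
```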

There has been other interesting news for Amazon Redshift at re:Invent that you might have already heard about:

  • The general availability of Amazon Redshift integration for Apache Spark makes it easy to build and run Spark applications on Amazon Redshift and Redshift Serverless, opening up the data warehouse for a broader set of AWS analytics and machine learning solutions.
  • AWS Backup now supports Amazon Redshift. AWS Backup allows you to define a central backup policy to manage data protection of your applications and can also protect your Amazon Redshift clusters. In this way, you have a consistent experience when managing data protection across all supported services.

Availability and Pricing
Multi-AZ deployments, central access control for data sharing with AWS Lake Formation, auto-copy from Amazon S3, and dynamic data masking are available in preview in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm).

There is no additional cost for using auto-copy from Amazon S3 and near real-time analytics on transactional data. There is no extra charge for dynamic data masking and central access control for data sharing. For more information, see Amazon Redshift pricing.

These new capabilities take you one step further in analyzing all your data across data sources with simple data ingestion capabilities, while improving the security and reliability of your data warehouse.

Danilo

New — Introducing Support for Real-Time and Batch Inference in Amazon SageMaker Data Wrangler

To build machine learning models, machine learning engineers need to develop a data transformation pipeline to prepare the data. The process of designing this pipeline is time-consuming and requires cross-team collaboration between machine learning engineers, data engineers, and data scientists to implement the data preparation pipeline in a production environment.

The main objective of Amazon SageMaker Data Wrangler is to make data preparation and data processing workloads easy. With SageMaker Data Wrangler, customers can simplify the process of data preparation and complete all of the necessary steps of the data preparation workflow in a single visual interface. SageMaker Data Wrangler reduces the time it takes to prototype and deploy data processing workloads to production, so customers can easily integrate with MLOps production environments.

However, the transformations applied to the customer data for model training need to be applied to new data during real-time inference. Without support for SageMaker Data Wrangler in a real-time inference endpoint, customers need to write code to replicate the transformations from their flow in a preprocessing script.

Introducing Support for Real-Time and Batch Inference in Amazon SageMaker Data Wrangler
I’m pleased to share that you can now deploy data preparation flows from SageMaker Data Wrangler for real-time and batch inference. This feature allows you to reuse the data transformation flow which you created in SageMaker Data Wrangler as a step in Amazon SageMaker inference pipelines.

SageMaker Data Wrangler support for real-time and batch inference speeds up your production deployment because there is no need to repeat the implementation of the data transformation flow. You can now integrate SageMaker Data Wrangler with SageMaker inference. The same data transformation flows created with the easy-to-use, point-and-click interface of SageMaker Data Wrangler, containing operations such as Principal Component Analysis and one-hot encoding, will be used to process your data during inference. This means that you don’t have to rebuild the data pipeline for a real-time and batch inference application, and you can get to production faster.

Get Started with Real-Time and Batch Inference
Let’s see how to use the deployment support in SageMaker Data Wrangler. In this scenario, I have a flow inside SageMaker Data Wrangler. I need to integrate this flow into real-time and batch inference using the SageMaker inference pipeline.

First, I will apply some transformations to the dataset to prepare it for training.

I add one-hot encoding on the categorical columns to create new features.

Then, I drop any remaining string columns that cannot be used during training.

My resulting flow now has these two transform steps in it.

After I’m satisfied with the steps I have added, I can expand the Export to menu, and I have the option to export to SageMaker Inference Pipeline (via Jupyter Notebook).

I select Export to SageMaker Inference Pipeline, and SageMaker Data Wrangler prepares a fully customized Jupyter notebook to integrate the SageMaker Data Wrangler flow with inference. This generated Jupyter notebook performs a few important actions. First, it defines data processing and model training steps in a SageMaker pipeline. Next, it runs the pipeline to process my data with Data Wrangler and uses the processed data to train a model that will be used to generate real-time predictions. Then, it deploys my Data Wrangler flow and trained model to a real-time endpoint as an inference pipeline. Last, it invokes my endpoint to make a prediction.

This feature uses Amazon SageMaker Autopilot, which makes it easy for me to build ML models. I just need to provide the transformed dataset, which is the output of the SageMaker Data Wrangler step, and select the target column to predict. Amazon SageMaker Autopilot handles the rest, exploring various solutions to find the best model.

Using AutoML as a training step from SageMaker Autopilot is enabled by default in the notebook with the use_automl_step variable. When using the AutoML step, I need to define the value of target_attribute_name, which is the column of my data I want to predict during inference. Alternatively, I can set use_automl_step to False if I want to use the XGBoost algorithm to train a model instead.

On the other hand, if I would like to instead use a model I trained outside of this notebook, then I can skip directly to the Create SageMaker Inference Pipeline section of the notebook. Here, I would need to set the value of the byo_model variable to True. I also need to provide the value of algo_model_uri, which is the Amazon Simple Storage Service (Amazon S3) URI where my model is located. When training a model with the notebook, these values will be auto-populated.
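The three options above can be summarized in a few lines of Python. This is only a sketch of the decision logic as I described it, not code from the generated notebook: the variable names use_automl_step, target_attribute_name, byo_model, and algo_model_uri come from the notebook, while the function resolve_model_source, its return labels, and the example values are hypothetical.

```python
def resolve_model_source(use_automl_step, byo_model,
                         target_attribute_name=None, algo_model_uri=None):
    """Return which training path the given notebook settings select."""
    if byo_model:
        # A model trained outside the notebook must be referenced by its S3 URI.
        if not algo_model_uri or not algo_model_uri.startswith("s3://"):
            raise ValueError("byo_model=True requires algo_model_uri to be an S3 URI")
        return "bring-your-own-model"
    if use_automl_step:
        # The AutoML step needs to know which column to predict.
        if not target_attribute_name:
            raise ValueError("use_automl_step=True requires target_attribute_name")
        return "autopilot"
    return "xgboost"


# Example values are placeholders, not defaults taken from the notebook.
print(resolve_model_source(use_automl_step=True, byo_model=False,
                           target_attribute_name="Rating"))
print(resolve_model_source(use_automl_step=False, byo_model=False))
print(resolve_model_source(use_automl_step=False, byo_model=True,
                           algo_model_uri="s3://<YOUR_S3_BUCKET>/model.tar.gz"))
```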

In addition, this feature also saves a tarball inside the data_wrangler_inference_flows folder on my SageMaker Studio instance. This file is a modified version of the SageMaker Data Wrangler flow, containing the data transformation steps to be applied at the time of inference. It will be uploaded to S3 from the notebook so that it can be used to create a SageMaker Data Wrangler preprocessing step in the inference pipeline.

Next, the notebook creates two SageMaker model objects. The first is the SageMaker Data Wrangler model object, stored in the variable data_wrangler_model, and the second is the model object for the algorithm, stored in the variable algo_model. The data_wrangler_model object processes the incoming data and passes the result as input to algo_model for prediction.

The final step inside this notebook is to create a SageMaker inference pipeline model, and deploy it to an endpoint.

Once the deployment is complete, I will get an inference endpoint that I can use for prediction. With this feature, the inference pipeline uses the SageMaker Data Wrangler flow to transform the data from your inference request into a format that the trained model can use.
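Conceptually, an inference pipeline chains the preprocessing step and the model so that a raw request is transformed before it reaches the model. The following plain-Python sketch illustrates that chaining with the two transformations from my flow (one-hot encoding and dropping the remaining string columns). It does not use the SageMaker SDK, and the category list, the weights, and the function names are stand-ins I made up for illustration.

```python
# Hypothetical categories, assumed to have been learned at training time.
CATEGORIES = {"color": ["blue", "green", "red"]}


def data_wrangler_step(record: dict) -> list:
    """One-hot encode categorical columns and drop the remaining string columns."""
    features = []
    for key in sorted(record):
        value = record[key]
        if key in CATEGORIES:
            features.extend(1.0 if value == c else 0.0 for c in CATEGORIES[key])
        elif isinstance(value, (int, float)):
            features.append(float(value))
        # Any other string column is dropped.
    return features


def algo_step(features: list) -> float:
    """Stand-in for the trained model: a fixed linear score."""
    weights = [0.5, -0.25, 0.0, 1.0]
    return sum(w * x for w, x in zip(weights, features))


def inference_pipeline(record: dict) -> float:
    """Apply the preprocessing step, then pass its output to the model."""
    return algo_step(data_wrangler_step(record))


raw_request = {"color": "red", "price": 2.0, "note": "gift"}
print(data_wrangler_step(raw_request))
print(inference_pipeline(raw_request))
```

The caller only ever sends the raw record; the transformation happens inside the pipeline, which is the role the SageMaker Data Wrangler flow plays on the deployed endpoint.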

In the next section, I can run individual notebook cells in Make a Sample Inference Request. This is helpful if I need to do a quick check to see if the endpoint is working by invoking the endpoint with a single data point from my unprocessed data. Data Wrangler automatically places this data point into the notebook, so I don’t have to provide one manually.

Things to Know
Enhanced Apache Spark configuration: In this release of SageMaker Data Wrangler, you can now easily configure how Apache Spark partitions the output of your SageMaker Data Wrangler jobs when saving data to Amazon S3. When adding a destination node, you can set the number of partitions, corresponding to the number of files that will be written to Amazon S3, and you can specify column names to partition by, to write records with different values of those columns to different subdirectories in Amazon S3. You can also define the same configuration in the provided notebook.
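To picture how these two settings shape the output, here is a simplified model in plain Python. It is not Spark code and it does not reproduce how Spark assigns rows to partitions: the round-robin assignment, the file naming, and the function name plan_output_layout are assumptions for illustration. It does follow Spark's convention of writing one column=value subdirectory per distinct value of a partition column.

```python
from collections import defaultdict


def plan_output_layout(records, partition_by, num_partitions, base_path):
    """Group records by the output file they would land in (simplified model)."""
    layout = defaultdict(list)
    for index, record in enumerate(records):
        # One subdirectory level per partition column, named column=value.
        subdir = "/".join(f"{col}={record[col]}" for col in partition_by)
        # Assumption: rows are spread round-robin across the partitions.
        file_name = f"part-{index % num_partitions:05d}"
        path = "/".join(part for part in (base_path, subdir, file_name) if part)
        layout[path].append(record)
    return dict(layout)


rows = [
    {"region": "us", "id": 1},
    {"region": "eu", "id": 2},
    {"region": "us", "id": 3},
    {"region": "us", "id": 4},
]
layout = plan_output_layout(rows, ["region"], 2, "s3://<YOUR_S3_BUCKET>/output")
for path in sorted(layout):
    print(path, len(layout[path]))
```

In this model, each partition can write one file into every subdirectory it has rows for, so combining both settings can produce more files than the partition count alone.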

You can also define memory configurations for SageMaker Data Wrangler processing jobs as part of the Create job workflow. You will find similar configuration as part of your notebook.

Availability: SageMaker Data Wrangler support for real-time and batch inference, as well as the enhanced Apache Spark configuration for data processing workloads, is generally available in all AWS Regions that Data Wrangler currently supports.

To get started with Amazon SageMaker Data Wrangler support for real-time and batch inference deployment, visit the AWS documentation.

Happy building
— Donnie

New — Amazon SageMaker Data Wrangler Supports SaaS Applications as Data Sources

Data fuels machine learning. In machine learning, data preparation is the process of transforming raw data into a format that is suitable for further processing and analysis. The common process for data preparation starts with collecting data, then cleaning it, labeling it, and finally validating and visualizing it. Getting high-quality data can often be a complex and time-consuming process.

This is why customers who build machine learning (ML) workloads on AWS appreciate the capabilities of Amazon SageMaker Data Wrangler. With SageMaker Data Wrangler, customers can simplify the process of data preparation and complete the required steps of the data preparation workflow in a single visual interface. Amazon SageMaker Data Wrangler helps to reduce the time it takes to aggregate and prepare data for ML.

However, due to the proliferation of data, customers generally have data spread across multiple systems, including external software-as-a-service (SaaS) applications like SAP OData for manufacturing data, Salesforce for customer pipeline, and Google Analytics for web application data. To solve business problems using ML, customers have to bring all of these data sources together. They currently have to build their own solution or use third-party solutions to ingest data into Amazon S3 or Amazon Redshift. These solutions can be complex to set up and are often not cost-effective.

Introducing Amazon SageMaker Data Wrangler Support for SaaS Applications as Data Sources
I’m happy to share that starting today, you can aggregate external SaaS application data in Amazon SageMaker Data Wrangler to prepare it for ML. With this feature, you can use more than 40 SaaS applications as data sources via Amazon AppFlow and make this data available in Amazon SageMaker Data Wrangler. Once the data sources are registered in AWS Glue Data Catalog by AppFlow, you can browse tables and schemas from these data sources using the Data Wrangler SQL explorer. This feature provides seamless data integration between SaaS applications and SageMaker Data Wrangler using Amazon AppFlow.

Here is a quick preview of this new feature:

This new feature of Amazon SageMaker Data Wrangler works through its integration with Amazon AppFlow, a fully managed integration service that enables you to securely exchange data between SaaS applications and AWS services. With Amazon AppFlow, you can establish bidirectional data integration between SaaS applications, such as Salesforce, SAP, and Amplitude, and supported AWS services, and bring the data into Amazon S3 or Amazon Redshift.

Then, with Amazon AppFlow, you can catalog the data in AWS Glue Data Catalog. This is a new Amazon AppFlow feature that integrates the Amazon S3 destination connector with AWS Glue Data Catalog. With this new integration, customers can catalog SaaS application data in AWS Glue Data Catalog with a few clicks, directly from the Amazon AppFlow flow configuration, without the need to run any crawlers.

Once you’ve established a flow and cataloged its output in the AWS Glue Data Catalog, you can use this data inside Amazon SageMaker Data Wrangler. Then, you can do the data preparation as you usually do. You can write Amazon Athena queries to preview data, join data from multiple sources, or import data to prepare for ML model training.

With this feature, a few simple steps are all you need to integrate data from SaaS applications into Amazon SageMaker Data Wrangler via Amazon AppFlow. This integration supports more than 40 SaaS applications. For a complete list of supported applications, see the Supported source and destination applications documentation.

Get Started with Amazon SageMaker Data Wrangler Support for Amazon AppFlow
Let’s see how this feature works in detail. In my scenario, I need to get data from Salesforce, and do the data preparation using Amazon SageMaker Data Wrangler.

To start using this feature, the first thing I need to do is to create a flow in Amazon AppFlow that registers the data source into the AWS Glue Data Catalog. I already have an existing connection with my Salesforce account, and all I need now is to create a flow.

One important thing to note is that to make SaaS application data available in Amazon SageMaker Data Wrangler, I need to create a flow with Amazon S3 as the destination. Then, I need to enable Create a Data Catalog table in the AWS Glue Data Catalog settings. This option will automatically catalog my Salesforce data into AWS Glue Data Catalog.

On this page, I need to select a user role with the required AWS Glue Data Catalog permissions and define the database name and the table name prefix. In this section, I can also define the data format preference (JSON, CSV, or Apache Parquet) and the filename preference, if I want to add a timestamp to the file name.
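For readers who prefer to script this step, here is a minimal Python sketch of the two settings described above: an Amazon S3 destination and the AWS Glue Data Catalog options (role, database name, table prefix). It only builds plain dictionaries. The field names follow my understanding of the request shape for the boto3 AppFlow create_flow call and should be verified against the API reference. The bucket name, role ARN, and table prefix are hypothetical placeholders.

```python
ALLOWED_FILE_TYPES = {"CSV", "JSON", "PARQUET"}


def build_catalog_flow_settings(bucket_name, role_arn, database_name,
                                table_prefix, file_type="PARQUET",
                                bucket_prefix=""):
    """Build the destination and catalog parts of a flow definition.

    Assumption: these keys mirror the boto3 AppFlow create_flow request.
    The result is partial; a real flow also needs a name, a trigger,
    a source configuration, and field-mapping tasks.
    """
    if file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(f"file_type must be one of {sorted(ALLOWED_FILE_TYPES)}")

    # Amazon S3 must be the destination for the data to be cataloged.
    destination = {
        "connectorType": "S3",
        "destinationConnectorProperties": {
            "S3": {
                "bucketName": bucket_name,
                "bucketPrefix": bucket_prefix,
                "s3OutputFormatConfig": {"fileType": file_type},
            }
        },
    }

    # Equivalent of enabling "Create a Data Catalog table" in the console.
    catalog = {
        "glueDataCatalog": {
            "roleArn": role_arn,
            "databaseName": database_name,
            "tablePrefix": table_prefix,
        }
    }

    return {
        "destinationFlowConfigList": [destination],
        "metadataCatalogConfig": catalog,
    }


# Hypothetical values, for illustration only.
settings = build_catalog_flow_settings(
    bucket_name="example-appflow-bucket",
    role_arn="arn:aws:iam::123456789012:role/ExampleAppFlowGlueRole",
    database_name="appflowdatasourcedb",
    table_prefix="salesforce",
)
```

The returned dictionary could then be merged with the remaining flow arguments before calling create_flow on a boto3 AppFlow client. That call is left out here because it needs credentials and an existing Salesforce connection.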

To learn more about how to register SaaS data in Amazon AppFlow and AWS Glue Data Catalog, you can read the Cataloging the data output from an Amazon AppFlow flow documentation page.

Once I’ve finished registering SaaS data, I need to make sure the IAM role can view the data sources in Data Wrangler from AppFlow. Here is an example of a policy in the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
} 
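If you manage permissions from code rather than the console, the same policy can be generated and attached as an inline policy. The following is a minimal sketch. It assumes you pass in a boto3 IAM client, and the role and policy names in the commented usage are hypothetical.

```python
import json


def build_glue_search_policy(region="*", account_id="*"):
    """Return the policy document shown above as a Python dict.

    The defaults reproduce the wildcard ARNs from the example; pass a
    Region and account ID to narrow the scope.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "glue:SearchTables",
                "Resource": [
                    f"arn:aws:glue:{region}:{account_id}:table/*/*",
                    f"arn:aws:glue:{region}:{account_id}:database/*",
                    f"arn:aws:glue:{region}:{account_id}:catalog",
                ],
            }
        ],
    }


def attach_inline_policy(iam_client, role_name, policy_name, policy):
    """Attach the policy to a role as an inline policy.

    iam_client is expected to be a boto3 IAM client (an assumption of
    this sketch); put_role_policy takes the document as a JSON string.
    """
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


policy = build_glue_search_policy()

# Usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   attach_inline_policy(boto3.client("iam"), "ExampleDataWranglerRole",
#                        "GlueSearchTablesForAppFlow", policy)
```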

By enabling data cataloging with AWS Glue Data Catalog, from this point on, Amazon SageMaker Data Wrangler will be able to automatically discover this new data source and I can browse tables and schema using the Data Wrangler SQL Explorer.

Now it’s time to switch to the Amazon SageMaker Data Wrangler dashboard and select Connect to data sources.

On the following page, I need to select Create connection and choose the data source I want to import. In this section, I can see all the connections available to me. Here I see that the Salesforce connection is already available.

If I would like to add more data sources, I can see a list of external SaaS applications that I can integrate in the Set up new data sources section. To learn how to register external SaaS applications as data sources, I can select How to enable access.

Now I will import datasets and select the Salesforce connection.

On the next page, I can define connection settings and import data from Salesforce. When I’m done with this configuration, I select Connect.

On the following page, I see the Salesforce data that I already configured with Amazon AppFlow and AWS Glue Data Catalog, listed under appflowdatasourcedb. I can also see a table preview and the schema, so I can check that this is the data I need.

Then, I start building my dataset using this data by performing SQL queries inside the SageMaker Data Wrangler SQL Explorer. Then, I select Import query.
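The same kind of query can also be run outside the console against the cataloged tables. Here is a minimal sketch that starts an Amazon Athena query and waits for it to finish. It assumes a boto3 Athena client is passed in. The table name salesforce_account and the S3 output location are hypothetical, and only the database name comes from this walkthrough.

```python
import time

# Hypothetical preview query; replace the table with one from your catalog.
PREVIEW_SQL = 'SELECT * FROM "appflowdatasourcedb"."salesforce_account" LIMIT 10'

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query and poll until Athena reports a terminal state.

    athena is expected to be a boto3 Athena client (an assumption of
    this sketch). Returns the query execution ID and the final state.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials; the bucket is hypothetical):
#   import boto3
#   run_athena_query(boto3.client("athena"), PREVIEW_SQL,
#                    "appflowdatasourcedb", "s3://example-bucket/athena-results/")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.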

Then, I define a name for my dataset.

At this point, I can start the data preparation process. I can navigate to the Analysis tab to run the data insight report. The analysis gives me a report on data quality issues and on which transforms to apply next to fix them, based on the ML problem I want to solve. To learn more about how to use the data analysis feature, see the Accelerate data preparation with data quality and insights in Amazon SageMaker Data Wrangler blog post.

In my case, there are several columns I don’t need, and I need to drop these columns. I select Add step.

One feature I like is that Amazon SageMaker Data Wrangler provides numerous ML data transforms. It helps me streamline cleaning, transforming, and feature engineering my data in one dashboard. For more about the transforms that SageMaker Data Wrangler provides, please read the Transform Data documentation page.

In this list, I select Manage columns.

Then, in the Transform section, I select the Drop column option. Then, I select a few columns that I don’t need.

Once I’m done, the columns I don’t need are removed and the Drop column data preparation step I just created is listed in the Add step section.

I can also see a visual view of my data flow in Amazon SageMaker Data Wrangler. In this example, my data flow is quite basic. But when my data preparation process becomes complex, this visual view makes it easy for me to see all the data preparation steps.

From this point on, I can do what I require with my Salesforce data. For example, I can export data directly to Amazon S3 by selecting Export to and choosing Amazon S3 from the Add destination menu. In my case, I configure Data Wrangler to store the processed data in Amazon S3 by selecting Add destination and then Amazon S3.

Amazon SageMaker Data Wrangler gives me the flexibility to automate the same data preparation flow using scheduled jobs. I can also automate feature engineering with SageMaker Pipelines (via Jupyter Notebook) and SageMaker Feature Store (via Jupyter Notebook), and deploy to an inference endpoint with SageMaker Inference Pipeline (via Jupyter Notebook).

Things to Know
Related news – This feature makes it easy for you to aggregate and prepare data with Amazon SageMaker Data Wrangler. As this feature is an integration with Amazon AppFlow and AWS Glue Data Catalog, you might want to read the Amazon AppFlow now supports AWS Glue Data Catalog integration and provides enhanced data preparation page.

Availability – Amazon SageMaker Data Wrangler support for SaaS applications as data sources is available in all the Regions currently supported by Amazon AppFlow.

Pricing – There is no additional cost to use SaaS application support in Amazon SageMaker Data Wrangler, but there is a cost to running Amazon AppFlow to get the data into Amazon SageMaker Data Wrangler.

Visit the Import Data From Software as a Service (SaaS) Platforms documentation page to learn more about this feature, and follow the getting started guide to start aggregating and preparing SaaS application data with Amazon SageMaker Data Wrangler.

Happy building!
Donnie

Announcing Additional Data Connectors for Amazon AppFlow

Gathering insights from data is a more effective process if that data isn’t fragmented across multiple systems and data stores, whether on premises or in the cloud. Amazon AppFlow provides bidirectional data integration between on-premises systems and applications, SaaS applications, and AWS services. It helps customers break down data silos using a low- or no-code, cost-effective solution that’s easy to reconfigure in minutes as business needs change.

Today, we’re pleased to announce the addition of 22 new data connectors for Amazon AppFlow, including:

  • Marketing connectors (e.g., Facebook Ads, Google Ads, Instagram Ads, LinkedIn Ads).
  • Connectors for customer service and engagement (e.g., MailChimp, Sendgrid, Zendesk Sell or Chat, and more).
  • Connectors for business operations (Stripe, QuickBooks Online, and GitHub).

In total, Amazon AppFlow now supports over 50 integrations with SaaS applications and AWS services. This growing set of connectors can be combined to give you 360-degree visibility across the data your organization generates. For instance, you could combine CRM (Salesforce), e-commerce (Stripe), and customer service (ServiceNow, Zendesk) data to build integrated analytics and predictive modeling that can guide your next best offer decisions and more. Combining web (Google Analytics v4) and social (Facebook Ads, Instagram Ads) data allows you to build comprehensive reporting for your marketing investments, helping you understand how customers are engaging with your brand. Or, sync ERP data (SAP S/4HANA) with custom order management applications running on AWS. For more information on the current range of connectors and integrations, visit the Amazon AppFlow integrations page.

Datasource connectors for Amazon AppFlow

Amazon AppFlow and AWS Glue Data Catalog
Amazon AppFlow has also recently announced integration with the AWS Glue Data Catalog to automate the preparation and registration of your SaaS data into the AWS Glue Data Catalog. Previously, customers using Amazon AppFlow to store data from supported SaaS applications into Amazon Simple Storage Service (Amazon S3) had to manually create and run AWS Glue Crawlers to make their data available to other AWS services such as Amazon Athena, Amazon SageMaker, or Amazon QuickSight. With this new integration, you can populate AWS Glue Data Catalog with a few clicks directly from the Amazon AppFlow configuration without the need to run any crawlers.

To simplify data preparation and improve query performance when using analytics engines such as Amazon Athena, Amazon AppFlow also now enables you to organize your data into partitioned folders in Amazon S3. Amazon AppFlow also automates the aggregation of records into files that are optimized to the size you specify. This increases performance by reducing processing overhead and improving parallelism.

You can find more information on the AWS Glue Data Catalog integration in the recent What’s New post.

Getting Started with Amazon AppFlow
Visit the Amazon AppFlow product page to learn more about the service and view all the available integrations. To help you get started, there’s also a variety of videos and demos available and some sample integrations available on GitHub. And finally, should you need a custom integration, try the Amazon AppFlow Connector SDK, detailed in the Amazon AppFlow documentation. The SDK enables you to build your own connectors to securely transfer data between your custom endpoint, application, or other cloud service and Amazon AppFlow’s library of managed SaaS and AWS connectors.

— Steve

Improved meeting quality when joining on virtual machines

What’s changing

If you use a Virtual Desktop Infrastructure (VDI) such as Citrix or VMware to join Google Meet calls, you’ll notice an increase in video and audio quality. Meet will now detect whether you’re joining from a VDI and automatically adjust for the best performance. This optimization will also help cut down on the demand put on your VDIs, such as CPU, GPU, and memory usage, helping improve meeting quality and overall performance.

Getting started 

  • Admins: Ensure Google Meet can detect that it’s running inside a virtual machine (VM) by enabling the Enterprise Hardware Platform API policy in Chrome. Visit the API page and the Help Center to learn more about how to set Chrome policies for users or browsers and use VDIs with Google Meet.
  • End users: There is no end user action required — Google Meet will automatically optimize your experience when using a VDI. 

Rollout pace 

Availability 

  • Available to all Google Workspace customers, as well as legacy G Suite Basic and Business customers 

Resources