Things to Know About Serverless SQL Pool in Azure Synapse Analytics

Folks often ask me, “Is there an appropriate service for ad-hoc, serverless queries against unstructured data (flat files) that scales automatically, such as Amazon Athena or Google BigQuery?”
Unfortunately, the proven Azure Databricks doesn’t have a corresponding ad-hoc alternative (a “cluster pool” in Databricks is similar, but its compute resources must still be provisioned beforehand), and Azure Data Lake Analytics no longer seems to be a focus of current Azure improvements.
If that’s you, you’ll be interested in serverless SQL pool (formerly, SQL on-demand) in Azure Synapse Analytics.

Using serverless SQL pool (the “Built-in” pool) in Azure Synapse Analytics, you can immediately run queries against CSV, TSV, Parquet, and JSON files without preparing or running dedicated compute resources. The system automatically adjusts resources based on your requirements, freeing you from managing infrastructure and picking the right size for your solution.
When your workload processes only occasional requests (mostly sitting idle), such as log analytics or occasional business reports, this will help you save money.

Getting Started

Before using Azure Synapse Analytics, create an Azure storage account resource and upload your files into a blob container.
In this tutorial, I use the flight-weather Parquet dataset from the Azure Databricks hands-on.

Now, create an Azure Synapse Analytics resource (workspace) in the Azure Portal and launch Synapse Studio.

First, click the “Develop” menu in the left navigation and create a new script file.

As you will notice, the default attached compute pool is the pre-built pool called “Built-in” (formerly, “SQL on-demand”), because we haven’t provisioned any other pools. (See below.) This is the serverless SQL pool, so you don’t need to change it.
At this stage, the attached database will be the “master” database, which is the default database in an Azure Synapse Analytics workspace.

By default, serverless SQL pool tries to access your blob storage (including Data Lake Storage) using your Azure Active Directory identity.
However, you should know that you might experience slower performance with Azure AD pass-through.

In this tutorial, we run a serverless query using a SAS token, without AAD pass-through. (Later I’ll show you how to connect remotely with AAD pass-through.)

First, generate a SAS token in your storage account. (Click the “Shared access signature” menu and create a new SAS token in your storage account.)

For security reasons, you cannot create a new database-scoped credential in the default master database.
Therefore, run the following script to create a new database in your Synapse workspace.

CREATE DATABASE mydb01

In your script editor, change the database setting (see below) to your new database.

Now let’s create a new credential named “sqlondemand” (which holds the SAS token) for accessing your blob storage from the database, as follows.

-- Set master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys) BEGIN
  declare @password nvarchar(400) = CAST(newid() as NVARCHAR(400));
  EXEC('CREATE MASTER KEY ENCRYPTION BY PASSWORD = ''' + @password + '''')
END

-- Create credential for blob
IF EXISTS
   (SELECT * FROM sys.database_scoped_credentials
   WHERE name = 'sqlondemand')
   DROP DATABASE SCOPED CREDENTIAL [sqlondemand]
GO
CREATE DATABASE SCOPED CREDENTIAL [sqlondemand]
WITH IDENTITY='SHARED ACCESS SIGNATURE',  
SECRET = 'sv=2019-10-10&ss=bfqt&srt...'  --fill your storage SAS here
GO

Create a data source for your storage account.
As you can see below, I’m setting “sqlondemand” as the CREDENTIAL, which was created in the previous script.

CREATE EXTERNAL DATA SOURCE DemoStorage WITH (
  LOCATION = 'https://demostore01.blob.core.windows.net',
  CREDENTIAL = sqlondemand
);
GO

Now you can run serverless queries as follows! Queries in serverless SQL pool are written in T-SQL (not PySpark or Spark SQL).

Here we query all Parquet files under container01/flight_weather_parquet in the registered data source (DemoStorage) and simply fetch the top 10 results.
When you retrieve a large amount of data in a notebook with a Spark pool (or Databricks), it responds quickly, because the data is loaded incrementally behind a paginated UI. In T-SQL tools, however, retrieving a very large result set takes a long time, because all the data is loaded at once. (My sample data contains approximately 2,000,000 rows; you can check the record count with COUNT_BIG(*), as shown after the example below.) So please filter down to only the data you need, as follows. (If the number of columns is also large, select only the required columns.)

SELECT TOP 10 *
FROM
  OPENROWSET(
    BULK 'container01/flight_weather_parquet/*.parquet',
    DATA_SOURCE = 'DemoStorage',
    FORMAT='PARQUET'
  ) AS flight_weather
GO
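
For reference, you can check how many rows the dataset contains before pulling data into the client by running an aggregate like the following. (A minimal sketch against the same DemoStorage data source; the output column alias is arbitrary.)

SELECT COUNT_BIG(*) AS row_count
FROM
  OPENROWSET(
    BULK 'container01/flight_weather_parquet/*.parquet',
    DATA_SOURCE = 'DemoStorage',
    FORMAT='PARQUET'
  ) AS flight_weather
GO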

As you may know, Synapse Analytics uses a local cache to improve performance, and this behavior is the same for serverless SQL pool. Once the cache is warmed, queries run faster until the cache is invalidated. (You can see this by running the same query repeatedly.)

To reuse the same query, you can also create a view as follows.
Note that materialized views are not supported in serverless SQL pool, because it has no local storage; only metadata objects are stored in serverless SQL pool.

CREATE VIEW FlightBasic
AS SELECT YEAR, MONTH, UNIQUE_CARRIER, ORIGIN, DEST
FROM
  OPENROWSET(
    BULK 'container01/flight_weather_parquet/*.parquet',
    DATA_SOURCE = 'DemoStorage',
    FORMAT='PARQUET'
  ) AS flight_weather
GO
SELECT TOP 10 * FROM FlightBasic
GO

Credentials for Data Source

In the tutorial above, we ran a query with a SAS token credential.
By default, serverless SQL pool tries to access the files using your Azure Active Directory identity. As I mentioned above, Azure AD pass-through will give you slower performance than a SAS token credential. Still, it’s sometimes useful, since it’s simple and gives you flexible access control.
Here I show you how to connect with Azure AD pass-through.

In order to access the storage account with your AAD credential, you should assign the ‘Storage Blob Data Contributor’ role to yourself on the storage account resource. (You can add a role assignment by clicking the “Access control (IAM)” menu in the blade of your storage account resource. See below.)

Now it’s ready.
You are presumably signed in to the Azure Portal (and Synapse Studio) with your own credential, so there’s no need to create a scoped credential. (You also don’t need to create a new database.)
You can use the default “master” database and run the following script.

SELECT TOP 10 *
FROM OPENROWSET(
  BULK 'https://demostore01.blob.core.windows.net/container01/flight_weather_parquet/*.parquet',
  FORMAT='PARQUET'
) AS flight_weather

As you saw above, AAD pass-through is well suited to quick, ad-hoc exploration in the UI (Synapse Studio or other management tools), since there’s no need to prepare any database objects.

Connect Programmatically

Serverless SQL pool speaks familiar T-SQL over the standard SQL Server protocol.
Thus you can run serverless queries in the same manner as with SQL Server or Azure SQL Database.

For remote connections, both SQL authentication and Azure AD authentication are supported in Synapse Analytics. (There are also two administrative accounts: a server admin and an Active Directory admin.)
When you use AAD authentication for the remote connection, your credential for accessing the files is also consistently passed through.
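
For example, if you prefer AAD authentication for the remote connection, you could create an Azure AD login and user instead of a SQL login. The following is only a minimal sketch; the account name someuser@contoso.com is hypothetical.

-- Run in the master database : create a login from Azure AD (hypothetical account)
CREATE LOGIN [someuser@contoso.com] FROM EXTERNAL PROVIDER;
GO
-- Run in your own database (mydb01) : create a user for this login
CREATE USER [someuser@contoso.com] FROM LOGIN [someuser@contoso.com];
GO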

Let’s see a brief example. In this tutorial, we use SQL authentication.

First, copy the serverless SQL endpoint (formerly, the “SQL on-demand endpoint”) of your Synapse Analytics workspace from the resource blade in the Azure Portal.

Next, create a login with a username and password for SQL authentication. (Run the following script in the master database.)

CREATE LOGIN demouser01 WITH PASSWORD = 'P@ssw0rd0001';

Change the database setting in the script to your own database (mydb01), and run the following script to create a user in that database (mydb01).

CREATE USER demouser01 FROM LOGIN demouser01;

To allow this user to use the previously created “sqlondemand” credential, run the following script to grant permissions.

GRANT CONTROL ON DATABASE SCOPED CREDENTIAL::sqlondemand TO demouser01

Now you can write application code to connect and run serverless queries in Azure Synapse Analytics!

For instance, the following invokes a serverless query using JDBC in Scala. (Try the following code in Azure Databricks.)

val jdbcHostname = "myws-ondemand.sql.azuresynapse.net"
val jdbcPort = 1433
val jdbcDatabase = "mydb01"
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"

import java.util.Properties
val connectionProperties = new Properties()

val jdbcUsername = "demouser01"
val jdbcPassword = "P@ssw0rd0001"
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

val pushdown_query = "(SELECT TOP 10 * FROM OPENROWSET(BULK 'container01/flight_weather_parquet/*.parquet', DATA_SOURCE = 'DemoStorage', FORMAT='PARQUET') AS flight_weather) top10_flight_list"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)

The following invokes the serverless query using PowerShell. (Sorry, it uses the classic disconnected-style data access with a .NET DataSet object.)

$connStr = "Data Source=myws-ondemand.sql.azuresynapse.net;database=mydb01;User ID=demouser01;Password=P@ssw0rd0001"
$conn = New-Object System.Data.SqlClient.SqlConnection $connStr
$conn.Open()

$cmd = New-Object System.Data.SqlClient.SqlCommand
$cmd.Connection = $conn
$cmd.CommandText = "SELECT TOP 10 * FROM OPENROWSET(BULK 'container01/flight_weather_parquet/*.parquet', DATA_SOURCE = 'DemoStorage', FORMAT='PARQUET') AS flight_weather"

$adp = New-Object System.Data.SqlClient.SqlDataAdapter $cmd
$data = New-Object System.Data.DataSet
$adp.Fill($data)

# show result
$data.Tables

$conn.Close()

You can invoke serverless queries from various applications, such as Excel, Power BI, and so on.
Like other SQL-based databases, you can also manage the database using Azure Data Studio or SQL Server Management Studio.

Elasticity

Serverless SQL pool is a distributed data processing system, and queries against blob storage are elastically executed on backend Synapse compute resources.

I won’t describe the details of this mechanism here; see “Democratizing the Data Lake with On-Demand Capabilities in Azure Synapse Analytics” from Microsoft Ignite 2019.

(From: “Democratizing the Data Lake with On-Demand Capabilities in Azure Synapse Analytics”, Microsoft Ignite 2019)

Supported File Formats and Concerns

Currently, the CSV (including TSV), Apache Parquet, and JSON (semi-structured) formats are supported in serverless SQL pool.
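
For instance, JSON files can be queried by reading each document as a single text column and then parsing it with JSON functions such as JSON_VALUE or OPENJSON. The following is a minimal sketch, assuming hypothetical line-delimited JSON files (with date_rep and cases properties) under a json folder in the same container; adjust the path and property names for your own data.

SELECT
  JSON_VALUE(doc, '$.date_rep') AS date_reported,
  JSON_VALUE(doc, '$.cases') AS cases
FROM
  OPENROWSET(
    BULK 'container01/json/*.json',
    DATA_SOURCE = 'DemoStorage',
    FORMAT = 'CSV',
    FIELDQUOTE = '0x0b',
    FIELDTERMINATOR = '0x0b'
  )
  WITH (doc NVARCHAR(MAX)) AS rows
GO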

Serverless SQL pool also allows you to query data in Azure Cosmos DB through Azure Synapse Link. In this post, however, we focus only on data sources for unstructured data (flat files) in blob storage.

From a performance perspective, Apache Parquet (a columnar format) is recommended, but there is another reason to use Apache Parquet with serverless SQL pool.

As you saw in the tutorials above, the schema of the underlying files can be auto-detected (inferred) by serverless SQL pool, similar to spark.read() in Apache Spark. Currently, however, this schema inference works only for the Parquet format.
For instance, when you use CSV, you must specify all columns in a schema description with the WITH clause in OPENROWSET. (See below.)
My sample data (see above) has approximately 60 columns. Imagine having to describe every column name and type explicitly, with no schema inference; that would be quite cumbersome!

SELECT *
FROM OPENROWSET (
  BULK 'https://demostore01.blob.core.windows.net/container01/csv',
  FORMAT = 'CSV',
  FIELDTERMINATOR =',',
  ROWTERMINATOR = '\n'
)
WITH (
  [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
  [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
  [year] smallint,
  [population] bigint
) AS [r]
WHERE
  country_name = 'Luxembourg' AND year = 2017
GO

However, specifying the schema explicitly can be better for performance. (Schema inference should also be avoided in production when using the Spark API.)
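
For example, even with Parquet you can skip the inference step and read only the columns you need by adding a WITH clause. This is just a sketch; the data types below are my assumptions about the sample dataset, so adjust them to your actual schema.

SELECT TOP 10 *
FROM
  OPENROWSET(
    BULK 'container01/flight_weather_parquet/*.parquet',
    DATA_SOURCE = 'DemoStorage',
    FORMAT='PARQUET'
  )
  WITH (
    [YEAR] smallint,
    [MONTH] smallint,
    [UNIQUE_CARRIER] VARCHAR (10) COLLATE Latin1_General_BIN2,
    [ORIGIN] VARCHAR (10) COLLATE Latin1_General_BIN2,
    [DEST] VARCHAR (10) COLLATE Latin1_General_BIN2
  ) AS flight_weather
GO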

Another important note concerns Delta Lake. When this post was first written, Delta Lake was not supported in serverless SQL pool (though it was supported in dedicated SQL pool on Synapse Analytics), and there was a feedback request asking for this support.
Update: you can now query data in Delta Lake format with serverless SQL on Azure Synapse Analytics, and this capability is generally available.
Suppose your team explores (experiments with) data in Azure Databricks and serves presentations from serverless SQL pool in Azure Synapse Analytics. (Databricks says that over 75% of its users now use Delta Lake.) Previously you could not handle the Delta Lake format directly in serverless SQL pool in such a case, but with this update you can.

As you may know, Delta Lake is built on Parquet with transaction (journal) logs and indexes/state. (See “Exercise 9 : Delta Lake” in the Azure Databricks tutorial.)
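
With this update, querying a Delta Lake folder looks much like querying Parquet, except that BULK points at the root folder of the Delta table and FORMAT='DELTA' is used. The following is a minimal sketch; the flight_weather_delta folder is hypothetical.

SELECT TOP 10 *
FROM
  OPENROWSET(
    BULK 'container01/flight_weather_delta/',
    DATA_SOURCE = 'DemoStorage',
    FORMAT='DELTA'
  ) AS flight_weather_delta
GO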

Supported T-SQL and Concerns

You can use more advanced queries in serverless SQL pool, such as GROUP BY, ORDER BY, querying nested columns, and so on. You can also export query results to Azure Blob Storage or Azure Data Lake Storage Gen2 using CETAS (CREATE EXTERNAL TABLE AS SELECT) statements, and a subset of DDL statements is also supported in serverless SQL pool. (See the sketch below.)
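
Here is a sketch of CETAS. The external file format name, output folder, and aggregation are my own assumptions, and the credential behind the data source (the “sqlondemand” SAS token) must allow write access to the target location.

-- Define an external file format for the output (the name is arbitrary)
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
GO
-- Export an aggregated result as Parquet files into the registered data source
CREATE EXTERNAL TABLE FlightSummary
WITH (
  LOCATION = 'container01/output/flight_summary/',
  DATA_SOURCE = DemoStorage,
  FILE_FORMAT = ParquetFormat
)
AS
SELECT YEAR, MONTH, COUNT_BIG(*) AS flight_count
FROM
  OPENROWSET(
    BULK 'container01/flight_weather_parquet/*.parquet',
    DATA_SOURCE = 'DemoStorage',
    FORMAT='PARQUET'
  ) AS flight_weather
GROUP BY YEAR, MONTH;
GO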

However, remember that not all T-SQL operations are supported in serverless SQL pool, for architectural reasons.
For instance, T-SQL in a Synapse Analytics dedicated SQL pool supports the PREDICT() function (see here), with which you can score data using a trained ONNX model from Azure Machine Learning. (A sketch of the dedicated-pool syntax is shown below.) But this cannot be used in serverless SQL pool, because serverless SQL pool has no local storage and the model binary (which must be stored in a table) is not reachable.
DML statements are also not currently supported in serverless SQL pool.
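
For reference, a PREDICT() call in a dedicated SQL pool looks roughly like the sketch below; the models table, column names, and score type are hypothetical, and the ONNX model binary is assumed to be stored in a table column.

-- Dedicated SQL pool only : score rows with an ONNX model stored in a table (hypothetical names)
SELECT d.*, p.Score
FROM PREDICT(
    MODEL = (SELECT model FROM dbo.Models WHERE model_name = 'flight_delay_model'),
    DATA = dbo.FlightData AS d,
    RUNTIME = ONNX)
  WITH (Score FLOAT) AS p;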

For details about supported T-SQL, see the official documentation, “Serverless SQL pool in Azure Synapse Analytics”.

Private Endpoint for Serverless SQL pool (Networking)

Sometimes you might need to connect to the serverless SQL endpoint through a private endpoint for security reasons.
You can configure a private endpoint for Azure Synapse serverless SQL pool using the “Private endpoint connections” menu in the workspace blade in the Azure Portal. (See below.)

Once you have configured a private endpoint in your own virtual network (VNet), you can connect to this endpoint from connected networks, including on-premises networks, through the gateway in the VNet.

When you create a Synapse workspace with managed virtual network enabled, a private endpoint for serverless SQL pool is automatically generated in this managed network. This endpoint is used only inside the managed network of the Synapse workspace. (See the “Manage” menu in Synapse Studio, as shown below.)

Performance Optimization

Unfortunately, serverless SQL doesn’t have rich DMVs like dedicated SQL in Synapse does. (See here for performance optimization in Synapse dedicated SQL.)
However, there are several tips and best practices for performance optimization in Synapse serverless SQL, concerning file size, partitioning, data types, and so on.
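
For example, if your Parquet files are laid out in partition folders (the year=*/month=* layout below is an assumption; the sample dataset in this post is not organized this way), you can prune partitions with the filepath() function so that only the matching folders are read.

SELECT UNIQUE_CARRIER, COUNT_BIG(*) AS flight_count
FROM
  OPENROWSET(
    BULK 'container01/flight_weather_parquet/year=*/month=*/*.parquet',
    DATA_SOURCE = 'DemoStorage',
    FORMAT='PARQUET'
  ) AS flight_weather
WHERE flight_weather.filepath(1) = '2012'  -- value matched by the first wildcard (year)
  AND flight_weather.filepath(2) = '1'     -- value matched by the second wildcard (month)
GROUP BY UNIQUE_CARRIER
GO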

Please see the best practices guide here to get the best performance in serverless SQL.
