People often ask me, “Is there an appropriate service for ad-hoc, serverless queries against unstructured data (flat files) that scales automatically, like Amazon Athena or Google BigQuery?”
Unfortunately, the proven Azure Databricks doesn’t have a corresponding ad-hoc alternative (a “cluster pool” in Databricks is similar, but its computing resources must still be provisioned beforehand), and Azure Data Lake Analytics no longer seems to be a focus of current Azure improvements.
If that’s your situation, you will be interested in the Serverless SQL pool (formerly SQL on-demand) in Azure Synapse Analytics.
Using the Serverless SQL pool (the “Built-in” pool) in Azure Synapse Analytics, you can immediately run queries against CSV, TSV, Parquet, and JSON files without preparing or running dedicated computing resources. The system automatically adjusts resources to your requirements, freeing you from managing infrastructure and from picking the right size for your solution.
For workloads that process occasional requests and mostly sit idle, such as log analytics or occasional business reports, this will help you save money.
Getting Started
Before using Azure Synapse Analytics, create an Azure storage account resource and upload your files into blob storage.
In this tutorial, I used the flight-weather parquet dataset from the Azure Databricks hands-on.
Now, create an Azure Synapse Analytics resource (workspace) in the Azure Portal and launch Synapse Studio.
First, click the “Develop” menu in the left navigation and create a new script file.
As you will notice, the default attached computing pool is the pre-built pool called “Built-in” (formerly “SQL on-demand”), because we don’t have any provisioned pools yet. (See below.) This pool is for Serverless SQL, so you don’t need to change it.
At this stage, the attached database might be the “master” database, which is the default database in an Azure Synapse Analytics workspace.
By default, the Serverless SQL pool tries to access your blob storage (including Data Lake Storage) using your Azure Active Directory identity.
However, you should know that you might experience slower performance with Azure AD pass-through.
In this tutorial, we run a serverless query using a SAS token without AAD pass-through. (Later I’ll show you how to connect remotely with AAD pass-through.)
First, generate a SAS token in your storage account. (Click the “Shared access signature” menu in your storage account and create a new SAS token.)
For security reasons, you cannot create a new database-scoped credential in the default master database.
Thus, run the following script to create a new database in your Synapse workspace.
CREATE DATABASE mydb01
In your script editor, change the database setting (see below) to your new database.
Now let’s create a new credential named “sqlondemand” (in which the SAS token is used as the secret, as follows) for accessing your blob storage from the database.
-- Set master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys) BEGIN
DECLARE @password NVARCHAR(400) = CAST(NEWID() AS NVARCHAR(400));
EXEC('CREATE MASTER KEY ENCRYPTION BY PASSWORD = ''' + @password + '''')
END
-- Create credential for blob
IF EXISTS
(SELECT * FROM sys.database_scoped_credentials
WHERE name = 'sqlondemand')
DROP DATABASE SCOPED CREDENTIAL [sqlondemand]
GO
CREATE DATABASE SCOPED CREDENTIAL [sqlondemand]
WITH IDENTITY='SHARED ACCESS SIGNATURE',
SECRET = 'sv=2019-10-10&ss=bfqt&srt...' --fill your storage SAS here
GO
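The SAS secret above is just a URL query string. As a quick sanity check before pasting a token into the credential, you can parse it to see which services (ss), resource types (srt), and permissions (sp) it grants. (A minimal Python sketch with a dummy token; the field meanings follow the Azure Storage account-SAS format.)

```python
from urllib.parse import parse_qs

# Dummy account SAS token (same shape as the secret used in the script above)
sas = "sv=2019-10-10&ss=bfqt&srt=sco&sp=rl&se=2024-12-31T00:00:00Z&sig=xxxx"

fields = {k: v[0] for k, v in parse_qs(sas).items()}
print("services      :", fields["ss"])   # b=blob, f=file, q=queue, t=table
print("resource types:", fields["srt"])  # s=service, c=container, o=object
print("permissions   :", fields["sp"])   # r=read, l=list -- enough for querying
```

Read and list permissions on blob objects are sufficient for the read-only queries in this tutorial.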
Create a data source for your storage account.
As you see below, I’m setting “sqlondemand”, the credential generated in the previous script, as the CREDENTIAL.
CREATE EXTERNAL DATA SOURCE DemoStorage WITH (
LOCATION = 'https://demostore01.blob.core.windows.net',
CREDENTIAL = sqlondemand
);
GO
Now you can run a serverless query as follows. Queries in the Serverless SQL pool are written in T-SQL (not PySpark or Spark SQL).
Here we query all parquet files under container01/flight_weather_parquet in the registered data source (DemoStorage) and simply fetch the top 10 results.
When you retrieve a large dataset in a notebook with a Spark pool (or Databricks), it responds quickly, since the data is loaded incrementally with a pagination UI. On a T-SQL platform, however, retrieving very large data takes a long time, since all the data is loaded at once. (My sample data includes approximately 2,000,000 rows; you can see the record count with COUNT_BIG(*).) So filter for only the required rows, as follows. (If the number of columns is also large, select only the required columns.)
SELECT TOP 10 *
FROM
OPENROWSET(
BULK 'container01/flight_weather_parquet/*.parquet',
DATA_SOURCE = 'DemoStorage',
FORMAT='PARQUET'
) AS flight_weather
GO
As you may know, Synapse Analytics uses a local cache to improve performance, and this behavior is the same for the Serverless SQL pool. Once the cache is warmed, performance will be faster until the cache is invalidated. (You can see this by running the same query repeatedly.)
To reuse the same query, you can also create a view, as follows.
Note that materialized views are not supported in the Serverless SQL pool, because it has no local storage; only metadata objects are stored there. (You can use CETAS instead, which I’ll show later.)
CREATE VIEW FlightBasic
AS SELECT YEAR, MONTH, UNIQUE_CARRIER, ORIGIN, DEST
FROM
OPENROWSET(
BULK 'container01/flight_weather_parquet/*.parquet',
DATA_SOURCE = 'DemoStorage',
FORMAT='PARQUET'
) AS flight_weather
GO
SELECT TOP 10 * FROM FlightBasic
GO
Credentials for Data Source
In the tutorial above, we ran a query with a SAS token credential.
By default, the Serverless SQL pool tries to access files using your Azure Active Directory identity. As I mentioned above, Azure AD pass-through gives you slower performance than a SAS token credential, but sometimes it’s useful, since it’s simple and gives you flexible access control.
Here I show you how to connect with Azure AD pass-through.
To connect to the storage account with an AAD credential, you should assign the ‘Storage Blob Data Contributor’ role to yourself on the storage account resource. (You can add a role assignment by clicking the “Access control (IAM)” menu on the blade of your storage account resource. See below.)
Now it’s ready.
You are probably logged in to the Azure Portal (and Synapse Studio) with your own credential, so there’s no need to create a scoped credential. (You also don’t need to create a new database.)
You can use the default “master” database and run the following script right away.
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'https://demostore01.blob.core.windows.net/container01/flight_weather_parquet/*.parquet',
FORMAT='PARQUET'
) AS flight_weather
As you saw above, AAD pass-through suits brief, interactive exploration in a UI (Synapse Studio or other management tools), since there’s no need to prepare any database objects.
Connect Programmatically
The Serverless SQL pool runs on familiar T-SQL and the standard SQL protocol.
Thus you can run serverless queries in the same manner as with SQL Server or Azure SQL Database.
For remote connections, both SQL authentication and Azure AD authentication are supported in Synapse Analytics. (There are also two administrative accounts: the server admin and the Active Directory admin.)
When you use AAD authentication for a remote connection, the credential for accessing files is also passed through consistently.
Let’s see a brief example. In this tutorial, we use SQL authentication.
First, copy the Serverless SQL endpoint (formerly the “SQL on-demand endpoint”) for your Synapse Analytics workspace from the resource blade in the Azure Portal.
Next, create a login object with a username and password for SQL authentication. (Run the following script in the master database.)
CREATE LOGIN demouser01 WITH PASSWORD = 'P@ssw0rd0001';
Change the database setting in your script to your own database (mydb01), and run the following script to create a user in that database.
CREATE USER demouser01 FROM LOGIN demouser01;
To allow this user to use the “sqlondemand” credential created earlier, run the following script to grant permissions.
GRANT CONTROL ON DATABASE SCOPED CREDENTIAL::sqlondemand TO demouser01
Now you can write your own code to connect and run serverless queries against Azure Synapse Analytics!
For instance, the following invokes a serverless query using JDBC in Scala. (Try this code in Azure Databricks.)
val jdbcHostname = "myws-ondemand.sql.azuresynapse.net"
val jdbcPort = 1433
val jdbcDatabase = "mydb01"
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
import java.util.Properties
val connectionProperties = new Properties()
val jdbcUsername = "demouser01"
val jdbcPassword = "P@ssw0rd0001"
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
val pushdown_query = "(SELECT TOP 10 * FROM OPENROWSET(BULK 'container01/flight_weather_parquet/*.parquet', DATA_SOURCE = 'DemoStorage', FORMAT='PARQUET') AS flight_weather) top10_flight_list"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
The following invokes the serverless query using PowerShell. (Sorry, but it uses classical disconnected-style data access with a DataSet object in .NET.)
$connStr = "Data Source=myws-ondemand.sql.azuresynapse.net;database=mydb01;User ID=demouser01;Password=P@ssw0rd0001"
$conn = New-Object System.Data.SqlClient.SqlConnection $connStr
$conn.Open()
$cmd = New-Object System.Data.SqlClient.SqlCommand
$cmd.Connection = $conn
$cmd.CommandText = "SELECT TOP 10 * FROM OPENROWSET(BULK 'container01/flight_weather_parquet/*.parquet', DATA_SOURCE = 'DemoStorage', FORMAT='PARQUET') AS flight_weather"
$adp = New-Object System.Data.SqlClient.SqlDataAdapter $cmd
$data = New-Object System.Data.DataSet
$adp.Fill($data)
# show result
$data.Tables
$conn.Close()
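As another sketch (not in the original post), the same connection can be built from Python. This only assembles the ODBC connection string from the example endpoint, database, and user shown above; actually connecting requires the pyodbc package and a SQL Server ODBC driver, so that part is shown only in comments.

```python
def synapse_odbc_conn_str(server, database, user, password):
    # ODBC Driver 17 for SQL Server speaks the same protocol as SQL Server /
    # Azure SQL Database, so the serverless endpoint works with it as-is.
    return (
        "Driver={ODBC Driver 17 for SQL Server};"
        f"Server={server},1433;Database={database};"
        f"Uid={user};Pwd={password};Encrypt=yes;"
    )

conn_str = synapse_odbc_conn_str(
    "myws-ondemand.sql.azuresynapse.net", "mydb01", "demouser01", "P@ssw0rd0001")
# With pyodbc installed you could then run:
#   import pyodbc
#   with pyodbc.connect(conn_str) as conn:
#       rows = conn.cursor().execute("SELECT TOP 10 * FROM FlightBasic").fetchall()
print(conn_str)
```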
You can invoke serverless queries from various applications, such as Excel, Power BI, and so on.
As with other SQL-based databases, you can also manage the database using Azure Data Studio or SQL Server Management Studio.
Elasticity
The Serverless SQL pool has a distributed data processing system, and queries against blob storage are executed elastically on backend Synapse computing resources.
Data in the data lake is organized into cells.
A user query is then divided into query fragments (called query tasks), which are hash-distributed for data processing.
Computing nodes are also automatically scaled in the backend.
The distributed query processor (DQP) component in the Serverless SQL pool may request more compute power to adjust to peaks during the workload. If granted, the DQP re-distributes tasks to leverage the new compute container. (In-flight tasks in the previous topology continue running after re-balancing.)
I won’t describe the details of this mechanism here; see “Democratizing the Data Lake with On-Demand Capabilities in Azure Synapse Analytics” from Microsoft Ignite 2019.
(From : “Democratizing the Data Lake with On-Demand Capabilities in Azure Synapse Analytics”, Microsoft Ignite 2019)
Supported File Formats and Concerns
Currently, the CSV (including TSV), Apache Parquet, and JSON (semi-structured) formats are supported in the Serverless SQL pool.
The Serverless SQL pool also allows you to query data in Azure Cosmos DB through Azure Synapse Link. In this post, we focus only on data sources for unstructured data (flat files) in blob storage.
From a performance perspective, Apache Parquet is the recommended format.
Parquet is a compressed columnar format, which speeds up data extraction, and unnecessary columns are skipped when querying parquet.
Furthermore, the Latin1_General_100_BIN2_UTF8 collation speeds things up even more, because the query can skip row groups in parquet based on the predicate in the WHERE clause.
There is another reason for using Apache Parquet in the Serverless SQL pool.
As you saw in the tutorials above, the schema of the underlying files can be auto-detected (inferred) by the Serverless SQL pool, much like spark.read() in Apache Spark. However, this schema inference currently works only for the parquet format.
For instance, when you use CSV, you must specify all columns in a schema description with the WITH clause in OPENROWSET. (See below.)
In my sample data (see above), there are approximately 60 columns. Imagine having to explicitly describe every column name and type without schema inference; it would be quite cumbersome!
SELECT *
FROM OPENROWSET (
BULK 'https://demostore01.blob.core.windows.net/container01/csv',
FORMAT = 'CSV',
FIELDTERMINATOR =',',
ROWTERMINATOR = '\n'
)
WITH (
[country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
[country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
[year] smallint,
[population] bigint
) AS [r]
WHERE
country_name = 'Luxembourg' AND year = 2017
GO
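Since writing out dozens of columns by hand is tedious, a WITH clause can be generated mechanically from the CSV header row. A small helper sketch (the default VARCHAR type is an illustrative assumption; you would still tune individual column types afterwards):

```python
import csv
import io

def openrowset_with_clause(header_line,
                           default_type="VARCHAR (100) COLLATE Latin1_General_BIN2"):
    """Build an OPENROWSET WITH (...) clause from a CSV header row."""
    cols = next(csv.reader(io.StringIO(header_line)))
    body = ",\n".join(f"    [{c.strip()}] {default_type}" for c in cols)
    return f"WITH (\n{body}\n)"

print(openrowset_with_clause("country_code,country_name,year,population"))
```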
However, specifying the schema explicitly can be better for performance. (Schema inference should also be avoided in production when using the Spark API.)
Update: You can now query data in delta lake format with Serverless SQL on Azure Synapse Analytics. (This is generally available.) There are still certain limitations for reading delta files in Serverless SQL; see here for details. (Delta lake support in Serverless SQL is still a work in progress.)
Previously, delta lake was not supported in the Serverless SQL pool at all (though it was supported in the dedicated SQL pool on Synapse Analytics). Suppose your team explores (experiments with) data in Azure Databricks and provides presentations through the Serverless SQL pool in Azure Synapse Analytics; Databricks says that over 75% of its users now use delta lake, so not being able to handle the delta lake format directly was a real limitation in such cases. (See the feedback item; this has since been addressed.)
Delta lake uses the parquet format underneath. (A delta lake consists of parquet files together with transaction logs and indexes/states.)
See “Exercise 9 : Delta Lake” in the Azure Databricks tutorial.
Supported T-SQL and Concerns
You can use more advanced queries in the Serverless SQL pool, with GROUP BY, ORDER BY, nested columns, and so on. You can also export query results to Azure Blob Storage or Azure Data Lake Storage Gen2 using CETAS (CREATE EXTERNAL TABLE AS SELECT) statements, and a subset of DDL statements is also supported in the Serverless SQL pool.
However, remember that not all T-SQL operations are supported in the Serverless SQL pool, for architectural reasons.
For instance, T-SQL in the Synapse Analytics dedicated SQL pool supports the PREDICT() function (see here), with which you can infer values from a trained ONNX model built in Azure Machine Learning. But this cannot be used in the Serverless SQL pool, because it has no local storage and the model binary (which must be stored in a table) is not reachable.
Currently, DML statements are not supported in the Serverless SQL pool either.
For details about supported T-SQL, see official document “Serverless SQL pool in Azure Synapse Analytics“.
Private Endpoint for Serverless SQL pool (Networking)
Sometimes you might need to connect to the Serverless SQL endpoint through a private endpoint for security reasons.
You can configure a private endpoint for the Azure Synapse Serverless SQL pool using the “Private endpoint connections” menu in the workspace blade in the Azure Portal. (See below.)
Once you have configured a private endpoint in your own virtual network (VNet), you can reach this endpoint from connected networks, including on-premises, through the gateway in the VNet.
When you create a Synapse workspace with managed virtual network enabled, a private endpoint for the Serverless SQL pool is automatically generated in this managed network. This endpoint is used only inside the managed network of the Synapse workspace. (See the “Manage” menu in Synapse Studio, as follows.)
Performance Optimization
Unfortunately, Serverless SQL doesn’t support sys.dm_pdw_request_steps (which can be used in Synapse dedicated SQL) for performance optimization.
See here for performance optimization in Synapse dedicated SQL. Serverless SQL does support dynamic management views (DMVs) such as sys.dm_exec_connections, sys.dm_exec_sessions, and sys.dm_exec_requests.
However, there are several tips and best practices for performance optimization in Synapse Serverless SQL, concerning file size, partitioning, data types, and so on.
Here I’ll show you several things to check.
Do not return large data
First, you shouldn’t return a large number of rows from a Serverless SQL query.
When you need to provide a large amount of data to users, apply pagination using OFFSET and FETCH in T-SQL. Also, consider using smaller data types where possible, such as varchar rather than nvarchar, and smallint rather than int.
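For illustration, pagination can be sketched by generating one OFFSET … FETCH query per page. (A hedged Python sketch; the FlightBasic view comes from earlier in this post, and the ORDER BY columns are example choices — T-SQL requires an ORDER BY for OFFSET/FETCH.)

```python
def page_query(page, page_size, source="FlightBasic", order_by="YEAR, MONTH"):
    """Build a T-SQL query returning one page of results (page is 0-based)."""
    return (
        f"SELECT * FROM {source} "
        f"ORDER BY {order_by} "
        f"OFFSET {page * page_size} ROWS FETCH NEXT {page_size} ROWS ONLY"
    )

print(page_query(0, 1000))  # first 1,000 rows
print(page_query(2, 1000))  # rows 2,001-3,000
```

Each generated query scans only one page’s worth of results back to the client, instead of returning all 2,000,000 rows at once.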
Use parquet (or delta lake) where possible
As I mentioned above, the parquet format (or delta lake) is recommended for performant queries, because of native optimizations such as columnar compression, column skipping, and so on.
Parquet is also compatible with partitioning, and works well with the filepath() and filename() functions in Synapse Serverless SQL, as follows. (These functions can also be used with other formats.)
SELECT
tpepPickupDateTime,
passengerCount
FROM
OPENROWSET(
BULK 'puYear=*/puMonth=*/*.snappy.parquet',
DATA_SOURCE = 'YellowTaxi',
FORMAT='PARQUET'
) nyc
WHERE
nyc.filepath(1) = 2017
AND nyc.filepath(2) IN (1, 2, 3)
AND tpepPickupDateTime BETWEEN CAST('1/1/2017' AS datetime) AND CAST('3/31/2017' AS datetime)
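In the query above, filepath(1) and filepath(2) refer to the first and second wildcards in the BULK pattern. To make the mechanics concrete, here is a small Python sketch (with made-up file names) of the same partition elimination — only files whose puYear/puMonth values satisfy the predicate would be read at all:

```python
import re

files = [
    "puYear=2016/puMonth=12/part-0.snappy.parquet",
    "puYear=2017/puMonth=1/part-0.snappy.parquet",
    "puYear=2017/puMonth=3/part-0.snappy.parquet",
    "puYear=2017/puMonth=6/part-0.snappy.parquet",
]

# filepath(1) -> first wildcard (puYear), filepath(2) -> second (puMonth)
pattern = re.compile(r"puYear=(\d+)/puMonth=(\d+)/")

selected = [
    f for f in files
    if (m := pattern.match(f))
    and int(m.group(1)) == 2017       # nyc.filepath(1) = 2017
    and int(m.group(2)) in (1, 2, 3)  # nyc.filepath(2) IN (1, 2, 3)
]
print(selected)  # only the 2017 January and March files remain
```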
You can also explicitly filter the partitions in Synapse Serverless SQL, as follows.
SELECT
payment_type,
fare
FROM OPENROWSET(
BULK (
'csv/taxi/yellow_tripdata_2017-01.csv',
'csv/taxi/yellow_tripdata_2017-1*.csv'
),
...
);
Serverless SQL has additional performance optimizations for querying parquet created by a Synapse Analytics Spark pool, because Serverless SQL automatically synchronizes metadata from the Spark pool. When a table is partitioned, it then targets only the necessary files based on the WHERE clause of the query, without explicitly specifying the partitions.
With the delta lake format, reading partitions directly (manually) is never necessary; you can simply use the WHERE clause of the query for data skipping. (Partition elimination is done automatically.)
Delta lake might become the best choice for performant queries in the future, but currently there are certain limitations, as I mentioned above.
Statistics
As in the Synapse dedicated SQL pool (see here), statistics are automatically created when the first query targets a table. The distributed query processor (DQP) generates appropriate query plans based on cost.
For instance, when you filter data on both column A and column B, the DQP determines which column should be used to filter first, based on the distribution of the column data.
Therefore you can optimize performance by manually creating (or updating) statistics, for instance when you want to warm up for the first query, or when the data has been largely updated.
Materialize with CETAS
When you want to materialize a frequently used part of a query (such as one including a JOIN clause), you can also use a CETAS (CREATE EXTERNAL TABLE ... AS SELECT ...) statement, as follows. CETAS exports the query results to parquet files in the data lake, speeding up subsequent queries.
CREATE EXTERNAL TABLE FactSale_CETAS
WITH (
LOCATION = 'FactSale_CETAS/',
DATA_SOURCE = Storage,
FILE_FORMAT = Parquet_file
)
AS
SELECT
Dsr.SalesReasonName
, COUNT_BIG(distinct Fis.SalesOrderNumber) SalesOrderNumber_COUNT
, AVG(CAST(SalesAmount AS DECIMAL(38,4))) SalesAmount_AVG
, AVG(CAST(OrderQuantity AS DECIMAL(38,4))) OrderQuantity_AVG
FROM ViewFactSale AS FIS
INNER JOIN ViewFactSaleReason AS Fisr
ON Fisr.SalesOrderNumber = Fis.SalesOrderNumber
AND Fisr.SalesOrderLineNumber = Fis.SalesOrderLineNumber
INNER JOIN ViewDimSales AS Dsr
ON Fisr.SalesReasonKey = Dsr.SalesReasonKey
GROUP BY Fis.SalesTerritoryKey, Fis.OrderDateKey, Dsr.SalesReasonName
When you want to update (refresh) the results in a CETAS table, you must recreate the CETAS table, for example in an Azure Synapse pipeline.
See the best practice guide to get the best performance in Serverless SQL.