People often ask me, “Is there an appropriate service for ad-hoc, serverless queries against unstructured data (flat files) that scales automatically, like Amazon Athena or Google BigQuery?”
Unfortunately, the proven Azure Databricks doesn’t have a corresponding ad-hoc alternative (a “cluster pool” in Databricks is similar, but its computing resources must still be provisioned beforehand), and Azure Data Lake Analytics no longer seems to be a focus of current Azure improvements.
If that’s your situation, you will be interested in the Serverless SQL pool (formerly SQL on-demand) in Azure Synapse Analytics.
Using the Serverless SQL pool (the “Built-in” pool) in Azure Synapse Analytics, you can immediately run queries against CSV, TSV, Parquet, and JSON files without preparing or running dedicated computing resources. The system automatically adjusts resources to your requirements, freeing you from managing infrastructure and from picking the right size for your solution.
For workloads that process occasional requests and mostly sit idle, such as log analytics or occasional business reports, this will help you save money.
Getting Started
Before using Azure Synapse Analytics, create an Azure storage account resource and upload your files into blob storage.
In this tutorial, I used the flight-weather parquet dataset from the Azure Databricks hands-on.
Now, create an Azure Synapse Analytics resource (workspace) in the Azure Portal and launch Synapse Studio.
First, click the “Develop” menu in the left navigation and create a new script file.
As you will notice, the default attached computing pool is the pre-built pool called “Built-in” (formerly “SQL on-demand”), because we don’t have any provisioned pools yet. (See below.) This pool is for Serverless SQL, so you don’t need to change it.
At this stage, the attached database might be the “master” database, which is the default database in an Azure Synapse Analytics workspace.
By default, the Serverless SQL pool tries to access your blob storage (including Data Lake Storage) using your Azure Active Directory identity.
However, you should know that you might experience slower performance with Azure AD pass-through.
In this tutorial, we run a serverless query using a SAS token without AAD pass-through. (Later I’ll show you how to connect remotely with AAD pass-through.)
First, generate a SAS token in your storage account. (Click the “Shared access signature” menu in your storage account and create a new SAS token.)
For security reasons, you cannot create a new database-scoped credential in the default master database.
Thus, run the following script to create a new database in your Synapse workspace.
CREATE DATABASE mydb01
In your script editor, change the database setting (see below) to your new database.
Now let’s create a new credential named “sqlondemand” (in which the SAS token is used as the secret, as follows) for accessing your blob storage from the database.
-- Set master key
IF NOT EXISTS (SELECT * FROM sys.symmetric_keys) BEGIN
DECLARE @password NVARCHAR(400) = CAST(NEWID() AS NVARCHAR(400));
EXEC('CREATE MASTER KEY ENCRYPTION BY PASSWORD = ''' + @password + '''')
END
-- Create credential for blob
IF EXISTS
(SELECT * FROM sys.database_scoped_credentials
WHERE name = 'sqlondemand')
DROP DATABASE SCOPED CREDENTIAL [sqlondemand]
GO
CREATE DATABASE SCOPED CREDENTIAL [sqlondemand]
WITH IDENTITY='SHARED ACCESS SIGNATURE',
SECRET = 'sv=2019-10-10&ss=bfqt&srt...' --fill your storage SAS here
GO
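The SAS secret above is just a URL query string. As a quick sanity check before pasting a token into the credential, you can parse it to see which services (ss), resource types (srt), and permissions (sp) it grants. (A minimal Python sketch with a dummy token; the field meanings follow the Azure Storage account-SAS format.)

```python
from urllib.parse import parse_qs

# Dummy account SAS token (same shape as the secret used in the script above)
sas = "sv=2019-10-10&ss=bfqt&srt=sco&sp=rl&se=2024-12-31T00:00:00Z&sig=xxxx"

fields = {k: v[0] for k, v in parse_qs(sas).items()}
print("services      :", fields["ss"])   # b=blob, f=file, q=queue, t=table
print("resource types:", fields["srt"])  # s=service, c=container, o=object
print("permissions   :", fields["sp"])   # r=read, l=list -- enough for querying
```

Read and list permissions on blob objects are sufficient for the read-only queries in this tutorial.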
Create a data source for your storage account.
As you see below, I’m setting “sqlondemand”, the credential generated in the previous script, as the CREDENTIAL.
CREATE EXTERNAL DATA SOURCE DemoStorage WITH (
LOCATION = 'https://demostore01.blob.core.windows.net',
CREDENTIAL = sqlondemand
);
GO
Now you can run a serverless query as follows. Queries in the Serverless SQL pool are written in T-SQL (not PySpark or Spark SQL).
Here we query all parquet files under container01/flight_weather_parquet in the registered data source (DemoStorage) and simply fetch the top 10 results.
When you retrieve a large dataset in a notebook with a Spark pool (or Databricks), it responds quickly, since the data is loaded incrementally with a pagination UI. On a T-SQL platform, however, retrieving very large data takes a long time, since all the data is loaded at once. (My sample data includes approximately 2,000,000 rows; you can see the record count with COUNT_BIG(*).) So filter for only the required rows, as follows. (If the number of columns is also large, select only the required columns.)
SELECT TOP 10 *
FROM
OPENROWSET(
BULK 'container01/flight_weather_parquet/*.parquet',
DATA_SOURCE = 'DemoStorage',
FORMAT='PARQUET'
) AS flight_weather
GO
As you may know, Synapse Analytics uses a local cache to improve performance, and this behavior is the same for the Serverless SQL pool. Once the cache is warmed, performance will be faster until the cache is invalidated. (You can see this by running the same query repeatedly.)
To reuse the same query, you can also create a view, as follows.
Note that materialized views are not supported in the Serverless SQL pool, because it has no local storage; only metadata objects are stored there. (You can use CETAS instead, which I’ll show later.)
CREATE VIEW FlightBasic
AS SELECT YEAR, MONTH, UNIQUE_CARRIER, ORIGIN, DEST
FROM
OPENROWSET(
BULK 'container01/flight_weather_parquet/*.parquet',
DATA_SOURCE = 'DemoStorage',
FORMAT='PARQUET'
) AS flight_weather
GO
SELECT TOP 10 * FROM FlightBasic
GO
Credentials for Data Source
In the tutorial above, we ran a query with a SAS token credential.
By default, the Serverless SQL pool tries to access files using your Azure Active Directory identity. As I mentioned above, Azure AD pass-through gives you slower performance than a SAS token credential, but sometimes it’s useful, since it’s simple and gives you flexible access control.
Here I show you how to connect with Azure AD pass-through.
To connect to the storage account with an AAD credential, you should assign the ‘Storage Blob Data Contributor’ role to yourself on the storage account resource. (You can add a role assignment by clicking the “Access control (IAM)” menu on the blade of your storage account resource. See below.)
Now it’s ready.
You are probably logged in to the Azure Portal (and Synapse Studio) with your own credential, so there’s no need to create a scoped credential. (You also don’t need to create a new database.)
You can use the default “master” database and run the following script right away.
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'https://demostore01.blob.core.windows.net/container01/flight_weather_parquet/*.parquet',
FORMAT='PARQUET'
) AS flight_weather
As you saw above, AAD pass-through suits brief, interactive exploration in a UI (Synapse Studio or other management tools), since there’s no need to prepare any database objects.
Connect Programmatically
The Serverless SQL pool runs on familiar T-SQL and the standard SQL protocol.
Thus you can run serverless queries in the same manner as with SQL Server or Azure SQL Database.
For remote connections, both SQL authentication and Azure AD authentication are supported in Synapse Analytics. (There are also two administrative accounts: the server admin and the Active Directory admin.)
When you use AAD authentication for a remote connection, the credential for accessing files is also passed through consistently.
Let’s see a brief example. In this tutorial, we use SQL authentication.
First, copy the Serverless SQL endpoint (formerly the “SQL on-demand endpoint”) for your Synapse Analytics workspace from the resource blade in the Azure Portal.
Next, create a login object with a username and password for SQL authentication. (Run the following script in the master database.)
CREATE LOGIN demouser01 WITH PASSWORD = 'P@ssw0rd0001';
Change the database setting in your script to your own database (mydb01), and run the following script to create a user in that database.
CREATE USER demouser01 FROM LOGIN demouser01;
To allow this user to use the “sqlondemand” credential created earlier, run the following script to grant permissions.
GRANT CONTROL ON DATABASE SCOPED CREDENTIAL::sqlondemand TO demouser01
Now you can write your own code to connect and run serverless queries against Azure Synapse Analytics!
For instance, the following invokes a serverless query using JDBC in Scala. (Try this code in Azure Databricks.)
val jdbcHostname = "myws-ondemand.sql.azuresynapse.net"
val jdbcPort = 1433
val jdbcDatabase = "mydb01"
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
import java.util.Properties
val connectionProperties = new Properties()
val jdbcUsername = "demouser01"
val jdbcPassword = "P@ssw0rd0001"
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
val pushdown_query = "(SELECT TOP 10 * FROM OPENROWSET(BULK 'container01/flight_weather_parquet/*.parquet', DATA_SOURCE = 'DemoStorage', FORMAT='PARQUET') AS flight_weather) top10_flight_list"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
The following invokes the serverless query using PowerShell. (Sorry, but it uses classical disconnected-style data access with a DataSet object in .NET.)
$connStr = "Data Source=myws-ondemand.sql.azuresynapse.net;database=mydb01;User ID=demouser01;Password=P@ssw0rd0001"
$conn = New-Object System.Data.SqlClient.SqlConnection $connStr
$conn.Open()
$cmd = New-Object System.Data.SqlClient.SqlCommand
$cmd.Connection = $conn
$cmd.CommandText = "SELECT TOP 10 * FROM OPENROWSET(BULK 'container01/flight_weather_parquet/*.parquet', DATA_SOURCE = 'DemoStorage', FORMAT='PARQUET') AS flight_weather"
$adp = New-Object System.Data.SqlClient.SqlDataAdapter $cmd
$data = New-Object System.Data.DataSet
$adp.Fill($data)
# show result
$data.Tables
$conn.Close()
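As another sketch (not in the original post), the same connection can be built from Python. This only assembles the ODBC connection string from the example endpoint, database, and user shown above; actually connecting requires the pyodbc package and a SQL Server ODBC driver, so that part is shown only in comments.

```python
def synapse_odbc_conn_str(server, database, user, password):
    # ODBC Driver 17 for SQL Server speaks the same protocol as SQL Server /
    # Azure SQL Database, so the serverless endpoint works with it as-is.
    return (
        "Driver={ODBC Driver 17 for SQL Server};"
        f"Server={server},1433;Database={database};"
        f"Uid={user};Pwd={password};Encrypt=yes;"
    )

conn_str = synapse_odbc_conn_str(
    "myws-ondemand.sql.azuresynapse.net", "mydb01", "demouser01", "P@ssw0rd0001")
# With pyodbc installed you could then run:
#   import pyodbc
#   with pyodbc.connect(conn_str) as conn:
#       rows = conn.cursor().execute("SELECT TOP 10 * FROM FlightBasic").fetchall()
print(conn_str)
```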
You can invoke serverless queries from various applications, such as Excel, Power BI, and so on.
As with other SQL-based databases, you can also manage the database using Azure Data Studio or SQL Server Management Studio.
Elasticity
The Serverless SQL pool has a distributed data processing system, and queries against blob storage are executed elastically on backend Synapse computing resources.
Data in the data lake is organized into cells.
A user query is then divided into query fragments (called query tasks), which are hash-distributed for data processing.
Computing nodes are also automatically scaled in the backend.
The distributed query processor (DQP) component in the Serverless SQL pool may request more compute power to adjust to peaks during the workload. If granted, the DQP re-distributes tasks to leverage the new compute container. (In-flight tasks in the previous topology continue running after re-balancing.)
I won’t describe the details of this mechanism here; see “Democratizing the Data Lake with On-Demand Capabilities in Azure Synapse Analytics” from Microsoft Ignite 2019.
(From : “Democratizing the Data Lake with On-Demand Capabilities in Azure Synapse Analytics”, Microsoft Ignite 2019)
Supported File Formats and Concerns
Currently, the CSV (including TSV), Apache Parquet, and JSON (semi-structured) formats are supported in the Serverless SQL pool.
The Serverless SQL pool also allows you to query data in Azure Cosmos DB through Azure Synapse Link. In this post, we focus only on data sources for unstructured data (flat files) in blob storage.
From a performance perspective, Apache Parquet is the recommended format.
Parquet is a compressed columnar format, which speeds up data extraction, and unnecessary columns are skipped when querying parquet.
Furthermore, the Latin1_General_100_BIN2_UTF8 collation speeds things up even more, because the query can skip row groups in parquet based on the predicate in the WHERE clause.
There is another reason for using Apache Parquet in the Serverless SQL pool.
As you saw in the tutorials above, the schema of the underlying files can be auto-detected (inferred) by the Serverless SQL pool, much like spark.read() in Apache Spark. However, this schema inference currently works only for the parquet format.
For instance, when you use CSV, you must specify all columns in a schema description with the WITH clause in OPENROWSET. (See below.)
In my sample data (see above), there are approximately 60 columns. Imagine having to explicitly describe every column name and type without schema inference; it would be quite cumbersome!
SELECT *
FROM OPENROWSET (
BULK 'https://demostore01.blob.core.windows.net/container01/csv',
FORMAT = 'CSV',
FIELDTERMINATOR =',',
ROWTERMINATOR = '\n'
)
WITH (
[country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
[country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
[year] smallint,
[population] bigint
) AS [r]
WHERE
country_name = 'Luxembourg' AND year = 2017
GO
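Since writing out dozens of columns by hand is tedious, a WITH clause can be generated mechanically from the CSV header row. A small helper sketch (the default VARCHAR type is an illustrative assumption; you would still tune individual column types afterwards):

```python
import csv
import io

def openrowset_with_clause(header_line,
                           default_type="VARCHAR (100) COLLATE Latin1_General_BIN2"):
    """Build an OPENROWSET WITH (...) clause from a CSV header row."""
    cols = next(csv.reader(io.StringIO(header_line)))
    body = ",\n".join(f"    [{c.strip()}] {default_type}" for c in cols)
    return f"WITH (\n{body}\n)"

print(openrowset_with_clause("country_code,country_name,year,population"))
```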
However, specifying the schema explicitly can be better for performance. (Schema inference should also be avoided in production when using the Spark API.)
Update: You can now query data in delta lake format with Serverless SQL on Azure Synapse Analytics. (This is generally available.) There are still certain limitations for reading delta files in Serverless SQL; see here for details. (Delta lake support in Serverless SQL is still a work in progress.)
Previously, delta lake was not supported in the Serverless SQL pool at all (though it was supported in the dedicated SQL pool on Synapse Analytics). Suppose your team explores (experiments with) data in Azure Databricks and provides presentations through the Serverless SQL pool in Azure Synapse Analytics; Databricks says that over 75% of its users now use delta lake, so not being able to handle the delta lake format directly was a real limitation in such cases. (See the feedback item; this has since been addressed.)
Delta lake uses the parquet format underneath. (A delta lake consists of parquet files together with transaction logs and indexes/states.)
See “Exercise 9 : Delta Lake” in the Azure Databricks tutorial.
Supported T-SQL and Concerns
You can use more advanced queries in the Serverless SQL pool, with GROUP BY, ORDER BY, nested columns, and so on. You can also export query results to Azure Blob Storage or Azure Data Lake Storage Gen2 using CETAS (CREATE EXTERNAL TABLE AS SELECT) statements, and a subset of DDL statements is also supported in the Serverless SQL pool.
However, remember that not all T-SQL operations are supported in the Serverless SQL pool, for architectural reasons.
For instance, T-SQL in the Synapse Analytics dedicated SQL pool supports the PREDICT() function (see here), with which you can infer values from a trained ONNX model built in Azure Machine Learning. But this cannot be used in the Serverless SQL pool, because it has no local storage and the model binary (which must be stored in a table) is not reachable.
Currently, DML statements are not supported in the Serverless SQL pool either.
For details about supported T-SQL, see official document “Serverless SQL pool in Azure Synapse Analytics“.
Private Endpoint for Serverless SQL pool (Networking)
Sometimes you might need to connect to the Serverless SQL endpoint through a private endpoint for security reasons.
You can configure a private endpoint for the Azure Synapse Serverless SQL pool using the “Private endpoint connections” menu in the workspace blade in the Azure Portal. (See below.)
Once you have configured a private endpoint in your own virtual network (VNet), you can reach this endpoint from connected networks, including on-premises, through the gateway in the VNet.
When you create a Synapse workspace with managed virtual network enabled, a private endpoint for the Serverless SQL pool is automatically generated in this managed network. This endpoint is used only inside the managed network of the Synapse workspace. (See the “Manage” menu in Synapse Studio, as follows.)
Performance Optimization
Unfortunately, Serverless SQL doesn’t support sys.dm_pdw_request_steps (which can be used in Synapse dedicated SQL) for performance optimization.
See here for performance optimization in Synapse dedicated SQL. Serverless SQL does support dynamic management views (DMVs) such as sys.dm_exec_connections, sys.dm_exec_sessions, and sys.dm_exec_requests.
However, there are several tips and best practices for performance optimization in Synapse Serverless SQL, concerning file size, partitioning, data types, and so on.
Here I’ll show you several things to check.
Do not return large data
First, you shouldn’t return a large number of rows from a Serverless SQL query.
When you need to provide a large amount of data to users, apply pagination using OFFSET and FETCH in T-SQL. Also, consider using smaller data types where possible, such as varchar rather than nvarchar, and smallint rather than int.
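For illustration, pagination can be sketched by generating one OFFSET … FETCH query per page. (A hedged Python sketch; the FlightBasic view comes from earlier in this post, and the ORDER BY columns are example choices — T-SQL requires an ORDER BY for OFFSET/FETCH.)

```python
def page_query(page, page_size, source="FlightBasic", order_by="YEAR, MONTH"):
    """Build a T-SQL query returning one page of results (page is 0-based)."""
    return (
        f"SELECT * FROM {source} "
        f"ORDER BY {order_by} "
        f"OFFSET {page * page_size} ROWS FETCH NEXT {page_size} ROWS ONLY"
    )

print(page_query(0, 1000))  # first 1,000 rows
print(page_query(2, 1000))  # rows 2,001-3,000
```

Each generated query scans only one page’s worth of results back to the client, instead of returning all 2,000,000 rows at once.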
Use parquet (or delta lake) where possible
As I mentioned above, the parquet format (or delta lake) is recommended for performant queries, because of native optimizations such as columnar compression, column skipping, and so on.
Parquet is also compatible with partitioning, and works well with the filepath() and filename() functions in Synapse Serverless SQL, as follows. (These functions can also be used with other formats.)
SELECT
tpepPickupDateTime,
passengerCount
FROM
OPENROWSET(
BULK 'puYear=*/puMonth=*/*.snappy.parquet',
DATA_SOURCE = 'YellowTaxi',
FORMAT='PARQUET'
) nyc
WHERE
nyc.filepath(1) = 2017
AND nyc.filepath(2) IN (1, 2, 3)
AND tpepPickupDateTime BETWEEN CAST('1/1/2017' AS datetime) AND CAST('3/31/2017' AS datetime)
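In the query above, filepath(1) and filepath(2) refer to the first and second wildcards in the BULK pattern. To make the mechanics concrete, here is a small Python sketch (with made-up file names) of the same partition elimination — only files whose puYear/puMonth values satisfy the predicate would be read at all:

```python
import re

files = [
    "puYear=2016/puMonth=12/part-0.snappy.parquet",
    "puYear=2017/puMonth=1/part-0.snappy.parquet",
    "puYear=2017/puMonth=3/part-0.snappy.parquet",
    "puYear=2017/puMonth=6/part-0.snappy.parquet",
]

# filepath(1) -> first wildcard (puYear), filepath(2) -> second (puMonth)
pattern = re.compile(r"puYear=(\d+)/puMonth=(\d+)/")

selected = [
    f for f in files
    if (m := pattern.match(f))
    and int(m.group(1)) == 2017       # nyc.filepath(1) = 2017
    and int(m.group(2)) in (1, 2, 3)  # nyc.filepath(2) IN (1, 2, 3)
]
print(selected)  # only the 2017 January and March files remain
```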
You can also explicitly filter the partitions in Synapse Serverless SQL, as follows.
SELECT
payment_type,
fare
FROM OPENROWSET(
BULK (
'csv/taxi/yellow_tripdata_2017-01.csv',
'csv/taxi/yellow_tripdata_2017-1*.csv'
),
...
);
Serverless SQL has additional performance optimizations for querying parquet created by a Synapse Analytics Spark pool, because Serverless SQL automatically synchronizes metadata from the Spark pool. When a table is partitioned, it then targets only the necessary files based on the WHERE clause of the query, without explicitly specifying the partitions.
With the delta lake format, reading partitions directly (manually) is never necessary; you can simply use the WHERE clause of the query for data skipping. (Partition elimination is done automatically.)
Delta lake might become the best choice for performant queries in the future, but currently there are certain limitations, as I mentioned above.
Statistics
As in the Synapse dedicated SQL pool (see here), statistics are automatically created when the first query targets a table. The distributed query processor (DQP) generates appropriate query plans based on cost.
For instance, when you filter data on both column A and column B, the DQP determines which column should be used to filter first, based on the distribution of the column data.
Therefore you can optimize performance by manually creating (or updating) statistics, for instance when you want to warm up for the first query, or when the data has been largely updated.
Materialize with CETAS
When you want to materialize a frequently used part of a query (such as one including a JOIN clause), you can also use a CETAS (CREATE EXTERNAL TABLE ... AS SELECT ...) statement, as follows. CETAS exports the query results to parquet files in the data lake, speeding up subsequent queries.
CREATE EXTERNAL TABLE FactSale_CETAS
WITH (
LOCATION = 'FactSale_CETAS/',
DATA_SOURCE = Storage,
FILE_FORMAT = Parquet_file
)
AS
SELECT
Dsr.SalesReasonName
, COUNT_BIG(distinct Fis.SalesOrderNumber) SalesOrderNumber_COUNT
, AVG(CAST(SalesAmount AS DECIMAL(38,4))) SalesAmount_AVG
, AVG(CAST(OrderQuantity AS DECIMAL(38,4))) OrderQuantity_AVG
FROM ViewFactSale AS FIS
INNER JOIN ViewFactSaleReason AS Fisr
ON Fisr.SalesOrderNumber = Fis.SalesOrderNumber
AND Fisr.SalesOrderLineNumber = Fis.SalesOrderLineNumber
INNER JOIN ViewDimSales AS Dsr
ON Fisr.SalesReasonKey = Dsr.SalesReasonKey
GROUP BY Fis.SalesTerritoryKey, Fis.OrderDateKey, Dsr.SalesReasonName
When you want to update (refresh) the results in a CETAS table, you must recreate the CETAS table, for example in an Azure Synapse pipeline.
See the best practice guide to get the best performance in Serverless SQL.