Digest: DocumentDB Resource Model and Concepts

Azure DocumentDB was released a few weeks ago and with it came an early batch of documentation: small in quantity but of good quality.

One of those articles is DocumentDB Resource Model and Concepts. That article walks through the different concepts of DocumentDB’s inner model.

That article sheds some light on the product but also reveals the extent of the announced features and their limitations. It’s definitely recommended reading if you want to understand the product.

Here I’m going to focus on a few key points I found important.

For starters, here is the concept map of DocumentDB.

Self Links

Did you notice the partial URIs under each box in the diagram (e.g. /dbs/{id}, /users/{id}, etc.)? Those are part of the self link.

Each object in DocumentDB is addressable via a self link, a URI. For instance, from an account base URI you can reach the stored procedure MyProc in the collection MyCollection in the database MyDatabase via the URI <account base URI>/dbs/MyDatabase/colls/MyCollection/sprocs/MyProc.

This is of course a reflection of the fact that DocumentDB exposes a REST API. The SDK reflects the REST API faithfully, so you do not need to traverse the object model to get to a sproc; you can get to it directly via its self link.
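As an illustration, here is a minimal sketch of that direct access with the preview .NET SDK as I understand it (the endpoint, key and resource names are placeholders, and the exact API shape may differ):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

class SelfLinkSample
{
    // Sketch: call a stored procedure directly by its link, without walking
    // database -> collection -> sproc in the object model.
    static async Task CallSprocBySelfLinkAsync()
    {
        var client = new DocumentClient(
            new Uri("https://myaccount.documents.azure.com"),  // placeholder endpoint
            "<account key>");                                   // placeholder key

        var sprocLink = "dbs/MyDatabase/colls/MyCollection/sprocs/MyProc";
        var result = await client.ExecuteStoredProcedureAsync<dynamic>(sprocLink);

        Console.WriteLine(result.Response);
    }
}
```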

Capacity Units: Account

We configure the number of capacity units (which come with CPU and storage; basically, VMs) at the account level.

So why would you configure multiple accounts in an architecture? To isolate capacity between workloads. For instance, if you have two workloads requiring a lot of torque that shouldn’t interfere with each other, put them in two different DocumentDB accounts.

Scaling Unit: Collection

A DocumentDB collection is a scaling unit. It is the ultimate transaction boundary: a transaction can’t span two collections (let alone two databases). It is also the unit with the size limit: 10 GB in the preview.

We can guess (although it isn’t explicitly stated as such) that a collection is contained within a single VM, hence its capacity to handle a transaction efficiently and its finite size. Collections are likely replicated across capacity units, but one replica of a collection can’t span two capacity units.

Hence if you want more storage, add collections… and, unfortunately, start managing partitions yourself.

And here comes my first product request: collections with an eventual consistency policy shouldn’t have a size limit, and the sharding should be managed by DocumentDB itself (hidden from the consumer).

SSD backed Document Storage

It is mentioned in a few places that the storage is backed by Solid State Drives (SSD). There is no mention of tiering, so does that mean the entire DB is stored on SSD?

Automatic but configurable indexing

You do not need to hint DocumentDB about how to construct its indexes. It figures it out by optimizing your query plans.

One thing you can do, though, is set indexing policies. For instance, you could tell DocumentDB it’s alright to update the indexes of a collection asynchronously, hence boosting write performance.
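Here is a hedged sketch of what that could look like with the preview .NET SDK (database and collection names are made up; the exact property names may differ slightly):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

class LazyIndexingSample
{
    // Sketch: create a collection whose index is maintained lazily (asynchronously),
    // trading a bit of query freshness for better write throughput.
    static async Task CreateLazyCollectionAsync(DocumentClient client)
    {
        var collection = new DocumentCollection { Id = "Logs" };
        collection.IndexingPolicy.IndexingMode = IndexingMode.Lazy;

        await client.CreateDocumentCollectionAsync("dbs/MyDatabase", collection);
    }
}
```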

Features like this make DocumentDB look quite sophisticated for a V1 product.

Javascript as the language

Yes Javascript is a popular language these days. But in the case of DocumentDB it serves another purpose than following fashion.

Its documents are made of JSON, which is basically Javascript objects. Hence Javascript is the natural language to manipulate those objects, removing any mismatch between the data and the language manipulating it. Compare this with C#, for instance, where all JSON object manipulation would have meant string manipulation.

Attachments

Unclear in the initial brochure, DocumentDB can store more than just JSON: it can attach Binary Large Objects (blobs) to documents. The document then acts as metadata for the attachment.

Users… more roles than users

Users in DocumentDB are aggregations of permissions. As a user you do not authenticate per se against DocumentDB. Hence the concept is more akin to a role.

 

Conclusion

DocumentDB is a quite complete and elegant product. Despite being in preview mode and being a “0.9” version, it feels —by its feature set— like a strong v1 or even v2.

I hope this digest gave you a few pointers.

Azure DocumentDB: first use cases

A few weeks ago Microsoft released (in preview mode) its new NoSQL Database: DocumentDB.

Not Only SQL (NoSQL) databases are typically segmented in the following categories: Key-Value (e.g. Azure Table Storage, Redis), Column (e.g. HBase, Cassandra), Document (e.g. CouchDB, MongoDB) & Graph. By its name but mostly by its feature set, DocumentDB falls in the document category.

My first reaction was: Wow, a bit late to the party!

Indeed, the NoSQL technology space has slowly started to consolidate, so it would seem a bit late to bring a new product to this crowded marketplace, unless you have value-added features.

And DocumentDB does. Its main marketing points are:

  • SQL syntax for queries (easy ramp-up)
  • 4 consistency policies, giving you flexibility and choice

But then you read a little bit more and you realise that DocumentDB is the technology powering OneNote in production with zillions of users. So it has been in production for quite a while and should be rather stable. I wouldn’t be surprised to learn that it is behind the new Azure Search as well (released in preview mode the same day).

Now what to do with that new technology?

I don’t see it replacing SQL Server as the backbone of major projects anytime soon.

I do see it replacing its other Azure NoSQL brother-in-law… Yes, I’m looking at you, Azure Table Storage, with your dead-end feature set.

Table Storage had a nice early start and stalled right after. Despite the community asking for secondary indexes, they never came, making Table Storage the most scalable write-only storage solution on the block.

In comparison, DocumentDB has secondary indexes, and the beauty is that you do not even need to think about them; they are dynamically created to optimize the queries you throw at the engine!

On top of indexes, DocumentDB, supporting SQL syntax, supports batch operations. Something as simple as ‘delete all the verbose logs older than 2 weeks’ requires a small program in Table Storage, and that program will run forever if you have loads of records since it loads each record before deleting it. In comparison, DocumentDB will accept a delete-criteria SQL command (one line of code) and should perform way faster.
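Here is a rough sketch of the kind of code I have in mind with the preview .NET SDK (collection link and property names are made up; since I haven’t verified a server-side DELETE statement in the preview, this version queries the matching documents and deletes them by self link, which is still far less code than the Table Storage equivalent):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

class LogCleanupSample
{
    // Sketch: purge verbose log documents older than two weeks.
    // Assumes logs carry 'level' and ISO-8601 'timestamp' properties.
    static async Task PurgeVerboseLogsAsync(DocumentClient client)
    {
        var cutoff = DateTime.UtcNow.AddDays(-14).ToString("o");
        var sql = string.Format(
            "SELECT * FROM logs l WHERE l.level = 'Verbose' AND l.timestamp < '{0}'",
            cutoff);

        var oldLogs = client.CreateDocumentQuery<Document>("dbs/MyDatabase/colls/Logs", sql)
                            .AsEnumerable();

        foreach (var doc in oldLogs)
            await client.DeleteDocumentAsync(doc.SelfLink);
    }
}
```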

Actually, logging is the first application I’m going to use DocumentDB for.

Having logs in Table Storage is a royal pain when the time comes to consume them. Azure Storage Explorer? Oh, that’s fantastic if you have 200 records. Otherwise you either load them in Excel or SQL, in both cases defeating the purpose of having a scalable back-end.

 

Yes, I can see DocumentDB as a nice intermediary between Azure SQL Federation (where scalability isn’t sufficiently transparent) and Table Storage (for the reasons I just enumerated). In time I can see it replacing Table Storage, although that will depend on the pricing.

I’ll try to do a logging POC. Stay tuned for news on that.

How to improve Azure: Granularity of access

In this blog series I explore some of the shortcomings of the Microsoft Azure platform (as of this date, April 2014) and discuss ways it could be improved. This isn’t a rant against the platform: I’ve been using and promoting the platform for more than four (4) years now and I’m very passionate about it. Here I am pointing at problems and suggesting solutions. Feel free to jump in the discussion in the comments section!

The past blog entries are:

In my last post, I discussed the variety of security models in the Azure platform, i.e. the different ways to authenticate or to get access to a resource. This time, I would like to discuss the granularity of access.

In any system, once you are authenticated, you are authorized for certain access / actions. Access comes in different flavours: read/write, create/delete, etc.

A recurring theme in authorization schemes is the concept of a hierarchy of rights, used to get around the complexity of fine-grained access. For example, in both the Windows File System & SharePoint, each file can be denied to a group or to specific users, but typically access is managed at a higher level, e.g. folders, libraries, sites, site collections, etc.

If we now look at different Azure services, we’ll find the same diversity we found in security models.

SQL Azure has the same authorization scheme as SQL Server, i.e. databases, schemas, objects (e.g. tables, views). Service Bus has three possible actions (manage, read & write), and those can be granted hierarchically to an entire namespace or to some sub-domain. Active Directory has two permissions for the Graph API, read & read-write, applied to the entire directory.

Now the weakness I’ve experienced is the lack of granularity in some services, mainly Azure Storage & Active Directory.

The Active Directory Graph API grants read or read-write access to the entire directory. I cannot give an application the ability to manage only some groups: it’s a master switch; once I grant the access, it’s total.

In the case of Azure Storage, there are two types of access: via SAS or via access keys. With an access key, you’re the king of the entire storage account: you can create / delete containers and read-write wherever you want. SAS come in two flavours. Ad hoc SAS offer quite granular access: they can be given on a file with read or read-write access, or on a container with read, read-write, or list (to obtain the list of files) access. SAS policies, on the other hand, are good for an entire container, so once you give one to a system, it can do whatever you authorized it to do (e.g. write) to the entire container. So ad hoc SAS sound good, don’t they? Except they are short lived and must be created… using the access keys!
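For reference, minting one of those ad hoc SAS looks roughly like this with the storage client library (container and blob names are placeholders), which makes the chicken-and-egg problem visible: you need the account key to produce it.

```csharp
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class AdHocSasSample
{
    // The connection string contains the account key: whoever mints the SAS is "king".
    static string CreateReadOnlyBlobUri(string connectionString)
    {
        var blob = CloudStorageAccount.Parse(connectionString)
                                      .CreateCloudBlobClient()
                                      .GetContainerReference("invoices")          // placeholder container
                                      .GetBlockBlobReference("2014/08/1234.pdf"); // placeholder blob

        // Ad hoc SAS: read-only, on a single blob, short lived.
        var sas = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddHours(1)
        });

        return blob.Uri.AbsoluteUri + sas;
    }
}
```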

 

The main problem with not having fine-grained enough access controls is that you end up giving more access than you want to. I give you access to write in this entire blob container, but please write only in this sub-folder. If files start to disappear, you’ll be on the list of suspects though.

There is an overarching principle in Information Technology, the principle of least privilege: give an agent access ONLY to what it needs to do its business. In order to do that, you need the underlying platform to support it.

Microsoft Patterns & Practices has a corresponding pattern, the valet key pattern. This is a quite specialized view of the principle of least privilege, limited to SAS. The problem with SAS, as I mentioned in previous articles, is that it multiplies the number of secrets you need to keep as a system. If, as a system, I access 10 resources, I need to keep 10 SAS, i.e. 10 secrets, since if I divulge a SAS to a third party, said third party now has access it shouldn’t have. An authentication / authorization mechanism minimizes the number of secrets to one: the secret you need to authenticate.

 

So…  how can we improve?

Well, the conceptual solution is quite easy:

  • Make sure to have the most granular access scheme possible
  • Use other mechanisms, e.g. hierarchy, to ensure the granularity doesn’t make the access unmanageable

This is quite coupled with the previous article about unifying the security models:  you need to authenticate a user / agent in order to grant them access.  But even with SAS you could go more granular than is currently the case.

How to improve Azure: Security Models

In this blog series I explore some of the shortcomings of the Microsoft Azure platform (as of this date, April 2014) and discuss ways it could be improved. This isn’t a rant against the platform: I’ve been using and promoting the platform for more than four (4) years now and I’m very passionate about it. Here I am pointing at problems and suggesting solutions. Feel free to jump in the discussion in the comments section!

The past blog entries are:

 

What is the security model of Microsoft Azure?

This question is at the heart of the weakness I would like to discuss in this article: there is no single security model in Azure. There is a plethora of models, depending on which services you are consuming. Let’s look at some examples:

  • Azure Storage: you can either use an access key (there are two: primary and secondary) in the Authorization header of HTTP requests, which gives you all privileges within the entire storage account, or you can use a Shared Access Signature (SAS) that you place in the query string of your HTTP requests, which gives you limited access (see valet key). There are actually two flavours of SAS, an ad hoc form and a more permanent one.
  • SQL Azure: user name / password of a SQL account. This is the same mechanism used with on-premise SQL Server, which Microsoft has discouraged in favour of the Windows Integrated mechanism; the latter isn’t supported by SQL Azure.
  • Service Bus:  access token provided by Access Control Service (ACS) or a SAS (independent of the Azure Storage SAS).  Both mechanisms can give limited access to resources (see my past blog on how to secure Service Bus access).
  • Azure Active Directory Graph API:  JWT access token provided by Azure Active Directory using OAuth-2.

Those are just a few, and they only cover access to the services themselves. When you want to manage those services (e.g. creating a SQL Azure database), you may have a different security model.

If you use Microsoft Azure lightly, i.e. one or two services, it might not appear as a problem.  Once you start using Azure a bit more as a development platform, then this lack of uniformity will hit you:

  1. You have a lot of protocols to understand and implement.
  2. You need to work around different limitations, e.g. I can be very granular on access on the Service Bus but with SAS Policies in Azure Storage, the SAS is good for an entire container.
  3. You have a lot of secrets to store, since many services do not share secrets (see my blog on secrets for more details).
  4. You have a lot of credentials to manage, e.g. you need to implement retention policies and renew different secrets in different services which all have different management interfaces.  This increases the chances of error or more typically, of not automating such policies.

 

How could we fix that?

 

I would propose a dual security model for each Service.

The primary Security Model would be claims-based and not even limited to Microsoft Azure Active Directory, but open to any identity provider. For ease of use, your Azure subscription could be configured with a set of trusted Identity Providers (i.e. URI, signing keys & token type). When you want to configure access, you could then have a standard dialog where you pick the Identity Provider and compose a claims rule (e.g. I want this claim type & value, or those types & values, or either this combination or that one, etc.).
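To give a feel for what such a rule could boil down to, here is a purely hypothetical sketch; no such subscription-level dialog exists today, and the claim types and values are invented:

```csharp
using System;
using System.Security.Claims;

static class ServiceBusAccessRule
{
    // Hypothetical: a subscription-level claims rule, evaluated against whatever
    // token a trusted identity provider issued. Claim types & values are invented.
    public static readonly Func<ClaimsPrincipal, bool> CanWrite = principal =>
        principal.HasClaim("http://schemas.contoso.com/claims/role", "bus-writer")
        || (principal.HasClaim(ClaimTypes.Role, "ops")
            && principal.HasClaim("environment", "production"));
}
```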

This would enable your solution to have Service Identities managed wherever you want (although Azure Active Directory would be easier since it is part of the platform), and the same identity could be used to access SQL Azure, Service Bus, Media Services, etc. Quite a bit like we do on-premise with Service Accounts.

The access token provided by the Identity Provider would need to be passed in the Authorization header of each HTTP request, or differently for different protocols (e.g. TCP TDS for SQL, proprietary TCP for Service Bus, etc.).
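For the plain HTTP case, passing such a token typically looks like this (a minimal sketch; the endpoint is a placeholder and the token acquisition is whatever the chosen identity provider dictates):

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class BearerTokenSample
{
    // Sketch: attach the access token to each request in the Authorization header.
    static async Task<string> CallServiceAsync(string accessToken)
    {
        using (var http = new HttpClient())
        {
            http.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", accessToken);

            // Placeholder endpoint.
            return await http.GetStringAsync("https://myservice.example.com/api/resource");
        }
    }
}
```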


The secondary Security Model could be unique to each Service but would typically address the shortcomings of the primary one. The primary model I define here requires access to an Identity Provider (which must be up), configuration of an account in that Provider, etc. A simpler Security Model, e.g. SAS, would be quite good for faster ramp-up with a Service (although it has the weakness of multiplying the number of secrets).

 

With this Primary / Secondary Security Model we would have a robust security model (Primary) and one for quicker use (Secondary).  When using the primary security model, it is quite possible to have only one secret to store for an application:  the user name & password of the Service Identity associated with the application.  This would allow for central management of the secret and much less complex logic to store and use the secrets.

 

Hope this was useful!

How to improve Azure: Can you keep a secret?

In this blog series I explore some of the shortcomings of the Windows Azure platform (as of this date, March 2014) and discuss ways it could be improved. This isn’t a rant against the platform: I’ve been using and promoting the platform for more than four (4) years now and I’m very passionate about it. Here I am pointing at problems and suggesting solutions. Feel free to jump in the discussion in the comments section!

   
 

What is a secret in the context of a Cloud Application?

A secret is any credential giving access to something. Do I mean a password? Well, I mean a password, a username, an encryption key, a Shared Access Signature (SAS), whatever gives access to resources.

A typical Cloud application interacting with a few services accumulates a few of those. As an example:

  • User name / password to authenticate against the Azure Access Control Service (ACS) related to an Azure Service Bus (you access more than one Service Bus namespace? You’ll have as many credentials as namespaces you are interacting with)
  • SAS to access a blob container
  • Storage Account access key to access a table in a Storage Account (yes, you could do it with SAS now, but I’m striving for diversity in this example ;) )

All those secrets are used as input to some Azure SDK libraries during the runtime of the application. For instance, in order to create a MessagingFactory for the Azure Service Bus, you’ll need to call a CreateAsync method with the credentials of the account you wish to use.
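For instance, building a MessagingFactory with the ACS shared-secret credentials looks roughly like this (a sketch; the namespace name is a placeholder). Notice the issuer name & key flowing straight through application code:

```csharp
using System.Threading.Tasks;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

class MessagingFactorySample
{
    // Sketch: the application itself must hold the ACS issuer name & key
    // in order to build the factory.
    static async Task<MessagingFactory> CreateFactoryAsync(string issuerName, string issuerKey)
    {
        var address = ServiceBusEnvironment.CreateServiceUri("sb", "mynamespace", string.Empty);
        var settings = new MessagingFactorySettings
        {
            TokenProvider = TokenProvider.CreateSharedSecretTokenProvider(issuerName, issuerKey)
        };

        return await MessagingFactory.CreateAsync(address, settings);
    }
}
```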

This means your application needs to know the credentials: a weakness right there!

Compare this with the typical way you configure an application on Windows Server. For instance, you want an IIS process to run under a given service account? You ask your favorite sys-admin to punch the service account name & password into the IIS console at configuration time (i.e. not at runtime). The process will then run under that account and the app will never need to know the password.

This might look like a mere convenience but it’s actually a big deal. If your app is compromised in the Windows Server scenario, there is no way it can reveal the user credentials. In the case of your Azure app, well, it could reveal them. Once a malicious party has access to account credentials, it has far more freedom to attack you than if it merely had access to an app running under that account.

But it doesn’t stop there…

Where do you store your secrets in your Azure app? 99% of the time, in the web.config. That makes it especially easy for a malicious party to access your secrets.
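Concretely, it usually boils down to something like this: the key sits in clear text in <appSettings> and the application reads it at runtime (the setting name is just an example).

```csharp
using System.Configuration;

class SecretsFromConfig
{
    // The usual pattern: the secret sits in clear text in <appSettings> and the
    // application reads it at runtime. The setting name is just an example.
    static string GetServiceBusIssuerKey()
    {
        return ConfigurationManager.AppSettings["ServiceBus.IssuerKey"];
    }
}
```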

Remember, an application deployed in Azure is accessible by anyone. The only thing protecting it is authentication. If you take an application living behind your firewall and port it to the cloud, you make it much more accessible (which is often an advantage, because partners or even your employees, from a hotel room, can reach it without going through the hoops of VPN connections), but you are also forced to store credentials in a less secure way!

On top of that, in terms of management, it’s a bit awkward because it mixes application parameters with secrets. Once a developer deploys, or creates a deployment package to pass to the sys-admin (or whoever plays that role; it might be a dev-ops developer, but typically not everyone in the dev group will know the production credentials), they must specify which arbitrary config keys the sys-admin has to override.

So in summary, we have the following issues:

  • Application knows secrets
  • Secrets are stored in an insecure way in the web.config
  • Secrets are stored with other configuration parameters and do not have a standard naming (you need to come up with one)

 

Ok. How do we fix it?

This one isn’t easy. Basically, my answer is: in the long run we could but cloud platforms haven’t reached a mature enough level to implement that today. But we can establish a roadmap and get there one day with intermediary steps easing the pain along the way.

Basically, the current situation is:


That is, the app gets credentials from an insecure secret store (typically the web.config), then requests an access token from an identity / token provider. It then uses that token to access resources. The credentials aren’t used anymore.

So a nice target solution would be:


Here the application requests the token from Windows Azure (we’ll discuss how) and Azure reads the secrets and fetches the token on behalf of the application. The application never knows about the secrets. If the application is compromised, it might still be able to get tokens, but not the credentials. This is a situation comparable to the Windows Server scenario we talked about above.

Nice. Now how would that really work?

Well, it would require a component in Azure, let’s call it the secret gateway, to have the following characteristics:

  • Has access to your secrets
  • Knows how to fetch tokens using those secrets (credentials)
  • Has a way to authenticate the application so that only the application can access it

That sounds like a job for an API. Here the danger is to design a .NET-specific solution. Remember that Azure isn’t only targeting .NET; it is able to host PHP, Ruby, Python, node.js, etc. On the other hand, if we move it to something super accessible (e.g. a web service), we’ll have the same problem authenticating the calls (i.e. requirement #3) as the one we started with.

I do not aim at a final solution here so let’s just say that the API would need to be accessible by any runtime. It could be a local web service for instance. The ‘authentication’ could then be a simple network rule. This isn’t trivial in the case of a free Web Site where a single VM is shared (multi-tenant) between other customers. Well, I’m sure there’s a way!

The first requirement is relatively easy. It would require Azure to define a vault and only the secret gateway to have access to it. No rocket science here, just basic encryption, maybe a certificate deployed with your application without your knowledge…

The second requirement is where the lack of maturity of the cloud platform becomes a curse. Whatever you design today, e.g. OAuth 2 authentication with SWT or JWT, is guaranteed to be obsolete within 2-3 years. The favorite token type seems to change every year (SAML, SWT, JWT, etc.), and so does the authentication protocol (WS-Federation, OAuth, OAuth 2, XAuth, etc.).

Nevertheless it could be done. It might be full of legacy stuff after 2 years, but it can keep evolving.

I see the secret gateway being configured in two parts:

  • You specify a bunch of key / values (e.g. BUS_SVC_IDENTITY : “svc.my.identity”)
  • You specify token mechanism and their parameter (e.g. Azure Storage SAS using STORAGE_ACCOUNT & STORAGE_ACCOUNT_ACCESS_KEY)

You could even have a trivial mechanism simply providing you with a secret. The secret gateway would then act as a vault…
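To make the proposal a bit more concrete, here is a purely hypothetical shape for the secret gateway API; nothing like this exists today and all the names are invented:

```csharp
using System.Threading.Tasks;

// Hypothetical contract for the proposed secret gateway: the application asks
// for tokens (or, in the trivial vault case, a raw secret) by configured key,
// and never sees the underlying credentials.
public interface ISecretGateway
{
    // e.g. GetTokenAsync("BUS_SVC_IDENTITY") would return an access token for Service Bus.
    Task<string> GetTokenAsync(string configuredKey);

    // Trivial vault mechanism: hand back a configured secret as-is.
    Task<string> GetSecretAsync(string configuredKey);
}
```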

We could actually build it today as a separate service if it weren’t for the third requirement.

 

Do you think this solution would be able to fly? Do you think the problem is worth Microsoft putting resources behind it (for any solution)?

Hope you enjoyed the ride!