Sunday, March 08, 2009

Exploring Live Framework Notifications

In order to tune in to the data that matters to you, Live Framework lets you receive near-real-time notifications of changes by subscribing to objects and feeds.  Although the SDK documentation is fairly sparse at the moment, Viraj Mody wrote an excellent blog post describing how the notification system works, and John Macintyre also gave a great PDC session on notifications.  Being the curious sort, I still had many questions, so I dug in further and this blog post and ResourceClient library are the result.

Overview

Notifications are exposed in the high-level programming model through the ChangeNotificationReceived event.  Under the hood, this is made possible by queues, subscriptions, and notifications working together.  These three building blocks can be accessed through the resource-oriented programming model, enabling even more interesting scenarios.  Even if you have no desire to program at this level, understanding how notifications work can help you take fuller advantage of the high-level model.

Slide 9 from John Macintyre’s presentation shows how the pieces tie together.

[Image: SubscriptionsAndNotificationsSlide]

The typical usage pattern looks like this:

  • Client creates a queue
  • Client subscribes its queue to one or more resources or feeds
  • Client polls the queue’s Notifications feed
  • A resource or feed changes
  • Subscription Service posts a notification to each of the queues that are subscribed to the resource or feed
  • Client’s poll returns with one or more notifications
  • Client takes the watermark of the most recent notification and posts it to the queue
  • Client takes some action based on the notifications such as requesting the newest version of the changed resource or feed
  • Client polls the queue’s Notifications feed…

After the initial queue creation, a minimum of three round-trips are necessary to complete each cycle.  Later we will see a trick for doing this in two round-trips.

Although the client has to poll the queue, this is actually more of a push than a pull.  The HTTP request stays “parked” at the server until a notification arrives or a timeout expires (just like the relay binding’s HTTP “parked requests” in the .NET Service Bus, formerly BizTalk Services).  This means that when a notification arrives, the server can immediately push it to the client on the HTTP response of the parked request, saving half a round-trip of latency.  Jeremy Mazner has more to say on the subject in this Channel9 thread.

So far I have described the behavior of the cloud LOE.  We will cover the client LOE’s slightly different behavior later.

Recovery from failures

In John’s diagram above, the Queue Service and Subscription Service are both in-memory soft-state services.  This means it is possible for a Queue Service instance to go down, resulting in queue loss (including all its notifications).  A Subscription Service instance can also go down, resulting in loss of subscriptions.  In both cases, the client will receive a specific error notification the next time it tries to poll the queue.

In the case of queue loss, it is the responsibility of the client to create a new queue and resubscribe to each resource.  In the case of subscription loss, the client is told “resources from these addresses lost subscriptions” and it just needs to resubscribe to each one using its existing queue.

As Viraj notes in his blog post,

A short summary of the solution is that in cases where one or several Queue and/or PubSub Servers go down, the system is able to detect exactly what happened and take remedial action to restore state in the cloud in cooperation with clients (because clients were the original source for all the transient data that was resident on those servers before they lost state).

What can you subscribe to?

From the high-level object model, only the following types are subscribable via ChangeNotificationReceived:

  • MeshObject
  • MeshDevice
  • LiveItemCollection (feeds)
    • Mesh.Devices
    • Mesh.MeshObjects
    • Mesh.News
    • MeshObject.DataFeeds
    • MeshObject.Mappings
    • MeshObject.Members
    • MeshObject.News
    • These may not work yet:
      • Contact.Profiles
      • LOE.Contacts
      • LOE.Profiles
      • Member.Profiles

If you use the resource-oriented programming model, you can subscribe to all of the resources behind the high-level objects as well as:

  • ApplicationResource
  • ApplicationInstanceResource

In addition to the high-level feeds, you can also subscribe to feeds for:

  • MeshObject.Activities
  • Applications
  • InstalledApplications

One quirk is that MeshObject isn’t subscribable from the local LOE, although you can still subscribe to the local MeshObjects feed.

It is useful to consider the MeshObject->DataFeed->DataEntry hierarchy in terms of what is and isn’t subscribable.

  • Mesh/MeshObjects/Subscriptions
  • Mesh/MeshObjects/{id}/Subscriptions
  • Mesh/MeshObjects/{id}/DataFeeds/Subscriptions
    • this is a “feed of feeds”
  • Mesh/MeshObjects/{id}/DataFeeds/{id}/Subscriptions
    • this doesn’t exist
  • Mesh/MeshObjects/{id}/DataFeeds/{id}/Entries/Subscriptions
  • Mesh/MeshObjects/{id}/DataFeeds/{id}/Entries/{id}/Subscriptions
    • this doesn’t exist

What notifications do you receive?

You would expect all resources and feeds to notify you when resources are created, updated, or deleted, but that isn’t the case.  Some resources and feeds don’t notify you when entries are updated, some aren’t subscribable, and some are read-only.  This varies between the cloud LOE and the local LOE and between various objects.  Here’s an incomplete listing of which notification triggers work based on my experimentation:

Resource         | Local                                                       | Cloud
MeshObject       | Not subscribable                                            | Update, Delete
MeshDevice       | Not subscribable                                            | Nothing? at least not Update
MeshObjects feed | Create, Update, Delete                                      | Create, Delete (no Update)
Devices feed     | Can’t update locally                                        | Update, maybe others
DataFeeds        | Create, Update, Delete                                      | Create, Update, Delete
DataEntries      | Create, Update, Delete (with double notifications for each) | Create, Update, Delete

I’m not sure why, but a subscription to DataEntries creates two notifications for each change on the local LOE.

How subscriptions work

As you have seen, you can subscribe to any resource or feed that has a Subscriptions URL (ex: Mesh/MeshObjects/{id}/Subscriptions).  This Subscriptions feed only supports HTTP POST (no GET).  Individual subscriptions only support PUT (no DELETE).

Subscriptions have the following interesting properties:

  • NotificationQueueLink
  • ExpirationDuration
  • ResourceEntityTag

NotificationQueueLink is the only thing you need to include when you create your subscription.  This link should point to the SelfLink of the NotificationQueue you have created, not its NotificationsLink.

ExpirationDuration is expressed in seconds and is assigned a random number between 2700 and 4500 (45 min to 75 min) from the cloud LOE and a fixed value of 3600 (60 min) from the local LOE.  The internal class NotificationManager that is used to implement ChangeNotificationReceived has a hard-coded subscriptionRenewalInterval of 60 minutes, so it seems there’s a chance of cloud subscriptions being in an unrenewed state for up to 15 minutes, but I could be wrong.
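The arithmetic behind that 15-minute window can be written out explicitly.  This is only a sketch: SubscriptionRenewal and its two methods are hypothetical helper names, not part of the SDK, and the second method shows one possible alternative (renewing at a fraction of whatever ExpirationDuration the server returned) rather than anything the SDK is known to do.

```csharp
using System;

// Hypothetical helper for illustration; not an SDK type.
public static class SubscriptionRenewal
{
    // Worst-case number of seconds a subscription can sit expired when it is
    // renewed on a fixed interval but may expire as early as minExpirationSeconds.
    public static int WorstCaseGapSeconds(int renewalIntervalSeconds, int minExpirationSeconds)
    {
        return Math.Max(0, renewalIntervalSeconds - minExpirationSeconds);
    }

    // Alternative policy (an assumption, not SDK behavior): renew after a
    // fraction of the ExpirationDuration reported for this subscription.
    public static TimeSpan RenewAfter(int expirationDurationSeconds, double safetyFactor)
    {
        return TimeSpan.FromSeconds(expirationDurationSeconds * safetyFactor);
    }
}
```

With the numbers above, WorstCaseGapSeconds(3600, 2700) gives 900 seconds, which is the 15 minutes in question, and it gives 0 for the local LOE’s fixed 3600.  RenewAfter(2700, 0.75) would renew after 2025 seconds, before even the shortest cloud expiration.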

ResourceEntityTag is the ETag of the resource when the subscription was created.  I believe a notification is sent each time the resource’s ETag changes.

Note that the subscription doesn’t have a link to the resource or feed you subscribed to.  It is important for you to maintain your own copy of the URI to the resource you subscribed to.  First, if you want to imitate ChangeNotificationReceived and execute different event handlers for different subscriptions, you will need this URI to demux from notifications to event handlers.  Second, you will need this URI to create new subscriptions when you receive an AllSubscriptionsLost notification.

Clients should track their subscriptions for a variety of reasons:

  • You can’t GET the subscription feed
  • You can’t GET a subscription via its SelfLink
  • For subscription renewal
  • To recover from queue or subscription loss
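A client-side tracker covering the demux and recovery cases above might look something like the following.  This is a minimal sketch under stated assumptions: Notification, NotificationType, and SubscriptionTracker are hypothetical types whose names merely mirror the properties described in this post, and actually creating queues and subscriptions is left to the caller.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration; these are not the SDK's classes.
public enum NotificationType { ResourceChanged, SubscriptionLost, AllSubscriptionsLost, System }

public class Notification
{
    public NotificationType Type { get; set; }
    public string ResourceLink { get; set; }
}

public class SubscriptionTracker
{
    // Resource URI -> handler.  Kept on the client because subscriptions
    // can't be read back from the service.
    private readonly Dictionary<string, Action<Notification>> handlers =
        new Dictionary<string, Action<Notification>>();

    public void Track(string resourceUri, Action<Notification> handler)
    {
        handlers[resourceUri] = handler;
    }

    // Dispatches one notification and returns the resource URIs that now
    // need a fresh subscription (empty when nothing was lost).
    public List<string> Handle(Notification n)
    {
        var resubscribe = new List<string>();
        switch (n.Type)
        {
            case NotificationType.ResourceChanged:
            {
                Action<Notification> handler;
                if (n.ResourceLink != null && handlers.TryGetValue(n.ResourceLink, out handler))
                    handler(n);
                break;
            }
            case NotificationType.SubscriptionLost:
            {
                // Only this resource lost its subscription; the queue is still usable.
                if (n.ResourceLink != null && handlers.ContainsKey(n.ResourceLink))
                    resubscribe.Add(n.ResourceLink);
                break;
            }
            case NotificationType.AllSubscriptionsLost:
            {
                // The queue is gone: the caller creates a new queue first,
                // then resubscribes every tracked resource to it.
                resubscribe.AddRange(handlers.Keys);
                break;
            }
        }
        return resubscribe;
    }
}
```

The caller would register each resource URI with Track at the moment it subscribes, then pass every notification from a poll through Handle and resubscribe whatever comes back.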

How notification queues work

Notification queues are created by posting to Mesh/NotificationQueues.  If using AtomPub, you can simply post an empty entry:

<entry xmlns="http://www.w3.org/2005/Atom"/>

You can’t GET Mesh/NotificationQueues to see which queues exist, so you will need to maintain a reference to the queue you get back.
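If you aren’t using an Atom library, the empty entry for the queue-creation POST can be built by hand.  The following is a minimal sketch using LINQ to XML; AtomPayloads and EmptyEntryBody are hypothetical names, and only the payload itself comes from the service’s behavior described above.

```csharp
using System.Text;
using System.Xml.Linq;

// Hypothetical helper for illustration; not an SDK type.
public static class AtomPayloads
{
    private static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";

    // Builds the empty Atom entry used as the request body when creating a queue.
    public static byte[] EmptyEntryBody()
    {
        var entry = new XElement(Atom + "entry");
        return Encoding.UTF8.GetBytes(entry.ToString(SaveOptions.DisableFormatting));
    }
}
```

These bytes would form the body of the POST to Mesh/NotificationQueues.  The content type to send with it is presumably the standard AtomPub one (application/atom+xml), but that header value is an assumption here.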

Queues are intended to only have a single consumer (one queue per client), and that consumer’s queue usage should effectively be single-threaded.

There are a few important properties on queues:

  • ExpirationDuration
  • Watermark
  • SelfLink
  • NotificationsLink

ExpirationDuration is expressed in seconds and is 300 (5 min) from the cloud LOE and 600 (10 min) from the local LOE.  ExpirationDuration is the duration after which the notification queue expires if its Notifications feed isn’t polled.

Watermark is used together with PUT to remove all notifications in the queue with watermarks less than or equal to the watermark you sent.  Updating a watermark is destructive.  You can’t PUT an earlier watermark to roll back the queue, and that’s fine with me.

SelfLink is the URI you use for a new subscription’s NotificationQueueLink.  You also PUT to the SelfLink URI when updating the queue’s watermark.  SelfLink is PUT only.  You can’t GET or DELETE a queue.

NotificationsLink is the feed where new notifications appear.  This feed is not your typical feed.  As I mentioned earlier, polling the NotificationsLink “parks” the HTTP request on the server until a notification appears or a timeout expires.  This timeout varies between 25 and 30 seconds for the cloud LOE and is hopefully low enough so that intermediary proxies don’t prematurely close the connection.  Requests will return immediately if the queue already contains notifications.  Unfortunately, requests to the NotificationsLink on the local LOE always return immediately whether or not the queue is empty.  This is a significant problem not only because the programming model is inconsistent, but because it means that in order to get the same low latency as the cloud LOE you have to hammer the snot out of the local LOE, pegging the CPU in the process.  Hopefully this gets fixed.  The SDK sidesteps this issue by only polling every 5 seconds, but that loses much of the benefit of push notifications.
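One way to write a single polling loop that behaves sensibly against both LOEs is to enforce a minimum spacing between polls: a parked cloud request that took 25 seconds needs no extra wait, while a local request that returned instantly gets throttled.  This is a minimal sketch; PollPacer and DelayBeforeNextPoll are hypothetical names, and the 5-second floor in the usage note is simply the SDK interval mentioned above.

```csharp
using System;

// Hypothetical helper for illustration; not an SDK type.
public static class PollPacer
{
    // Returns how long to wait before starting the next poll so that polls
    // begin at least minInterval apart.  A poll that already took longer
    // than minInterval (for example, a parked request) needs no wait.
    public static TimeSpan DelayBeforeNextPoll(TimeSpan lastPollDuration, TimeSpan minInterval)
    {
        TimeSpan remaining = minInterval - lastPollDuration;
        return remaining > TimeSpan.Zero ? remaining : TimeSpan.Zero;
    }
}
```

In a loop, you would time each poll with a Stopwatch and sleep for DelayBeforeNextPoll(elapsed, TimeSpan.FromSeconds(5)) before polling again.  Lowering the floor trades CPU on the local LOE for lower latency.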

The notifications feed behaves strangely when it contains more than one notification.  Sometimes it displays all of the notifications (up to 10), and sometimes it only displays the first one.  For example, you can poll the queue and see 4 notifications.  You can poll it again and perhaps only see one.  Polling a third time might show all 4 again.  At least it always displays notifications in order (oldest first).  This means you shouldn’t count on seeing more than one notification at once, but be aware that it is possible.  Simply act on all of the notifications you receive, PUT the watermark of the last one, and poll the queue again.  Using this technique you will eventually see all of the notifications in the queue, even if it appears there is only one or if the queue appears to be clipping the results at 10 notifications.
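The drain technique can be tried out against an in-memory stand-in.  This is a sketch, not the service: FakeQueue and QueueDrainer are hypothetical names, watermarks are plain integers as on the local LOE, and the “every other poll returns only the first notification” rule is just a deterministic imitation of the inconsistent behavior described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a notification queue; not an SDK type.
public class FakeQueue
{
    private const int PageLimit = 10;   // the feed was never seen to return more than 10
    private readonly List<int> pending = new List<int>();   // watermarks, oldest first
    private int nextWatermark = 1;
    private int pollCount = 0;

    // Simulates the Subscription Service posting a notification to the queue.
    public void Enqueue()
    {
        pending.Add(nextWatermark++);
    }

    // Simulates a GET on the Notifications feed.  Every other poll returns
    // only the oldest notification; the rest return up to PageLimit.
    public List<int> Poll()
    {
        pollCount++;
        int take = (pollCount % 2 == 0) ? 1 : PageLimit;
        return pending.Take(take).ToList();
    }

    // Simulates a PUT of a watermark: removes everything at or below it.
    public void PutWatermark(int watermark)
    {
        pending.RemoveAll(w => w <= watermark);
    }
}

public static class QueueDrainer
{
    // Act on whatever arrives, acknowledge the last watermark, poll again.
    public static List<int> Drain(FakeQueue queue)
    {
        var seen = new List<int>();
        while (true)
        {
            List<int> batch = queue.Poll();
            if (batch.Count == 0) break;      // a real poll would park here instead
            seen.AddRange(batch);             // "act" on every notification received
            queue.PutWatermark(batch[batch.Count - 1]);
        }
        return seen;
    }
}
```

Because each pass acknowledges only what it actually received, nothing is skipped regardless of how many notifications a given poll happens to return.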

Queues are semi-private.  They can’t be seen by other user accounts and can’t be shared.  Since you can’t enumerate queues with GET and their URLs are randomly generated, the only way another app or client within the same user account can see your queue is if you choose to share the queue’s URI, although I can’t think of any good reason to do that other than for testing.

Queue loss

Queues can be lost either by failing to poll them within their ExpirationDuration or by the unexpected death of a queue manager in the LOE.  You will find out the queue is gone when you poll the queue’s notifications feed and receive an AllSubscriptionsLost notification.

You might be wondering, if the queue is gone, then how can I poll its notifications feed?  It turns out that if a queue doesn’t exist at a particular URI, a new “dead” queue will be automatically created at that URI.  You can see this by connecting to the cloud LOE with the LiveFX Resource Browser and visiting the following URI:

Mesh/NotificationQueues/1234-1234-1234/Notifications

Note that this only works on the cloud LOE.  The local LOE will return an error.  Apparently you shouldn’t expect to lose a queue on the local LOE.

A weird bit of trivia is that you can create two different queues with identical URIs in two different user accounts.  I’m guessing this is because queue managers are allocated per-user.

Another weird bit of trivia is that the AllSubscriptionsLost notification has a watermark that increments each time a subscription attempts (and fails) to post a notification to the queue.  This is one reason why the name AllSubscriptionsLost is misleading and ought to be renamed to something like QueueLost.  The subscriptions are most certainly still alive.

How notifications work

You can’t create notifications directly.  In other words, you can’t POST a new notification to a queue’s Notifications feed.  Notifications are only created by subscriptions (and queue or subscription loss).

Notifications have the following interesting properties:

  • NotificationType
  • ResourceLink
  • Watermark

NotificationType can be one of the following:

  • ResourceChanged
  • SubscriptionLost
  • AllSubscriptionsLost
  • System

ResourceChanged is what you will see as the result of subscribing to a resource.  SubscriptionLost is received when a Subscription Service unexpectedly dies on the cloud LOE.  SubscriptionLost is not received when a subscription expires due to its ExpirationDuration reaching zero.  As mentioned earlier, AllSubscriptionsLost would probably be better named QueueLost.  I’m not sure if System is used anywhere in the public API.  Perhaps it’s used for device connectivity.

ResourceLink points to the feed or object that changed, or in the case of SubscriptionLost, I believe it points to the resource whose subscription was lost.

Watermark is a counter string that increases with each new entry.  On the cloud LOE these look like “1.248.0”, “2.248.0”, “3.248.0”.  On the client LOE they are simply “1”, “2”, “3”.

Notification SelfLinks are incrementing integers on the cloud LOE and GUID-like strings on the local LOE.  You can’t GET a notification using its SelfLink.  You can only see notifications by polling the notifications feed.  You also can’t do queries on notification feeds.  If you try to use something like $skip or $top, you will get an AllSubscriptionsLost notification and every notification in the queue gets discarded without you ever seeing them.  However, the subscriptions still work and the queue continues to function normally afterward (other than the data loss).

Notifications and expansions

A cool trick for saving a round-trip is to use expansions to return changed resources and feeds inline in the notification results.  This is the trick promised earlier: because the changed data arrives with the poll response, the cycle drops from three round-trips to two.  A not-so-cool potential side-effect is that if the notifications feed returns more than one notification for the same resource or feed, the expansion will result in duplicate expanded data, wasting bandwidth.

If you could combine watermark updates with the next poll request, you could get the poll-watermark-update cycle from 3 round-trips down to just 1 round-trip.
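The round-trip counts can be tallied step by step.  This is purely illustrative arithmetic: RoundTripModel is a hypothetical name, and the piggybackWatermark option models the wished-for combination of watermark update and poll, which is not something the service is known to support.

```csharp
// Hypothetical model for illustration; not an SDK type.
public static class RoundTripModel
{
    // Counts the HTTP round-trips in one poll / watermark / fetch cycle.
    public static int RoundTripsPerCycle(bool useExpansion, bool piggybackWatermark)
    {
        int roundTrips = 1;                      // poll the Notifications feed
        if (!piggybackWatermark) roundTrips++;   // PUT the watermark to the queue
        if (!useExpansion) roundTrips++;         // GET the changed resource or feed
        return roundTrips;
    }
}
```

The plain cycle costs 3, expansion alone brings it to 2, and only the hypothetical combined request would reach 1.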

Notifications and activities

When you combine notifications, expansions, activities, and the cloud LOE, this enables near-real-time messaging between clients.  By subscribing to an Activities feed, clients can poll the Notifications feed using $expand and receive complete activity entries as soon as someone posts a new activity.  The latency in this scenario is half a round-trip to post the activity plus half a round-trip to receive the activity through your parked HTTP request to the notifications feed.  Since notifications, subscriptions, and activities all use in-memory stores, this should have quite good performance.  There are a number of issues with this technique that I’ll cover in a future blog post, but it is promising for near-real-time communications.

Sync bypasses update notifications

Sync appears to bypass notification of updated feed entries.  Specifically, if I subscribe to the MeshObjects feed on the local LOE, I am notified when sync from the cloud causes a MeshObject to be added or removed from the feed, but I am not notified if sync causes a MeshObject to be updated.  I haven’t experimented with other feed types, but the same issue might exist with feeds such as DataEntries.

Trigger support

You can use Create and Update triggers on subscriptions and queues.  Delete triggers aren’t persisted and therefore won’t work.  You might use this to POST a queue and create subscriptions for it in its PostCreateTrigger, saving a round trip.  Of course these subscriptions wouldn’t be tracked and managed by the SDK.  You could also create a MeshObject and add a subscription to it in the MeshObject’s PostCreateTrigger.

Updating resources and feeds through Resource Scripts or triggers doesn’t bypass notifications.

Client vs. Cloud

The client LOE waits for data to sync to it from the cloud before notifying you of any changes.  This can take a while, so if you want quick notifications, be sure to subscribe to the cloud LOE, not the local LOE.  It would sure be nice if local subscriptions caused the local LOE to subscribe to the same resource in the cloud (if connected), taking advantage of push notifications to achieve the same latency for local subscriptions as if you were subscribed directly to the cloud LOE.

It is probably stating the obvious at this point, but I think it’s worth noting that queues are one-way from cloud to client.

Notification delays

Although notifications are normally posted to queues immediately, it is possible for notifications to be delayed if there is a large backlog generated by lots of rapid updates to a subscribed resource or feed.  You might see a few updates trickle in, then 15 seconds later another batch of updates appears, and so on, for several minutes.

Device connectivity

Supposedly device connectivity uses the subscription and notification services under the hood as a signal channel for P2P session establishment.  It may be possible to see this in action and even participate in the process, but I haven’t explored this.  There must be a reason you can subscribe directly to individual MeshDevices, but I haven’t seen anything interesting pop up yet.  Check out George Moore’s description of P2P notifications and file sync for a fascinating scenario that I’m not sure is possible with the current CTP.

Other transports

Supposedly there is a TCP transport for receiving push notifications, but it doesn’t appear to be used or available at the moment.  This could be useful for chaining subscriptions from the local LOE to the cloud LOE (or other clients) while preserving an HTTP programming experience for developers.

The high-level programming model

If you have managed to read this far, you should now have a healthy appreciation for the services provided by the high-level programming model.  At this level, all we see is the ChangeNotificationReceived event on feeds, MeshObject, and MeshDevice.  You simply subscribe to this event on the appropriate object and your event handler will be called when entries are created (on feeds only of course), updated, and deleted.

Here are some of the gory details it takes care of for you:

  • Queue creation
  • Queue polling
  • Updating the watermark
  • Subscription creation
  • Subscription renewal
  • Recovering from queue loss
  • Recovering from subscription loss
  • Demuxing from notifications to event handlers

Let’s dig into that last one a bit deeper.  If you use Reflector to examine how ChangeNotificationReceived works, you will see that each object subscribes to receive any and all notifications that arrive on the queue.  When these notifications are received, each object iterates through all notification entries, checking to see if the notification’s ResourceLink matches its own SelfLink.  If there is a match, the object raises the ChangeNotificationReceived event and stops iterating the notification list, effectively filtering out multiple notifications for itself that might have been received in a single response.  It appears that if the client’s LiveOperatingEnvironment is configured with AutoLoadRelationships, the object is reloaded only after the event is raised, meaning that if you examine the object in your event handler, it may not yet contain the latest changes.  This is reported in the forums here and here and is supposed to be fixed in the next release.  Those threads also mention a performance issue with notifications when using the Silverlight library that will be fixed in the next release.
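The first-match behavior described above can be paraphrased in a few lines.  This is a sketch of the observed behavior, not the SDK’s actual code: WatchedObject, OnNotificationsReceived, and TimesRaised are hypothetical names, and notifications are reduced to their ResourceLink strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a high-level object; not an SDK type.
public class WatchedObject
{
    public string SelfLink { get; private set; }
    public int TimesRaised { get; private set; }
    public event EventHandler ChangeNotificationReceived;

    public WatchedObject(string selfLink)
    {
        SelfLink = selfLink;
    }

    // Called with every poll response, whether or not it concerns this object.
    public void OnNotificationsReceived(IEnumerable<string> resourceLinks)
    {
        foreach (string link in resourceLinks)
        {
            if (link == SelfLink)
            {
                TimesRaised++;
                EventHandler handler = ChangeNotificationReceived;
                if (handler != null) handler(this, EventArgs.Empty);
                break;   // first match wins; duplicates in this response are ignored
            }
        }
    }
}
```

Note that every watched object scans every response, so the cost of a poll result grows with the number of objects that have handlers attached.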

Programming at the Resource level

As part of my explorations, I wrote a library called ResourceClient that makes it easier to work directly with Resources such as queues, subscriptions, and notifications.  Here’s a brief example of the syntax it enables.

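using (new ResourceClientContext(username, password))
{
    // Create a queue on the cloud LOE and start polling it in the background.
    var queue = Uris.Cloud.NotificationQueues.Post(new NotificationQueueResource());
    queue.StartAutoPoll((notifications, context) =>
    {
        // Print each non-empty poll result as Atom.
        if (notifications.Entries.Any())
            Console.WriteLine(notifications.ToAtomString());
    });

    // Subscribe to the MeshObjects feed, then create a MeshObject to trigger a notification.
    Uris.Cloud.MeshObjects.Subscribe(queue);
    var mo = Uris.Cloud.MeshObjects.Post(new MeshObjectResource("My Object"));

    // Subscribe to the new object's DataFeeds feed, then create a DataFeed.
    mo.DataFeedsLink.Subscribe(queue);
    var feed = mo.DataFeedsLink.Post(new DataFeedResource("My Feed"));
}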

This code writes to the console Atom-formatted notifications for the new MeshObject and the new DataFeed.  I will cover the details of this library in another blog post.  For now I will just mention that in addition to being a generic resource-oriented API, it contains notification-specific helpers for automatic queue polling, manual polling, sending watermarks, subscribing, and dispatching to different event handlers for each subscription.

MeshNotificationPlayground sample app

I have written a little WPF app that demonstrates queues, subscriptions, notifications, and activities in action.

[Image: MeshNotificationPlayground]

You can create a queue, poll it, copy the queue’s URL to the clipboard for pasting into Resource Browser, select a notification and send its watermark, select a MeshObject and subscribe to its Activities feed, and create new Activities for the selected object.

Here are some things to try:

  • Poll an empty queue and see that the request returns after 25 to 30 seconds
  • Poll the queue with the WPF app and Resource Browser at the same time, notice that they both wait, then cause a new notification and see that both requests return immediately
  • Subscribe to multiple Activities feeds, create activities in each of them, and notice the resulting notifications have different ResourceLinks
  • Send a watermark that isn’t the last watermark and see that the queue only empties up to the watermark you sent
  • Wait 5 minutes and poll the queue to see an AllSubscriptionsLost notification
  • After receiving AllSubscriptionsLost, create more activities for subscribed feeds and see that AllSubscriptionsLost’s watermark increases each time
  • Click Refresh Activities List and watch MaxAge count down for each activity.  Notice the random MaxAges.  When a MaxAge reaches zero, the activity disappears.  See that this causes a new notification.
  • Create more than 10 notifications and see that no more than 10 are returned.  Send the latest watermark, poll again, and see that the remaining notifications appear.
  • Poll the queue repeatedly when it has multiple notifications and see that sometimes only 1 notification appears.
  • Click Create Activity many times in a row and see that notifications for that Activities feed continue to trickle in several minutes later.

You can download the code here.  It includes and uses the ResourceClient library.

Comparison to iPhone’s Push Notifications

If for some reason your eyes haven’t glazed over yet, check out Viraj’s analysis of the iPhone Push Notification Service.  If you read between the lines, this is a fascinating compare-and-contrast to Live Mesh’s notification solution, with Live Mesh being better in many ways.  It hints that Microsoft might imitate Apple and use notifications to mine usage metrics for apps.

Apple’s solution is slightly more efficient in that it creates a single channel from the cloud to each device, whereas Live Framework currently requires each app to establish its own channel.  This could be solved by having apps subscribe to the local LOE using local queues which would then transparently chain the subscriptions and queues through a single channel to the cloud LOE.

Conclusion

Hopefully this helps clarify how you should expect notifications to behave in your apps, as well as providing ideas for more creative uses of the building block features of queues, notifications, and subscriptions.  I plan to follow up with more details on my ResourceClient library, and write up a feature request that combines the best parts of notifications and activities to enable better real-time communication between clients.

4 comments:

Anonymous said...

Thanks so much for a detailed and interesting account of Live Framework notification. The SDK writers would do well to document these areas.

Aaron Couch said...

Oran,

I really liked your video. I like your blog to! I have a couple of my own. I too use live Mesh and find it very useful. It saved me one time when my laptop wouldn't turn on to a dead battery and a broken power cord the night before my report was due. Luckily I was able to recover all my data off my dad's computer on the Live Desktop. I did have a question for you. I am unable to get to http://developer.mesh-ctp.com and I'm not sure why. Any thoughts?

Oran Dennison said...

Thanks Aaron, glad you liked it. Unfortunately there have been a number of changes to the direction Mesh is heading since I last posted.

Live Mesh is being replaced by Windows Live Sync, and Live Mesh will be turned off approximately 6 months after Windows Live Sync is officially released. :-(

developer.mesh-ctp.com was turned off September 8th last year. There isn't a direct replacement for it at this point, although it appears it will likely be replaced by a combination of Windows Live Sync and Messenger Connect which is in limited beta right now.

Aaron Couch said...

I know. I heard and I'm not happy about that. It's sad really. They have such an awesome opportunity. If they could combine Live Mesh with Skydrive and give users 30gb (since it's ALREADY available) and sync with my folders for free like I already have it doing in a round about way. I wouldn't have to use two services and I could uninstall the rest of my online backup programs. That would certainly build loyalty to Microsoft. Apparently they value "space" more than that. Oh well. SugarSync and Dropbox I guess will remain my main sources after Mesh goes away.