Migrating to Python 3 on Google App Engine - Part 2 - The Datastore

This is Part 2 of my Migrating to Python 3 on Google App Engine blog series, covering the migration to the newest Google Datastore client API. I share my experience of moving from the App Engine DB API to NDB and finally to Cloud NDB.

The migration all the way to Cloud NDB, which is compatible with the App Engine Python 3 Standard Environment, is a multi-stage process. These were my steps at a high level:

  • Migrate to App Engine NDB

    • Replace google.appengine.ext.db imports with google.appengine.ext.ndb
    • Update my code to make it compatible with the NDB API.
    • Test locally using dev_appserver.py.
    • Deploy on App Engine and test there.
  • Migrate to Cloud NDB

    • Replace imports and library loading code to import the Cloud NDB library.
    • Wrap my WSGI app in a middleware providing the Cloud NDB client context.
    • Set up Datastore Emulator and test locally with dev_appserver.py.
    • Upload indexing configuration generated by the Datastore Emulator.
    • Update my App Engine service account to give permissions to access the Datastore.
    • Deploy to App Engine and test there.

While the first part, migrating to the NDB API, was chiefly about updating my models and other client code interacting with the Datastore, migrating to Cloud NDB is mainly about configuring the environment and much less about code changes in your actual application. In both cases you are accessing the same data in your Google Cloud Datastore, so the process fortunately doesn't involve a data migration.

Although the documentation is pretty good, there were some issues I ran into, which I will touch on along with my solutions.

Migrating from the DB API to NDB

Migrating to the NDB API from DB is all about making changes in your application code to make it compatible with the new API. In my case, I have client code accessing the API in two places:

  • In my models module defining the model classes.
  • In my web application views that obtain data using those models.

Admittedly, this post is about my migration journey. While it is likely that you will also need to take the steps explained here, there is probably other ground you need to cover as well, so I highly recommend that you also consult the official documentation on the subject.

The models module

Since I only have a few kinds of entity classes, like Post and Tag, they are all located in a single module. The next sections describe the changes I made in this models module.

Changing the import statement and dependent references

The first thing I did was update the import statement in the module containing the model definitions:

# I changed the import from this
from google.appengine.ext import db
# to this
from google.appengine.ext import ndb

Of course, if you are really lazy and don't want to update existing references in your model, you could do this:

from google.appengine.ext import ndb as db

But since my IDE can do the necessary refactoring in one sweep, I changed the references instead, for clarity's sake. For example:

class Post(db.Model):
    title = db.StringProperty()
    content = db.TextProperty()

turned into:

class Post(ndb.Model):
    title = ndb.StringProperty()
    content = ndb.TextProperty()

Update deprecated property types

Some of the property classes in the DB API have been superseded by new ones, so I had to update them with their NDB replacements. In my models, I had two such properties. The first was db.ReferenceProperty:

post = db.ReferenceProperty(Post)

had to be replaced with:

post = ndb.KeyProperty(kind=Post)

where Post is the referenced model class.
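
One consequence worth noting: ndb.KeyProperty stores a key, not the entity itself, so where the old db.ReferenceProperty dereferenced automatically on attribute access, you now fetch the referenced entity explicitly. A minimal sketch (entity stands for an instance of whichever model defines this post property):

# entity is an instance of the model that defines the `post` KeyProperty
post_key = entity.post   # this is an ndb.Key, not a Post entity
post = post_key.get()    # fetch the referenced Post entity explicitly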

The other was db.ListProperty(db.Key):

tags = db.ListProperty(db.Key)

had to be replaced with:

tags = ndb.KeyProperty(repeated=True)

Using models in queries with NDB

Now that the models have been updated for the NDB API, I also had to make changes in views communicating with these models.

With DB, you use the all() model method to get a query object used to filter and ultimately collect the results:

myModel = Model()
q = myModel.all()

this has to be changed to:

myModel = Model()
q = myModel.query()

to use the query() method instead.

Filtering and ordering results

Filtering has also changed significantly: instead of the pattern strings used in DB, you now use the model classes and their properties when defining filter conditions. For example, I had to change this:

posts = Post()
qPosts = posts.all()
qPosts.filter('published =', True)
qPosts.order('-createDate')

to this:

posts = Post()
qPosts = posts.query()
qPosts = qPosts.filter(Post.published == True)
qPosts = qPosts.order(-Post.createDate)

The arithmetic and comparison operators for these property classes have been overloaded so that you can use them in these filter and ordering expressions.

There is one more thing to note here. Previously, I created the query object and then simply called methods on it to refine the query further. With the new API, however, the query object the method is called on is not modified; instead, the call returns a new, modified query object, which you have to carry forward if you want the filters and ordering you applied to take effect.
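
To make the difference concrete, here is a short sketch (illustrative, not taken verbatim from my views): the filtered query has to be captured in a variable or chained, otherwise the refinement is silently lost.

# NDB query objects are immutable: filter() and order() return new queries.
q = Post.query()
q.filter(Post.published == True)        # returned query discarded, q unchanged!
q = q.filter(Post.published == True)    # correct: keep the returned query
q = q.order(-Post.createDate)
# Or simply chain the calls:
q = Post.query().filter(Post.published == True).order(-Post.createDate)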

Keys and IDs

Getting an entity's ID as an integer became more explicit. With the DB API, I was reading an entity's ID like this:

tagId = tag.key().id()

and I had to change it to this:

tagId = tag.key.integer_id()

where tag is a model instance.

Note two things:

  • I just read the entity's key using the key property; it is no longer a method.
  • The method reading the ID explicitly tells you that it will be represented as an integer. (You can also get the string representation using key.string_id(), which supersedes key.name().)
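
As a quick illustration (a sketch with hypothetical variable names), the integer ID can also be used later to rebuild the key or re-fetch the entity:

tagId = tag.key.integer_id()       # plain integer ID
tag_again = Tag.get_by_id(tagId)   # re-fetch the entity by its ID
tag_key = ndb.Key(Tag, tagId)      # or reconstruct the full key
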
Count doesn't support offset

Although the documentation doesn't mention this, the offset parameter is no longer supported by the count() method. Hence I had to change this code used for pagination:

postsOnThisPage = qPosts.count(offset=(page-1)*postNum, limit=postNum)

to this:

qPosts.count() > page*postNum

Actually, it is a lot simpler this way, so I didn't mind too much. q.fetch() still supports offset, so nothing changes in that regard.
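
For context, this is roughly how the comparison is used in my pagination logic (a hedged sketch; has_next_page is an illustrative name, not necessarily my exact code):

# Is there at least one more post beyond the current page?
has_next_page = qPosts.count() > page * postNum
# fetch() still accepts offset, so the current page is retrieved as before.
posts = qPosts.fetch(offset=(page - 1) * postNum, limit=postNum)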

Query results

To conveniently display query results, I pass query result objects as template variables to my Jinja templates. However, the API for getting the query results has also changed, so this result passed into the template:

qPosts.run(offset=(page-1)*postNum, limit=postNum)

had to be changed to this:

qPosts.iter(offset=(page-1)*postNum, limit=postNum)

which returns an iterator that you can loop over in the template.
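
For completeness, this is roughly how the iterator ends up being consumed (a sketch; jinja_env and the template name are illustrative placeholders, not my exact code):

posts_iter = qPosts.iter(offset=(page - 1) * postNum, limit=postNum)
# The template loops over it with {% for post in posts %} ... {% endfor %}
html = jinja_env.get_template('posts.html').render(posts=posts_iter)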

Getting multiple entities with keys

MyModel.get_by_id() no longer takes sequences or objects supporting the iteration protocol:

qTags = Tag.all()
postTags = Tag.get_by_id(item.id() for item in post.tags)

Using this generator expression here is no longer valid; you have to use the ndb.get_multi(keys) utility function to get multiple entities using keys. I wanted to avoid importing NDB into my views and wanted to keep Datastore concerns in my models module, so I did this in two steps.

First, I created a class method in the model class where I needed this functionality:

class Tag(ndb.Model):
    # property definitions
    # ...

    @classmethod
    def get_taglist(cls, keys):
        return ndb.get_multi(keys)

Then I can load multiple entities with keys like this:

qTags = Tag.query()
postTags = Tag.get_taglist(post.tags)

Deleting an entity

With the DB API, you can call the delete() method on the entity:

entity.delete()

with NDB, this method has moved to the entity key object:

entity.key.delete()
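
As a related convenience, if you need to delete several entities at once, the ndb.delete_multi() utility takes a list of keys. A minimal sketch, reusing the key list stored on a post in my models:

# Delete every Tag whose key is stored in the post's tags list, in one call.
ndb.delete_multi(post.tags)
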
Keys vs. entities

NDB is picky about types. With the DB API it was OK to pass an entity where a key was required; this is no longer true:

post = Post()
# This doesn't work with NDB
post.category = Category.get_by_id(category_id)
# I had to do this instead
post.category = Category.get_by_id(category_id).key

The entity save method has been deprecated

# This doesn't work
post.save()
# Use this instead
post.put()

Migrating to Cloud NDB

My account of migrating to Cloud NDB follows. The official documentation is pretty good and gives a comprehensive overview of the process with many references. Don't forget to check it out before moving forward with any changes.

API changes are small, but they exist

Compared to the migration from the DB to the NDB API, where you need to make many changes to your client code, migrating to Cloud NDB doesn't change the API too much, but there are some API changes that you need to be aware of. These mainly affect the portions of the NDB API that rely on services specific to the App Engine Python 2.7 runtime.

As far as my application is concerned, I didn't have to make any changes to my code apart from importing and loading the Cloud NDB library, which I will cover in a minute.

No App Engine Memcache

App Engine NDB uses the App Engine Memcache service to cache data; Cloud NDB, however, no longer uses Memcache. If you want in-memory caching, you need to set it up separately.

I didn't feel I needed caching for my low-traffic, small-database blog application, so this step is not covered here.

It is worth noting that Memorystore for Redis doesn't have a free tier.

Importing the Cloud NDB client library

In order to import the Cloud NDB client library in the Python 2 Standard Environment, you have to add some extra statements to the appengine_config.py bootstrap script. I already had my external libraries bundled with the application and loaded from the pylibs directory, but I also had to import the pkg_resources module and call pkg_resources.working_set.add_entry(path) to make importing the google.cloud.ndb package possible. This is what the relevant snippet of my appengine_config.py looked like:

import pkg_resources
from google.appengine.ext import vendor
# Add any libraries installed in the pylibs directory.
path = 'pylibs'
vendor.add(path)
pkg_resources.working_set.add_entry(path)
# Then load NDB
from google.cloud import ndb

Here pylibs is the directory in the project root where my bundled libraries are stored. I also imported ndb because I defined my WSGI middleware creating the NDB client context in appengine_config.py as well. Of course, if you define your middleware in a different module, you don't have to import ndb in appengine_config.py.

Creating the NDB client context middleware

One of the bigger changes when migrating to Cloud NDB concerns how the library is set up. With App Engine NDB, the library is built into the runtime, so you can just import ndb and start using it in your code.

With Cloud NDB, however, things are a bit trickier: you have to instantiate the ndb.Client class, and the resulting object holds everything related to the connection to the Google Cloud Datastore. This is simple enough, but all your code interacting with the NDB library then has to run in a context generated from this client object.

Fortunately, the migration documentation for Cloud NDB gives a few strategies with examples for different kinds of applications, including Flask, which can be generalized to pretty much any WSGI-compliant web application. If you have a high-level understanding of how a WSGI application is invoked by a web server, the middleware needed here is easy to understand.

In essence, a WSGI-compliant web application is simply a callable object (a function, method, class, or instance with a __call__ method) that takes two arguments. The WSGI-compliant web server calls this callable whenever a request is processed. The first argument is a dictionary of CGI-style environment variables, mainly describing the request; it is customary to call this argument environ. The second argument is a callable that the application must call before returning the response payload; it is usually called start_response.
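
To illustrate (a generic sketch, not part of my application), the smallest possible WSGI application looks something like this:

def application(environ, start_response):
    # environ: dict of CGI-style variables describing the request
    # start_response: callable used to set the status line and headers
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello, WSGI!']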

I don't want to go into further detail about this; if you need an accurate account of how this works, I recommend reading PEP 3333, which contains the latest WSGI specification.

The good thing about this clearly specified interface is that you can wrap an existing WSGI application in another callable and add extra functionality by manipulating the arguments passed by the web server or the response returned by the application. This is fine as long as you expose the same WSGI-compliant interface to the server.

This wrapper is called a middleware in WSGI parlance, and in my case it was a great way to inject the Cloud NDB client context, so that the whole application runs in that context and any part of it can communicate with the Datastore. This is what my implementation looks like:

# Instantiate the client that will be used to create the context
client = ndb.Client()

class NdbWsgiMiddleware:
    """Wraps the WSGI application into the NDB client context."""
    def __init__(self, wsgi_app):
        self.wsgi_app = wsgi_app

    def __call__(self, environ, start_response):
        with client.context():
            return self.wsgi_app(environ, start_response)

I pass in the original WSGI app when instantiating this middleware. When the resulting object is called, it simply propagates the call to the original application object and returns whatever the application callable returns, but it does so inside the NDB context, so the application has access to that context. This is how it is used when instantiating the application (I was still using webapp2 at this point):

import webapp2

from appengine_config import NdbWsgiMiddleware

app = NdbWsgiMiddleware(webapp2.WSGIApplication())

Of course, this middleware could just be a function; the only reason I defined it as a class is that I wanted to keep a reference to the original application object, so that I could set up routes and other configuration options after instantiation, e.g.:

app.wsgi_app.router.add(webapp2.Route('/blog/tags', handler=AllTags, name='tags'))

This is not necessary for Flask: you could just use a function as your middleware and use the app.wsgi_app property provided by Flask to hook your Cloud NDB middleware in.
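
For reference, this is roughly what the Flask variant from the migration documentation looks like (a sketch I didn't use myself at this point, so treat it as illustrative):

from flask import Flask
from google.cloud import ndb

client = ndb.Client()

def ndb_wsgi_middleware(wsgi_app):
    def middleware(environ, start_response):
        # Run every request inside an NDB client context.
        with client.context():
            return wsgi_app(environ, start_response)
    return middleware

app = Flask(__name__)
app.wsgi_app = ndb_wsgi_middleware(app.wsgi_app)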

Setting up Datastore Emulator for local development

While using the DB and NDB APIs baked into the App Engine runtime, you can rely on the Datastore emulation built into the development server. With Cloud NDB, however, you either have to use your Datastore environment in Google Cloud or the Datastore Emulator. You can install the latter as a component with the gcloud command:

gcloud components install cloud-datastore-emulator

alternatively, if you used your favorite Linux package manager to install the Google Cloud SDK, you should install the Datastore Emulator component with that as well. For example, on Debian:

sudo apt-get update && sudo apt-get install google-cloud-sdk-datastore-emulator

Then you can start the emulator with:

gcloud beta emulators datastore start

Once it is running, you can read the environment variables exposing the emulator runtime details by running the following command:

gcloud beta emulators datastore env-init

The documentation recommends either setting these environment variables in the client application's shell environment by running $(gcloud beta emulators datastore env-init) or exporting them manually. The problem is that dev_appserver.py doesn't read these from the shell environment, so setting them there doesn't work. The workaround for me was to set these variables in appengine_config.py before the Cloud NDB Client object is instantiated:

import os

if os.getenv('SERVER_SOFTWARE', '').startswith('Google App Engine/'):
    pass
else:
    # The local datastore emulator details are not
    # read from the shell environment, hence I need to
    # add them here manually.
    os.environ['DATASTORE_DATASET'] = 'redacted'
    os.environ['DATASTORE_EMULATOR_HOST'] = 'localhost:8081'
    os.environ['DATASTORE_EMULATOR_HOST_PATH'] = 'localhost:8081/datastore'
    os.environ['DATASTORE_HOST'] = 'http://localhost:8081'
    os.environ['DATASTORE_PROJECT_ID'] = 'redacted'

This snippet uses the SERVER_SOFTWARE environment variable to determine whether the application is running on a Google App Engine instance, in which case it doesn't do anything; otherwise, it sets the Datastore environment variables to point at the local Datastore Emulator. Once again, this code has to run before the Cloud NDB client is instantiated. You can find further related information in this GitHub issue.

Issues with the Six package dependency

When I first attempted to run my application connecting to the Datastore Emulator, I ran into some issues when the system tried to import the ndb package. The tracebacks led to the six package, which is a dependency of ndb. After digging around a bit, I found this GitHub issue and managed to resolve the error by manually importing and reloading six before importing ndb, as one of the comments suggested:

import six; reload(six)
from google.cloud import ndb

If you encounter similar issues, try the snippet above; it might help you as well. Note: after I migrated to the Python 3 environment, this workaround was no longer needed, so it was only temporary.

Updating your App Engine permissions for accessing the Datastore

I did a test deployment on App Engine and the application wasn't able to connect to the Datastore. After digging around, it turned out from this GitHub issue that I had to update the permissions of my App Engine application's service account so that it has access to the Datastore service.
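
The fix typically amounts to granting the Cloud Datastore User role to the App Engine default service account. One way to do that is with gcloud (a sketch, not my exact steps; it assumes the default PROJECT_ID@appspot.gserviceaccount.com account, and PROJECT_ID stands for your own project ID):

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_ID@appspot.gserviceaccount.com" \
    --role="roles/datastore.user"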

Generating indexes for more complex queries

After migrating to Cloud NDB, you need to have the right indexes in place in your Datastore for more complex queries. This is a requirement I wasn't aware of, so I ran into this issue.

When you test your application with the Datastore Emulator, an index.yaml file is auto-generated in the emulator's project directory: as you run the queries that will need these indexes in the live environment, the emulator generates the corresponding index configuration. After you are done, you can use the gcloud utility to upload the index configuration to the Datastore.
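
The upload itself is a single gcloud command (a sketch; point it at wherever the emulator wrote its index.yaml):

gcloud datastore indexes create index.yaml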

This means, however, that I had to test every nook and cranny of my application to make sure that all the possible queries were run locally and all indexes were created. This might not be easy for larger applications without a test automation suite, so keep this in mind; maybe it is the right time to introduce automation into your workflow. 😉

The next part

In the next part of this series, I will share the steps of migrating to Pyramid with you.