Serving Files from MongoDB GridFS - Jan Švec

In many projects we use the schemaless database MongoDB. It gave us the flexibility we needed, because many details of the data representation became clear only during the projects. PyMongo maps BSON documents elegantly onto Python dictionaries, and it also ships the gridfs module for accessing files stored through GridFS. This note looks at how to serve these files through an HTTP server written in Python.

GridFS

GridFS is a specification for storing binary files in MongoDB. Files are split into chunks that are stored as separate documents, so a reader can fetch only the parts of a file it needs, which is useful for large files. Chunks are 255 kB by default. For a bucket named fs, GridFS uses two MongoDB collections: fs.files and fs.chunks. The first stores file metadata, the second the chunks themselves. Besides standard metadata, such as length, upload date, or MIME type, arbitrary custom metadata can be stored in the file's BSON document and queried later.
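
To make the benefit of chunking concrete, here is a minimal sketch of the arithmetic involved. It assumes the default chunk size of 255 * 1024 bytes and that chunks are numbered from zero (the n field of each chunk document). The helper name chunks_for_range is made up for this illustration and is not part of gridfs.

```python
# Default GridFS chunk size in bytes (255 kB), assumed here.
DEFAULT_CHUNK_SIZE = 255 * 1024


def chunks_for_range(start, end, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return the chunk indices that cover the byte range [start, end)."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    first = start // chunk_size
    last = (end - 1) // chunk_size
    return range(first, last + 1)


# A 10 MiB file is split into 41 chunks, but reading 1000 bytes
# from offset 1,000,000 only needs chunk number 3.
file_length = 10 * 1024 * 1024
total_chunks = -(-file_length // DEFAULT_CHUNK_SIZE)  # ceiling division
needed = list(chunks_for_range(1_000_000, 1_001_000))
print(total_chunks, needed)
```

A reader that seeks before reading therefore transfers only a small fraction of a large file from the database.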

In Python, GridFS is accessed through the gridfs module, which provides a file-like interface over database objects. For much of the Python standard library, GridFS files look like ordinary filesystem files and have methods such as read(), close(), and seek().

The goal is to show how to serve GridFS files to an HTTP client application.

Non-Python Solutions

Besides a Python solution, several alternatives exist. In principle, they either map GridFS into the filesystem using FUSE or provide a web-server module, for example for nginx.

At the time, none of these web-server modules was available in Debian. Therefore we extended an existing HTTP server written in Python to handle GridFS files.

Python Solution: web2py

I had already mentioned streaming GridFS files in web2py in an older note on abclinuxu.cz. Because a GridFS file is a file-like object in Python, it can be streamed directly from web2py. In the model, create a database connection and a GridFS instance:

from pymongo import MongoClient
import gridfs
db = MongoClient(settings.db_uri).stitky123
fs = gridfs.GridFS(db, 'files')

In the controller:

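import uuid

def download():
    # The first URL argument is parsed as a UUID, so files in this
    # bucket are expected to use UUID values as their _id.
    file_id = uuid.UUID(request.args[0])
    db_fr = fs.get(file_id)

    response.headers['Content-Type'] = db_fr.content_type

    # Stream the file-like GridOut object in 1024-byte pieces.
    return response.stream(db_fr, 1024)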

Python Solution: Tornado

The project that needed GridFS file serving was built on the Tornado web server. Tornado uses asynchronous, non-blocking network I/O, so a single process can keep many connections open at once.

Connecting GridFS with Tornado is simple in principle: implement a custom GET handler that looks up a GridFS file from the request path and sends it piece by piece. Because we needed to serve audio and video content, and therefore support HTTP range requests (the Range request header, which the server advertises with Accept-Ranges: bytes), I chose another route. Tornado already contains StaticFileHandler, which supports these requests and sends only the requested byte range.

What remained was to derive a new handler from StaticFileHandler and override its methods so that gridfs stands in for filesystem access.
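
For readers unfamiliar with range requests, the following sketch shows how a Range header value maps onto a byte range. It is not Tornado's own parser. The functions parse_byte_range and content_range are hypothetical helpers written for this note, and only the single-range forms are handled. The result uses a half-open (start, end) pair, the same convention the get_content() method in the handler assumes when it computes end - start.

```python
def parse_byte_range(header, size):
    """Translate a Range header value into a half-open (start, end) pair.

    Handles "bytes=a-b", "bytes=a-" and "bytes=-n" for a file of `size`
    bytes. Returns None if the value is not understood or not satisfiable.
    """
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        return None
    first, sep, last = spec.strip().partition("-")
    if not sep:
        return None
    try:
        if first == "":
            # Suffix form: the last n bytes of the file.
            n = int(last)
            if n <= 0:
                return None
            start, end = max(size - n, 0), size
        else:
            start = int(first)
            # The end position in the header is inclusive.
            end = size if last == "" else min(int(last) + 1, size)
    except ValueError:
        return None
    if start >= end:
        return None
    return start, end


def content_range(start, end, size):
    """Format the Content-Range value for a 206 response."""
    return "bytes %d-%d/%d" % (start, end - 1, size)


print(parse_byte_range("bytes=0-499", 1000))   # (0, 500)
print(content_range(0, 500, 1000))             # bytes 0-499/1000
```

A media player seeking in a video typically sends requests of the form bytes=a-, which is why serving only the requested part matters for large files.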

Implementation

Imports:

from tornado.ioloop import IOLoop
from tornado.httpserver import HTTPServer
from tornado.web import StaticFileHandler, HTTPError, Application
from pymongo import MongoClient
import gridfs
from bson.objectid import ObjectId, InvalidId

The new handler:

class GridFSHandler(StaticFileHandler):
    def initialize(self, fs):
        self.root = ''
        self.fs = fs
        self.fr = None

    def get_version(self, url_path):
        return None

    def get_content(self, abspath, start=None, end=None):
        if start is not None:
            self.fr.seek(start)
        if end is not None:
            remaining = end - (start or 0)
        else:
            remaining = None
        while True:
            chunk_size = self.fr.chunk_size
            if remaining is not None and remaining < chunk_size:
                chunk_size = remaining
            chunk = self.fr.read(chunk_size)
            if chunk:
                if remaining is not None:
                    remaining -= len(chunk)
                yield chunk
            else:
                if remaining is not None:
                    assert remaining == 0
                return

    def get_content_type(self):
        return self.fr.content_type

    def get_content_size(self):
        return self.fr.length

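    def get_modified_time(self):
        # HTTP dates have one-second resolution; drop the sub-second part
        # so that If-Modified-Since comparisons can succeed.
        return self.fr.upload_date.replace(microsecond=0)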

    def get_absolute_path(self, root, path):
        try:
            return ObjectId(path)
        except InvalidId:
            raise HTTPError(404)

    def validate_absolute_path(self, root, absolute_path):
        try:
            self.fr = self.fs.get(absolute_path)
        except gridfs.NoFile:
            raise HTTPError(404)
        return absolute_path

    def get_content_version(self, abspath):
        return abspath

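    def compute_etag(self):
        # ETag values must be quoted; send no ETag if no MD5 is stored.
        md5 = self.fr.md5
        return '"%s"' % md5 if md5 else None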

Implementation notes:

During initialisation, the only parameter is the GridFS object pointing to the bucket where files are stored.
Metadata such as content_type, content_size, and modified_time comes directly from the MongoDB object in the files collection.
get_absolute_path() returns an ObjectId instead of an absolute path.
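validate_absolute_path() opens the GridFS file based on the ObjectId; if it does not exist, the handler responds with 404.
The file’s MD5 stored in the database is used as the ETag. Note that the md5 field is deprecated in the GridFS specification, so files uploaded with newer drivers may not have it.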
get_content() is only a modification of the original StaticFileHandler method.
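In the gridfs version used here, a GridFS file opened for reading did not need to be closed, because GridOut.close() was an empty method. Newer PyMongo releases may release resources in close(), so it is worth checking the version in use.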
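
The loop in get_content() can be tried out without a database. The sketch below repeats it as a standalone function, iter_range, and runs it against FakeGridOut, a hypothetical in-memory stand-in that offers only what the loop needs from a GridOut object: seek(), read(), and a chunk_size attribute. Both names are made up for this illustration.

```python
import io


class FakeGridOut(io.BytesIO):
    """Hypothetical stand-in for gridfs.GridOut, backed by memory."""
    chunk_size = 4  # tiny on purpose, so that several reads are needed


def iter_range(fr, start=None, end=None):
    """Yield the bytes in [start, end) of fr, at most chunk_size at a time."""
    if start is not None:
        fr.seek(start)
    remaining = None if end is None else end - (start or 0)
    while True:
        size = fr.chunk_size
        if remaining is not None and remaining < size:
            size = remaining
        chunk = fr.read(size)
        if not chunk:
            return
        if remaining is not None:
            remaining -= len(chunk)
        yield chunk


data = b"0123456789abcdef"
parts = list(iter_range(FakeGridOut(data), start=3, end=13))
print(parts)  # [b'3456', b'789a', b'bc']
```

Only the ten requested bytes are read, and the last piece is shortened so that the range ends exactly at the requested position.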

The rest of the code for starting the HTTP server:

if __name__ == "__main__":
    db = MongoClient('mongodb://localhost/db').get_default_database()
    fs = gridfs.GridFS(db, 'data')

    application = Application([
        (r"/data/(.*)", GridFSHandler, {'fs': fs}),
    ])

    server = HTTPServer(application)
    server.listen(8888)
    IOLoop.current().start()

Conclusion

First, upload a file into the MongoDB GridFS bucket named data; see the GridFS documentation for details. The file can then be downloaded from a URI such as http://localhost:8888/data/53f64dba1608c0780e7dcaad, where 8888 is the server port and 53f64dba1608c0780e7dcaad is the file’s ObjectId.
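
The upload step might look like the following sketch. upload_file is a hypothetical helper, not part of gridfs. It relies only on the put() method of a GridFS object and guesses the MIME type from the file name, falling back to application/octet-stream (an assumption made here, since the handler needs some content type to send). The file name talk.mp3 in the usage comment is likewise only an example.

```python
import mimetypes
import os


def upload_file(fs, path):
    """Store the file at `path` in the GridFS bucket `fs`; return its _id."""
    content_type, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        return fs.put(
            f,
            filename=os.path.basename(path),
            content_type=content_type or "application/octet-stream",
        )


# Usage against a real database (assumes MongoDB on localhost and the
# same database and bucket names as in the server code):
#
#   from pymongo import MongoClient
#   import gridfs
#   db = MongoClient('mongodb://localhost/db').get_default_database()
#   file_id = upload_file(gridfs.GridFS(db, 'data'), 'talk.mp3')
#   print('http://localhost:8888/data/%s' % file_id)
```

The returned value is the ObjectId that forms the last part of the download URI.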

The code above is short, yet it is enough to serve GridFS files from MongoDB, including byte-range requests. It can be extended further, for example with per-user ACLs.