In many projects we use the schemaless database MongoDB. It gave us the flexibility we needed, because many details of the data representation became clear only during the projects. The PyMongo driver maps BSON documents elegantly into Python dictionaries, and it also ships the gridfs module for accessing files stored through GridFS. In this note we will look at how to serve these files through an HTTP server written in Python.
GridFS
GridFS is a specification for storing binary files in MongoDB. Files are split into chunks that are stored as separate documents, which makes it possible to read only part of a file; this is useful for large files. Chunks are 255 kB by default. A bucket named fs is backed by two MongoDB collections: fs.files and fs.chunks. The first stores file metadata, while the second stores the chunks. Besides standard metadata, such as length, upload date, or MIME type, arbitrary custom metadata can be stored in the file's BSON document and queried later.
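To make the chunking concrete, here is a minimal sketch of the arithmetic involved. It assumes that the default chunk size means 255 × 1024 bytes, as in PyMongo; the helper names chunk_count and chunks_for_range are hypothetical and are not part of the gridfs API.

```python
# Assumed default GridFS chunk size: 255 KiB.
CHUNK_SIZE = 255 * 1024


def chunk_count(length, chunk_size=CHUNK_SIZE):
    """Number of chunk documents needed for a file of `length` bytes."""
    return -(-length // chunk_size)  # ceiling division


def chunks_for_range(start, end, chunk_size=CHUNK_SIZE):
    """Indices of the chunks that cover bytes [start, end) of a file."""
    if end <= start:
        return []
    first = start // chunk_size
    last = (end - 1) // chunk_size
    return list(range(first, last + 1))


# A 1 MiB file occupies five chunks.
print(chunk_count(1024 * 1024))
# Reading 1000 bytes from offset 300000 touches only the chunk with index 1.
print(chunks_for_range(300_000, 301_000))
```

This is why seeking inside a large file is cheap: only the chunks that overlap the requested range have to be fetched from the chunks collection.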
In Python, GridFS is accessed through the gridfs module, which provides a file-like interface over database objects. For much of the Python standard library, GridFS files look like ordinary filesystem files and have methods such as read(), close(), and seek().
The goal is to show how to serve GridFS files to an HTTP client application.
Non-Python Solutions
Besides a Python solution, several alternatives exist. In principle, they either map GridFS into the filesystem using FUSE or provide a web-server module, for example for nginx.
At the time, none of these web-server modules was available in Debian. Therefore we extended an existing HTTP server written in Python to handle GridFS files.
Python Solution: web2py
I had already mentioned streaming GridFS files in web2py in an older note on abclinuxu.cz. Because a GridFS file is a file-like object in Python, it can be streamed directly from web2py. In the model, create a database connection and a GridFS instance:
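
from pymongo import MongoClient
import gridfs

db = MongoClient(settings.db_uri).stitky123
fs = gridfs.GridFS(db, 'files')

In the controller:

import uuid

def download():
    # The file's _id is passed as the first URL argument; here it is a UUID.
    file_id = uuid.UUID(request.args[0])
    db_fr = fs.get(file_id)
    response.headers['Content-Type'] = db_fr.content_type
    # Stream the file-like GridOut object in 1024-byte pieces.
    return response.stream(db_fr, 1024)

Python Solution: Tornado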
The project that needed GridFS file serving was built on the Tornado web server. Tornado uses asynchronous, non-blocking I/O, which lets a single process serve many concurrent connections.
Connecting GridFS with Tornado is simple in principle: implement a custom GET handler that looks up a GridFS file based on the request path and sends it piece by piece. Because we needed to serve audio and video content, and therefore to support HTTP Range requests (which the server advertises through the Accept-Ranges response header), I chose another route. Tornado already contains StaticFileHandler, which supports these requests and sends only the requested byte range.
The remaining work was to derive a new handler from StaticFileHandler and override its methods so that gridfs stands in for filesystem access.
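For readers unfamiliar with Range requests, the following sketch shows how a single-range header maps to a byte interval. It is a simplified stand-in written for this note, not Tornado's own parser; the function names parse_byte_range and content_range are hypothetical, and multi-range requests are deliberately not handled.

```python
def parse_byte_range(header, size):
    """Translate a single-range 'Range' header into (start, end), end exclusive.

    Returns None when the header is missing, malformed, or unsatisfiable.
    """
    if not header or not header.startswith("bytes="):
        return None
    spec = header[len("bytes="):]
    if "," in spec:  # multiple ranges are out of scope for this sketch
        return None
    first, sep, last = spec.partition("-")
    if not sep:
        return None
    try:
        if first == "":
            # Suffix form "bytes=-N": the last N bytes of the file.
            count = int(last)
            if count <= 0:
                return None
            start, end = max(size - count, 0), size
        else:
            start = int(first)
            end = int(last) + 1 if last else size
    except ValueError:
        return None
    end = min(end, size)
    if start < 0 or start >= end:
        return None
    return start, end


def content_range(start, end, size):
    """Value of the Content-Range header for a 206 response."""
    return "bytes %d-%d/%d" % (start, end - 1, size)


print(parse_byte_range("bytes=0-499", 1000))
print(parse_byte_range("bytes=-200", 1000))
print(content_range(0, 500, 1000))
```

A media player seeking in a video typically sends an open-ended header such as bytes=500-, which this sketch maps to everything from offset 500 to the end of the file.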
Implementation
Imports:

from tornado.ioloop import IOLoop
from tornado.httpserver import HTTPServer
from tornado.web import StaticFileHandler, HTTPError, Application
from pymongo import MongoClient
import gridfs
from bson.objectid import ObjectId, InvalidId

The new handler:
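
class GridFSHandler(StaticFileHandler):
    def initialize(self, fs):
        # No filesystem root is needed; files are looked up in the GridFS bucket.
        self.root = ''
        self.fs = fs
        self.fr = None  # GridOut object, opened in validate_absolute_path()

    def get_version(self, url_path):
        # Static URL versioning is not used for GridFS files.
        return None

    def get_content(self, abspath, start=None, end=None):
        # Yield the requested byte range [start, end) one chunk at a time.
        if start is not None:
            self.fr.seek(start)
        if end is not None:
            remaining = end - (start or 0)
        else:
            remaining = None
        while True:
            chunk_size = self.fr.chunk_size
            if remaining is not None and remaining < chunk_size:
                chunk_size = remaining
            chunk = self.fr.read(chunk_size)
            if chunk:
                if remaining is not None:
                    remaining -= len(chunk)
                yield chunk
            else:
                if remaining is not None:
                    assert remaining == 0
                return

    def get_content_type(self):
        return self.fr.content_type

    def get_content_size(self):
        return self.fr.length

    def get_modified_time(self):
        # Drop sub-second precision so that the value compares cleanly with
        # If-Modified-Since, which has one-second resolution.
        return self.fr.upload_date.replace(microsecond=0)

    def get_absolute_path(self, root, path):
        # The "path" is the file's ObjectId in hexadecimal form.
        try:
            return ObjectId(path)
        except InvalidId:
            raise HTTPError(404)

    def validate_absolute_path(self, root, absolute_path):
        # Open the GridFS file; a missing file becomes a 404 response.
        try:
            self.fr = self.fs.get(absolute_path)
        except gridfs.NoFile:
            raise HTTPError(404)
        return absolute_path

    def get_content_version(self, abspath):
        return abspath

    def compute_etag(self):
        # ETag values must be quoted; md5 may be None if no checksum was stored.
        md5 = self.fr.md5
        if md5 is None:
            return None
        return '"%s"' % md5

Implementation notes: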
- During initialisation, the only parameter is the GridFS object pointing to the bucket where files are stored.
- Metadata such as content_type, content_size, and modified_time comes directly from the MongoDB object in the files collection.
- get_absolute_path() returns an ObjectId instead of an absolute path.
- validate_absolute_path() opens the GridFS file based on the ObjectId; if it does not exist, it returns 404.
- The file’s MD5 stored in the database is used as the ETag.
- get_content() is only a modification of the original StaticFileHandler method.
- A GridFS file opened for reading does not need to be closed; GridOut.close() in the gridfs module is empty.
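The range-reading logic of get_content() can be tried without a database. The sketch below applies the same idea to any seekable file-like object and uses io.BytesIO as a stand-in for the GridFS file; the function name iter_range and the 64 KiB default piece size are assumptions made for this illustration.

```python
import io


def iter_range(fileobj, start=None, end=None, chunk_size=64 * 1024):
    """Yield the bytes [start, end) of a seekable file-like object in pieces."""
    if start is not None:
        fileobj.seek(start)
    remaining = None if end is None else end - (start or 0)
    while remaining is None or remaining > 0:
        size = chunk_size if remaining is None else min(chunk_size, remaining)
        chunk = fileobj.read(size)
        if not chunk:
            break  # end of file reached
        if remaining is not None:
            remaining -= len(chunk)
        yield chunk


data = bytes(range(256)) * 4   # 1024 bytes of sample content
stand_in = io.BytesIO(data)    # stand-in for the opened GridFS file
pieces = list(iter_range(stand_in, start=100, end=400, chunk_size=128))
print([len(p) for p in pieces])            # [128, 128, 44]
print(b"".join(pieces) == data[100:400])   # True
```

Because the function is a generator, the server never holds more than one piece in memory, regardless of the size of the requested range.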
The rest of the code for starting the HTTP server:

if __name__ == "__main__":
    # Connect to the database named in the URI and open the 'data' bucket.
    db = MongoClient('mongodb://localhost/db').get_default_database()
    fs = gridfs.GridFS(db, 'data')
    application = Application([
        (r"/data/(.*)", GridFSHandler, {'fs': fs}),
    ])
    server = HTTPServer(application)
    server.listen(8888)
    # IOLoop.current() replaces the deprecated IOLoop.instance() alias.
    IOLoop.current().start()

Conclusion
First, upload a file into the MongoDB GridFS bucket named data; see the GridFS documentation for details. The file can then be downloaded from a URI such as http://localhost:8888/data/53f64dba1608c0780e7dcaad, where 8888 is the server port and 53f64dba1608c0780e7dcaad is the file’s ObjectId.
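On the client side, a partial download is an ordinary GET request with a Range header. The sketch below only builds such a request with the standard library and does not send it; BASE_URL mirrors the example URI above, and build_request is a hypothetical helper.

```python
from urllib.request import Request

# Assumed to match the /data/ route and port of the example server.
BASE_URL = "http://localhost:8888/data/"


def build_request(file_id, start=None, end=None):
    """Build a GET request for a stored file, optionally for bytes [start, end)."""
    request = Request(BASE_URL + file_id)
    if start is not None:
        # The Range header uses an inclusive last-byte position.
        last = "" if end is None else str(end - 1)
        request.add_header("Range", "bytes=%d-%s" % (start, last))
    return request


req = build_request("53f64dba1608c0780e7dcaad", start=0, end=1024)
print(req.full_url)
print(req.get_header("Range"))
```

With the server running, passing req to urllib.request.urlopen() should fetch the first 1024 bytes of the file.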
The code above is simple, but it enables straightforward serving of GridFS files from MongoDB. It can be extended further, for example with per-user ACLs.