16장 Data in a Box-Persistent Storage
Table of contents
- Flat Text Files
- Padded Text Files
- Tabular Text Files
- Binary Files
- Relational Databases
- NoSQL Data Stores
- Full-Text Databases
- Coming Up
- Things to Do
레코드
는 개별 필드
로 구성된 관련 데이터의 덩어리다.
Flat Text Files
The simplest persistence is a plain old flat file. This works well if your data has a very
simple structure and you exchange all of it between disk and memory. Plain text data
might be suitable for this treatment.
Padded Text Files
In this format, each field in a record has a fixed width, and is padded (usually with
space characters) to that width in the file, giving each line (record) the same width. A
programmer can use seek() to jump around the file and only read and write the
records and fields that are needed.
Tabular Text Files
With simple text files, the only level of organization is the line. Sometimes, you want
more structure than that. You might want to save data for your program to use later,
or send data to another program.
There are many formats, and here’s how you can distinguish them:
- A separator, or delimiter, character like tab (‘\t’), comma (‘,’), or vertical bar (‘|’). This is an example of the comma-separated values (CSV) format.
- ’<’ and ‘>’ around tags. Examples include XML and HTML.
- Punctuation. An example is JavaScript Object Notation (JSON).
- Indentation. An example is YAML (which is recursively defined as “YAML Ain’t Markup Language”).
- Miscellaneous, such as configuration files for programs.
Each of these structured file formats can be read and written by at least one Python module.
CSV
Delimited files are often used as an exchange format for spreadsheets and databases.
You could read CSV files manually, a line at a time, splitting each line into fields at comma separators, and adding the results to data structures such as lists and dictionaries. But it’s better to use the standard csv module, because parsing these files can get more complicated than you think. Following are a few important characteristics to keep in mind when working with CSV:
Some have alternate delimiters besides a comma: ‘ ’ and ‘\t’ (tab) are common. - Some have escape sequences. If the delimiter character can occur within a field, the entire field might be surrounded by quote characters or preceded by some escape character.
- Files have different line-ending characters. Unix uses ‘\n’, Microsoft uses ‘\r\n’, and Apple used to use ‘\r’ but now uses ‘\n’.
- The first line may contain column names.
import csv
# data
villains = [
['Doctor', 'No'],
['Rosa', 'Klebb'],
['Mister', 'Big'],
['Auric', 'Goldfinger'],
['Ernst', 'Blofeld'],
]
# csv file 쓰기
with open('villains', 'wt', newline='') as fout: # a context manager
csvout = csv.writer(fout)
csvout.writerows(villains)
# csv file 읽기
with open('villains', 'rt') as fin: # context manager
cin = csv.reader(fin)
villains = [row for row in cin] # a list comprehension
print(villains)
'''
[['Doctor', 'No'], ['Rosa', 'Klebb'], ['Mister', 'Big'], ['Auric', 'Goldfinger'], ['Ernst', 'Blofeld']]
'''
딕셔너리 형식으로 읽고 쓰기
import csv
# data
villains = [
['Doctor', 'No'],
['Rosa', 'Klebb'],
['Mister', 'Big'],
['Auric', 'Goldfinger'],
['Ernst', 'Blofeld'],
]
# csv file 딕셔너리 형식으로 읽기
with open('villains', 'rt') as fin: # context manager
cin = csv.DictReader(fin,fieldnames=['first','last'])
villains = [row for row in cin] # a list comprehension
print(villains)
'''
[
{'first': 'Doctor', 'last': 'No'},
{'first': 'Rosa', 'last': 'Klebb'},
{'first': 'Mister', 'last': 'Big'},
{'first': 'Auric', 'last': 'Goldfinger'},
{'first': 'Ernst', 'last': 'Blofeld'}
]
'''
# csv file 딕셔너리 형식으로 쓰기
with open('villains', 'wt', newline='') as fout: # a context manager
csvout = csv.DictWriter(fout, fieldnames=['first','last'])
csvout.writerows(villains)
Now, let’s read it back. By omitting the fieldnames argument in the DictReader() call, we tell it to use the values in the first line of the file (first,last) as column labels and matching dictionary keys:
>>> import csv
>>> with open('villains.csv', 'rt') as fin:
... cin = csv.DictReader(fin)
... villains = [row for row in cin]
>>> print (villains)
[OrderedDict([('first', 'Doctor'), ('last', 'No')]),
OrderedDict([('first', 'Rosa'), ('last', 'Klebb')]),
OrderedDict([('first', 'Mister'), ('last', 'Big')]),
OrderedDict([('first', 'Auric'), ('last', 'Goldfinger')]),
OrderedDict([('first', 'Ernst'), ('last', 'Blofeld')])]
XML
Delimited files convey only two dimensions: rows (lines) and columns (fields within a line). If you want to exchange data structures among programs, you need a way to encode hierarchies, sequences, sets, and other structures as text.
XML is a prominent markup format that does this. It uses tags to delimit data, as in this sample menu.xml file:
<?xml version="1.0"?>
<menu>
<breakfast hours="7-11">
<item price="$6.00">breakfast burritos</item>
<item price="$4.00">pancakes</item>
</breakfast>
<lunch hours="11-3">
<item price="$5.00">hamburger</item>
</lunch>
<dinner hours="3-10">
<item price="8.00">spaghetti</item>
</dinner>
</menu>
Following are a few important characteristics of XML:
- Tags begin with a < character. The tags in this sample were menu, breakfast, lunch, dinner, and item.
- Whitespace is ignored.
- Usually a start tag such as <menu> is followed by other content and then a final matching end tag such as </menu>.
- Tags can nest within other tags to any level. In this example, item tags are chil‐ dren of the breakfast, lunch, and dinner tags; they, in turn, are children of menu.
- Optional attributes can occur within the start tag. In this example, price is an attribute of item.
- Tags can contain values. In this example, each item has a value, such as pancakes for the second breakfast item.
- If a tag named thing has no values or children, it can be expressed as the single tag by including a forward slash just before the closing angle bracket, such as
, rather than a start and end tag, like
. - The choice of where to put data—attributes, values, child tags—is somewhat arbitrary. For instance, we could have written the last item tag as
- .
XML is often used for data feeds and messages and has subformats like RSS and Atom.
Some industries have many specialized XML formats, such as the finance field.
XML’s über-flexibility has inspired multiple Python libraries that differ in approach and capabilities.
The simplest way to parse XML in Python is by using the standard ElementTree module. Here’s a little program to parse the menu.xml file and print some tags and attributes:
import xml.etree.ElementTree as et
tree = et.ElementTree(file='menu.xml')
root = tree.getroot()
print(root.tag)
for child in root:
print ('tag:', child.tag, 'attributes:', child.attrib)
for grandchild in child:
print (' \t tag:', grandchild.tag, 'attributes:', grandchild.attrib)
menu
tag: breakfast attributes: {'hours': '7-11'}
tag: item attributes: {'price': '$6.00'}
tag: item attributes: {'price': '$4.00'}
tag: lunch attributes: {'hours': '11-3'}
tag: item attributes: {'price': '$5.00'}
tag: dinner attributes: {'hours': '3-10'}
tag: item attributes: {'price': '8.00'}
>>> len(root) # number of menu sections
3
>>> len(root[0]) # number of breakfast items
2
For each element in the nested lists, tag is the tag string and attrib is a dictionary of its attributes. ElementTree has many other ways of searching XML-derived data, modifying it, and even writing XML files. The ElementTree documentation has the details.
Other standard Python XML libraries include the following:
- xml.dom
The Document Object Model (DOM), familiar to JavaScript developers, represents web documents as hierarchical structures. This module loads the entire XML file into memory and lets you access all the pieces equally.
Simple API for XML, or SAX, parses XML on the fly, so it does not have to load everything into memory at once. Therefore, it can be a good choice if you need to process very large streams of XML.
An XML Security Note
You can use all the formats described in this chapter to save objects to files and read them back again. It’s possible to exploit this process and cause security problems.
For example, the following XML snippet from the billion laughs Wikipedia page defines 10 nested entities, each expanding the lower level 10 times for a total expansion of one billion:
<?xml version="1.0"?>
<!DOCTYPE lolz [
<!ENTITY lol "lol">
<!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
<!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
<!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
<!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
<!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
<!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;">
<!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
<!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
<!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>
The bad news: billion laughs would blow up all of the XML libraries mentioned in the
previous sections. Defused XML lists this attack and others, along with the vulnera‐
bility of Python libraries.
The link shows how to change the settings for many of the libraries to avoid these
problems. Also, you can use the defusedxml library as a security frontend for the
other libraries:
>>> # insecure:
>>> from xml.etree.ElementTree import parse
>>> et = parse(xmlfile)
>>> # protected:
>>> from defusedxml.ElementTree import parse
>>> et = parse(xmlfile)
The standard Python site also has its own page on XML vulnerabilities.
HTML
Gigagobs of data are saved as Hypertext Markup Language (HTML), the basic docu‐
ment format of the web. The problem is that much of it doesn’t follow the HTML
rules, which can make it difficult to parse. HTML is a better display format than a
data interchange format. Because this chapter is intended to describe fairly well-
defined data formats, I’ve separated out the discussion about HTML to Chapter 18.
JSON
JavaScript Object Notation (JSON) has become a very popular data interchange format, beyond its JavaScript origins. The JSON format is a subset of JavaScript, and often legal Python syntax, as well. Its close fit to Python makes it a good choice for data interchange among programs. You’ll see many examples of JSON for web development in Chapter 18.
Unlike the variety of XML modules, there’s one main JSON module, with the unforgettable name json. This program encodes (dumps) data to a JSON string and decodes (loads) a JSON string back to data. In this next example, let’s build a Python data structure containing the data from the earlier XML example:
import json
menu = \
{
"breakfast": {
"hours": "7-11",
"item": {
"breakfast burritos": "$6.00",
"pancakes": "$4.00"
}
},
"lunch": {
"hours": "11-3",
"item" : {
"hamburger": "$5.00"
}
},
"dinner": {
"hours": "3-10",
"item": {
"spaghetti": "$8.00"
}
}
}
menu_json = json.dumps(menu)
print(menu_json)
menu2 = json.loads(menu_json)
print(menu2)
{"breakfast": {"hours": "7-11", "item": {"breakfast burritos": "$6.00", "pancakes": "$4.00"}}, "lunch": {"hours": "11-3", "item": {"hamburger": "$5.00"}}, "dinner": {"hours": "3-10", "item": {"spaghetti": "$8.00"}}}
And now, let’s turn the JSON string menu_json back into a Python data structure(menu2) by using loads():
menu and menu2 are both dictionaries with the same keys and values.
You might get an exception while trying to encode or decode some objects, including objects such as datetime (covered in detail in Chapter 13), as demonstrated here:
>>> import datetime
>>> import json
>>> now = datetime.datetime.utcnow()
>>> now
datetime.datetime(2013, 2, 22, 3, 49, 27, 483336)
>>> json.dumps(now)
Traceback (most recent call last):
# ... (deleted stack trace to save trees)
TypeError: datetime.datetime(2013, 2, 22, 3, 49, 27, 483336)
is not JSON serializable
>>>
This can happen because the JSON standard does not define date or time types; it expects you to define how to handle them. You could convert the datetime to something JSON understands, such as a string or an epoch value (see Chapter 13):
import json
from time import mktime
import datetime
now = datetime.datetime.now()
now_str = str(now)
date_json = json.dumps(now_str)
print(date_json) # "2023-01-05 10:11:11.949568"
now_epoch = int(mktime(now.timetuple()))
date_json = json.dumps(now_epoch)
print(date_json) # 1672881071
If the datetime value could occur in the middle of normally converted data types, it might be annoying to make these special conversions. You can modify how JSON is encoded by using inheritance, which is described in Chapter 10. Python’s JSON documentation gives an example of this for complex numbers, which also makes JSON play dead. Let’s modify it for datetime:
import json
import datetime
from time import mktime
now = datetime.datetime.utcnow()
class DTEncoder (json.JSONEncoder):
def default(self, obj):
# isinstance() checks the type of obj
if isinstance(obj, datetime.datetime):
return int(mktime(obj.timetuple()))
# else it's something the normal decoder knows:
return json.JSONEncoder.default(self, obj)
a = json.dumps(now, cls=DTEncoder)
print(a) # 1672849185
The new class DTEncoder is a subclass, or child class, of JSONEncoder. We need to override its only default() method to add datetime handling. Inheritance ensures that everything else will be handled by the parent class.
The isinstance() function checks whether the object obj is of the class date time.datetime. Because everything in Python is an object, isinstance() works everywhere:
>>> import datetime
>>> now = datetime.datetime.utcnow()
>>> type(now)
<class 'datetime.datetime'>
>>> isinstance(now, datetime.datetime)
True
>>> type(234)
<class 'int'>
>>> isinstance(234, int)
True
>>> type('hey')
<class 'str'>
>>> isinstance('hey', str)
True
For JSON and other structured text formats, you can load from a file into data structures without knowing anything about the structures ahead of time. Then, you can walk through the structures by using isinstance() and type-appropriate methods to examine their values. For example, if one of the items is a dictionary, you can extract contents through keys(), values(), and items().
After making you do it the hard way, it turns out that there’s an even easier way to convert datetime objects to JSON:
>>> import datetime
>>> import json
>>> now = datetime.datetime.utcnow()
>>> json.dumps(now, default=str)
'"2019-04-17 21:54:43.617337"'
That default=str tells json.dumps() to apply the str() conversion function for data types that it doesn’t understand. This works because the definition of the date time.datetime class includes a __str__()
method.
YAML
Similar to JSON, YAML has keys and values, but handles more data types such as dates and times. The standard Python library does not yet include YAML handling, so you need to install a third-party library named yaml to manipulate it. (((“dump() function”)))(((“load() function”)))load() converts a YAML string to Python data, whereas dump() does the opposite.
The following YAML file, mcintyre.yaml, contains information on the Canadian poet James McIntyre, including two of his poems:
name:
first: James
last: McIntyre
dates:
birth: 1828-05-25
death: 1906-03-31
details:
bearded: true
themes: [cheese, Canada]
books:
url: http://www.gutenberg.org/files/36068/36068-h/36068-h.htm
poems:
- title: 'Motto'
text: |
Politeness, perseverance and pluck,
To their possessor will bring good luck.
- title: 'Canadian Charms'
text: |
Here industry is not in vain,
For we have bounteous crops of grain,
And you behold on every field
Of grass and roots abundant yield,
But after all the greatest charm
Is the snug home upon the farm,
And stone walls now keep cattle warm.
Values such as true, false, on, and off are converted to Python booleans. Integers and strings are converted to their Python equivalents. Other syntax creates lists and dictionaries:
>>> import yaml
>>> with open('mcintyre.yaml', 'rt') as fin:
>>> text = fin.read()
>>> data = yaml.load(text)
>>> data['details']
{'themes': ['cheese', 'Canada'], 'bearded': True}
>>> len(data['poems'])
2
The data structures that are created match those in the YAML file, which in this case are more than one level deep in places. You can get the title of the second poem with this dict/list/dict reference:
>>> data['poems'][1]['title']
'Canadian Charms'
PyYAML can load Python objects from strings, and this is dangerous. Use safe_load() instead of load() if you’re importing YAML that you don’t trust. Better yet, always use safe_load(). Read Ned Batchelder’s blog post “War is Peace” for a description of how unprotected YAML loading compromised the Ruby on Rails platform.
Tablib
After reading all of the previous sections, there’s one third-party package that lets you import, export, and edit tabular data in CSV, JSON, or YAML format,^1 as well as Microsoft Excel, Pandas DataFrame, and a few others. You install it with the familiar refrain (pip install tablib), and peek at the docs.
Pandas
This is as good place as any to introduce pandas—a Python library for structured data. It’s an excellent tool for handling real-life data issues:
Read and write many text and binary file formats:
- Text, with fields separated by commas (CSV), tabs (TSV), or other characters
- Fixed-width text
- Excel
- JSON
- HTML tables
- SQL
- HDF5
- and others.
- Group, split, merge, index, slice, sort, select, label
- Convert data types
- Change size or shape
- Handle missing data
- Generate random values
- Manage time series
The read functions return a DataFrame object, Pandas’ standard representation for two-dimensional data (rows and columns). It’s similar in some ways to a spreadsheet or a relational database table. Its one-dimensional little brother is a Series.
Example 16-2 demonstrates a simple application that reads our villains.csv file from Example 16-1.
Example 16-2. Read CSV with Pandas
>>> import pandas
>>>
>>> data = pandas.read_csv('villains.csv')
>>> print(data)
first last
0 Doctor No
1 Rosa Klebb
2 Mister Big
3 Auric Goldfinger
4 Ernst Blofeld
The variable data just shown is a DataFrame. which has many more tricks than a basic Python dictionary. It’s especially useful for heavy numeric work with NumPy, and data preparation for machine learning.
Refer to the “Getting Started” section of the documentation for Pandas’ features, and “10 Minutes to Pandas” for working examples.
Let’s use Pandas for a little calendar example—make a list of the first day of the first three months in 2019:
>>> import pandas
>>> dates = pandas.date_range('2019-01-01', periods=3, freq='MS')
>>> dates
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01'],
dtype='datetime64[ns]', freq='MS')
You could write something that does this, using the time and date functions described in Chapter 13, but it would be a lot more work—especially debugging (dates and times are frustrating). Pandas also handles many special date/time details, like business months and years.
Pandas will appear again later when I talk about mapping (“Geopandas” on page 479 ) and scientific applications (“Pandas” on page 498).
Configuration Files
Most programs offer various options or settings. Dynamic ones can be provided as program arguments, but long-lasting ones need to be kept somewhere. The temptation to define your own quick-and-dirty config file format is strong—but resist it. It often turns out to be dirty, but not so quick. You need to maintain both the writer program and the reader program (sometimes called a parser). There are good alternatives that you can just drop into your program, including those in the previous sections.
Here, we’ll use the standard configparser module, which handles Windows-style .ini files. Such files have sections of key = value definitions. Here’s a minimal settings.cfg file:
[english]
greeting = Hello
[french]
greeting = Bonjour
[files]
home = /usr/local
# simple interpolation:
bin = %(home)s/bin
Here’s the code to read it into Python data structures:
>>> import configparser
>>> cfg = configparser.ConfigParser()
>>> cfg.read('settings.cfg')
['settings.cfg']
>>> cfg
<configparser.ConfigParser object at 0x1006be4d0>
>>> cfg['french']
<Section: french>
>>> cfg['french']['greeting']
'Bonjour'
>>> cfg['files']['bin']
'/usr/local/bin'
Other options are available, including fancier interpolation. See the configparser documentation. If you need deeper nesting than two levels, try YAML or JSON.
Binary Files
Some file formats were designed to store particular data structures but are neither
relational nor NoSQL databases. The sections that follow present some of them.
Padded Binary Files and Memory Mapping
These are similar to padded text files, but the contents may be binary, and the pad‐
ding byte may be \x00 instead of a space character. Each record has a fixed size, as
does each field within a record. This makes it simpler to seek() throughout the file
for the desired records and fields. Every operation on the data is manual, so this
approach tends to be used only in very low-level (e.g., close to the hardware)
situations.
Data in this form can be memory mapped with the standard mmap library. See some
examples, and the standard documentation.
Spreadsheets
Spreadsheets, notably Microsoft Excel, are widespread binary data formats. If you can save your spreadsheet to a CSV file, you can read it by using the standard csv module that was described earlier. This will work for a binary xls file, xlrd, or tablib (mentioned earlier at “Tablib” on page 314).
HDF5
HDF5 is a binary data format for multidimensional or hierarchical numeric data. It’s used mainly in science, where fast random access to large datasets (gigabytes to terabytes) is a common requirement. Even though HDF5 could be a good alternative to databases in some cases, for some reason HDF5 is almost unknown in the business world. It’s best suited to WORM (write once/read many) applications for which database protection against conflicting writes is not needed. Here are some modules that you might find useful:
- h5py is a full-featured low-level interface. Read the documentation and code.
- PyTables is a bit higher-level, with database-like features. Read the documentation and code.
Both of these are discussed in terms of scientific applications of Python in Chapter 22. I’m mentioning HDF5 here in case you have a need to store and retrieve large amounts of data and are willing to consider something outside the box as well as the usual database solutions. A good example is the Million Song dataset, which has downloadable song data in HDF5 and SQLite formats.
TileDB
A recent successor to HDF5 for dense or spare array storage is TileDB. Install the
Python interface (which includes the TileDB library itself ) by running pip install
tiledb. This is aimed at scientific data and applications.
Relational Databases
Relational databases are only about 40 years old but are ubiquitous in the computing
world. You’ll almost certainly have to deal with them at one time or another. When
you do, you’ll appreciate what they provide:
- Access to data by multiple simultaneous users
- Protection from corruption by those users
- Efficient methods to store and retrieve the data
- Data defined by schemas and limited by constraints
- Joins to find relationships across diverse types of data
- A declarative (rather than imperative) query language: SQL (Structured Query Language)
These are called relational because they show relationships among different kinds of
data in the form of rectangular tables. For instance, in our menu example earlier,
there is a relationship between each item and its price.
A table is a rectangular grid of columns (data fields) and rows (individual data
records), similar to a spreadsheet. The intersection of a row and column is a table cell.
To create a table, name it and specify the order, names, and types of its columns. Each
row has the same columns, although a column may be defined to allow missing data
(called nulls) in cells. In the menu example, you could create a table with one row for
each item being sold. Each item has the same columns, including one for the price.
A column or group of columns is usually the table’s primary key; its values must be
unique in the table. This prevents adding the same data to the table more than once.
This key is indexed for fast lookups during queries. An index works a little like a book
index, making it fast to find a particular row.
Each table lives within a parent database, like a file within a directory. Two levels of
hierarchy help keep things organized a little better.
Yes, the word database is used in multiple ways: as the server, the
table container, and the data stored therein. If you’ll be referring to
all of them at the same time, it might help to call them database
server, database, and data.
If you want to find rows by some nonkey column value, define a secondary index on
that column. Otherwise, the database server must perform a table scan—a brute-force
search of every row for matching column values.
Tables can be related to each other with foreign keys, and column values can be con‐
strained to these keys.
SQL
SQL is not an API or a protocol, but a declarative language: you say what you want
rather than how to do it. It’s the universal language of relational databases. SQL quer‐
ies are text strings sent by a client to the database server, which in turn figures out
what to do with them.
There have been various SQL standard definitions, and all database vendors have
added their own tweaks and extensions, resulting in many SQL dialects. If you store
your data in a relational database, SQL gives you some portability. Still, dialect and
operational differences can make it difficult to move your data to another type of
database. There are two main categories of SQL statements:
DDL (data definition language)
Handles creation, deletion, constraints, and permissions for tables, databases,
and users.
DML (data manipulation language)
Handles data insertions, selects, updates, and deletions.
Table 16-1 lists the basic SQL DDL commands.
Table 16-1. Basic SQL DDL commands
Operation SQL pattern SQL example
Create a database CREATE DATABASE dbname CREATE DATABASE d
Select current database USE dbname USE d
Delete a database and its tables DROP DATABASE dbname DROP DATABASE d
Create a table CREATE TABLE tbname ( coldefs ) CREATE TABLE t (id INT, count
INT)
Delete a table DROP TABLE tbname DROP TABLE t
Remove all rows from a table TRUNCATE TABLE tbname TRUNCATE TABLE t
Why all the CAPITAL LETTERS? SQL is case-insensitive, but it’s a
tradition (don’t ask me why) to SHOUT its keywords in code
examples to distinguish them from column names.
Relational Databases | 319
The main DML operations of a relational database are often known by the acronym
CRUD:
- Create by using the SQL INSERT statement
- Read by using SELECT
- Update by using UPDATE
- Delete by using DELETE
Table 16-2 looks at the commands available for SQL DML.
Table 16-2. Basic SQL DML commands
Operation SQL pattern SQL example
Add a row INSERT INTO tbname VAL
UES( ... )
INSERT INTO t VALUES(7, 40)
Select all rows and columns SELECT * FROM tbname SELECT * FROM t
Select all rows, some columns SELECT cols FROM tbname SELECT id, count FROM t
Select some rows, some
columns
SELECT cols FROM tbname WHERE
condition
SELECT id, count from t WHERE
count > 5 AND id = 9
Change some rows in a column UPDATE tbname SET col = value
WHERE condition
UPDATE t SET count=3 WHERE id=5
Delete some rows DELETE FROM tbname WHERE
condition
DELETE FROM t WHERE count <= 10
OR id = 16
DB-API
An application programming interface (API) is a set of functions that you can call to
get access to some service. DB-API is Python’s standard API for accessing relational
databases. Using it, you can write a single program that works with multiple kinds of
relational databases instead of writing a separate program for each one. It’s similar to
Java’s JDBC or Perl’s dbi.
Its main functions are the following:
connect()
Make a connection to the database; this can include arguments such as username,
password, server address, and others.
cursor()
Create a cursor object to manage queries.
execute() and executemany()
Run one or more SQL commands against the database.
fetchone(), fetchmany(), and fetchall()
Get the results from execute().
The Python database modules in the coming sections conform to DB-API, often with
extensions and some differences in details.
SQLite
SQLite is a good, light, open source relational database. It’s implemented as a stan‐
dard Python library, and stores databases in normal files. These files are portable
across machines and operating systems, making SQLite a very portable solution for
simple relational database applications. It isn’t as full featured as MySQL or Post‐
greSQL, but it does support SQL, and it manages multiple simultaneous users. Web
browsers, smartphones, and other applications use SQLite as an embedded database.
You begin with a connect() to the local SQLite database file that you want to use or
create. This file is the equivalent of the directory-like database that parents tables in
other servers. The special string ‘:memory:’ creates the database in memory only;
this is fast and useful for testing but will lose data when your program terminates or if
your computer goes down.
For the next example, let’s make a database called enterprise.db and the table zoo to
manage our thriving roadside petting zoo business. The table columns are as follows:
critter
A variable length string, and our primary key.
count
An integer count of our current inventory for this animal.
damages
The dollar amount of our current losses from animal–human interactions.
>>> import sqlite3
>>> conn = sqlite3.connect('enterprise.db')
>>> curs = conn.cursor()
>>> curs.execute('''CREATE TABLE zoo
(critter VARCHAR(20) PRIMARY KEY,
count INT,
damages FLOAT)''')
<sqlite3.Cursor object at 0x1006a22d0>
Python’s triple quotes are handy when creating long strings such as SQL queries.
Now, add some animals to the zoo:
>>> curs.execute('INSERT INTO zoo VALUES("duck", 5, 0.0)')
<sqlite3.Cursor object at 0x1006a22d0>
>>> curs.execute('INSERT INTO zoo VALUES("bear", 2, 1000.0)')
<sqlite3.Cursor object at 0x1006a22d0>
Relational Databases | 321
There’s a safer way to insert data, using a placeholder:
>>> ins = 'INSERT INTO zoo (critter, count, damages) VALUES(?, ?, ?)'
>>> curs.execute(ins, ('weasel', 1, 2000.0))
<sqlite3.Cursor object at 0x1006a22d0>
This time, we used three question marks in the SQL to indicate that we plan to insert
three values, and then pass those three values as a tuple to the execute() function.
Placeholders handle tedious details such as quoting. They protect you against SQL
injection, a kind of external attack that inserts malicious SQL commands into the sys‐
tem (and which is common on the web).
Now, let’s see if we can get all our animals out again:
>>> curs.execute('SELECT * FROM zoo')
<sqlite3.Cursor object at 0x1006a22d0>
>>> rows = curs.fetchall()
>>> print (rows)
[('duck', 5, 0.0), ('bear', 2, 1000.0), ('weasel', 1, 2000.0)]
Let’s get them again, but ordered by their counts:
>>> curs.execute('SELECT * from zoo ORDER BY count')
<sqlite3.Cursor object at 0x1006a22d0>
>>> curs.fetchall()
[('weasel', 1, 2000.0), ('bear', 2, 1000.0), ('duck', 5, 0.0)]
Hey, we wanted them in descending order:
>>> curs.execute('SELECT * from zoo ORDER BY count DESC')
<sqlite3.Cursor object at 0x1006a22d0>
>>> curs.fetchall()
[('duck', 5, 0.0), ('bear', 2, 1000.0), ('weasel', 1, 2000.0)]
Which type of animal is costing us the most?
>>> curs.execute('''SELECT * FROM zoo WHERE
... damages = (SELECT MAX(damages) FROM zoo)''')
<sqlite3.Cursor object at 0x1006a22d0>
>>> curs.fetchall()
[('weasel', 1, 2000.0)]
You would have thought it was the bears. It’s always best to check the actual data.
Before we leave SQLite, we need to clean up. If we opened a connection and a cursor,
we need to close them when we’re done:
>>> curs.close()
>>> conn.close()
MySQL
MySQL is a very popular open source relational database. Unlike SQLite, it’s an actual
server, so clients can access it from different devices across the network.
Table 16-3 lists the drivers you can use to access MySQL from Python. For more
details on all Python MySQL drivers, see the python.org wiki.
Table 16-3. MySQL drivers
Name Link Pypi package Import as Notes
mysqlclient https://https://
mysqlclient.readthedocs.io
mysql-connector-
python
MySQLdb
MySQL
Connector
http://bit.ly/mysql-cpdg mysql-connector-
python
mysql.connec
tor
PYMySQL https://github.com/petehunt/PyMySQL pymysql pymysql
oursql http://pythonhosted.org/oursql oursql oursql Requires the
MySQL C client
libraries
PostgreSQL
PostgreSQL is a full-featured open source relational database. Indeed in many ways,
it’s more advanced than MySQL. Table 16-4 presents the Python drivers you can use
to access it.
Table 16-4. PostgreSQL drivers
Name Link Pypi package Import as Notes
psycopg2 http://initd.org/psycopg psycopg2 psycopg2 Needs pg_config from PostgreSQL client
tools
py-postgresql https://pypi.org/project/py-
postgresql
py-postgresql postgresql
The most popular driver is psycopg2, but its installation requires the PostgreSQL cli‐
ent libraries.
SQLAlchemy
SQL is not quite the same for all relational databases, and DB-API takes you only so
far. Each database implements a particular dialect reflecting its features and philoso‐
Relational Databases | 323
phy. Many libraries try to bridge these differences in one way or another. The most
popular cross-database Python library is SQLAlchemy.
It isn’t in the standard library, but it’s well known and used by many people. You can
install it on your system by using this command:
$ pip install sqlalchemy
You can use SQLAlchemy on several levels:
- The lowest level manages database connection pools, executes SQL commands, and returns results. This is closest to the DB-API.
- Next up is the SQL expression language, which lets you express queries in a more Python-oriented way.
- Highest is the ORM (Object Relational Model) layer, which uses the SQL Expres‐ sion Language and binds application code with relational data structures.
As we go along, you’ll understand what the terms mean in those levels. SQLAlchemy
works with the database drivers documented in the previous sections. You don’t need
to import the driver; the initial connection string you provide to SQLAlchemy will
determine it. That string looks like this:
dialect + driver :// user : password @ host : port / dbname
The values you put in this string are as follows:
dialect
The database type
driver
The particular driver you want to use for that database
user and password
Your database authentication strings
host and port
The database server’s location (: port is needed only if it’s not the standard one
for this server)
dbname
The database to initially connect to on the server
Table 16-5 lists the dialects and drivers.
Table 16-5. SQLAlchemy connection
dialect driver
sqlite pysqlite (or omit)
mysql mysqlconnector
mysql pymysql
mysql oursql
postgresql psycopg2
postgresql pypostgresql
See also the SQLAlchemy details on dialects for MySQL, SQLite, PostgreSQL, and
other databases.
The engine layer
First, let’s try the lowest level of SQLAlchemy, which does little more than the base
DB-API functions.
Let’s try it with SQLite, which is already built in to Python. The connection string for
SQLite skips the host , port , user , and password. The dbname tells SQLite what file to
use to store your database. If you omit the dbname , SQLite builds a database in mem‐
ory. If the dbname starts with a slash (/), it’s an absolute filename on your computer
(as in Linux and macOS; for example, C:\ on Windows). Otherwise, it’s relative to
your current directory.
The following segments are all part of one program, separated here for explanation.
To begin, you need to import what we need. The following is an example of an import
alias, which lets us use the string sa to refer to SQLAlchemy methods. I do this
mainly because sa is a lot easier to type than sqlalchemy:
>>> import sqlalchemy as sa
Connect to the database and create the storage for it in memory (the argument string
‘sqlite:///:memory:’ also works):
>>> conn = sa.create_engine('sqlite://')
Create a database table called zoo that comprises three columns:
>>> conn.execute('''CREATE TABLE zoo
... (critter VARCHAR(20) PRIMARY KEY,
... count INT,
... damages FLOAT)''')
<sqlalchemy.engine.result.ResultProxy object at 0x1017efb10>
Relational Databases | 325
Running conn.execute() returns a SQLAlchemy object called a ResultProxy. You’ll
soon see what to do with it.
By the way, if you’ve never made a database table before, congratulations. Check that
one off your bucket list.
Now, insert three sets of data into your new empty table:
>>> ins = 'INSERT INTO zoo (critter, count, damages) VALUES (?, ?, ?)'
>>> conn.execute(ins, 'duck', 10, 0.0)
<sqlalchemy.engine.result.ResultProxy object at 0x1017efb50>
>>> conn.execute(ins, 'bear', 2, 1000.0)
<sqlalchemy.engine.result.ResultProxy object at 0x1017ef090>
>>> conn.execute(ins, 'weasel', 1, 2000.0)
<sqlalchemy.engine.result.ResultProxy object at 0x1017ef450>
Next, ask the database for everything that we just put in:
>>> rows = conn.execute('SELECT * FROM zoo')
In SQLAlchemy, rows is not a list; it’s that special ResultProxy thing that we can’t
print directly:
>>> print (rows)
<sqlalchemy.engine.result.ResultProxy object at 0x1017ef9d0>
However, you can iterate over it like a list, so we can get a row at a time:
>>> for row in rows:
... print (row)
...
('duck', 10, 0.0)
('bear', 2, 1000.0)
('weasel', 1, 2000.0)
That was almost the same as the SQLite DB-API example that you saw earlier. The
one advantage is that we didn’t need to import the database driver at the top; SQL‐
Alchemy figured that out from the connection string. Just changing the connection
string would make this code portable to another type of database. Another plus is
SQLAlchemy’s connection pooling, which you can read about at its documentation
site.
The SQL Expression Language
The next level up is SQLAlchemy’s SQL Expression Language. It introduces functions
to create the SQL for various operations. The Expression Language handles more of
the SQL dialect differences than the lower-level engine layer does. It can be a handy
middle-ground approach for relational database applications.
Here’s how to create and populate the zoo table. Again, these are successive fragments
of a single program.
The import and connection are the same as before:
>>> import sqlalchemy as sa
>>> conn = sa.create_engine('sqlite://')
To define the zoo table, we begin using some of the Expression Language instead of
SQL:
>>> meta = sa.MetaData()
>>> zoo = sa.Table('zoo', meta,
... sa.Column('critter', sa.String, primary_key=True),
... sa.Column('count', sa.Integer),
... sa.Column('damages', sa.Float)
... )
>>> meta.create_all(conn)
Check out the parentheses in that multiline call in the preceding example. The struc‐
ture of the Table() method matches the structure of the table. Just as our table con‐
tains three columns, there are three calls to Column() inside the parentheses of the
Table() method call.
Meanwhile, zoo is some magic object that bridges the SQL database world and the
Python data structure world.
Insert the data with more Expression Language functions:
... conn.execute(zoo.insert(('bear', 2, 1000.0)))
<sqlalchemy.engine.result.ResultProxy object at 0x1017ea910>
>>> conn.execute(zoo.insert(('weasel', 1, 2000.0)))
<sqlalchemy.engine.result.ResultProxy object at 0x1017eab10>
>>> conn.execute(zoo.insert(('duck', 10, 0)))
<sqlalchemy.engine.result.ResultProxy object at 0x1017eac50>
Next, create the SELECT statement (zoo.select() selects everything from the table
represented by the zoo object, such as SELECT * FROM zoo would do in plain SQL):
>>> result = conn.execute(zoo.select())
Finally, get the results:
>>> rows = result.fetchall()
>>> print (rows)
[('bear', 2, 1000.0), ('weasel', 1, 2000.0), ('duck', 10, 0.0)]
The Object-Relational Mapper (ORM)
In the previous section, the zoo object was a mid-level connection between SQL and
Python. At the top layer of SQLAlchemy, the Object-Relational Mapper (ORM) uses
the SQL Expression Language but tries to make the actual database mechanisms
invisible. You define classes, and the ORM handles how to get their data in and out of
the database. The basic idea behind that complicated phrase, “object-relational
Relational Databases | 327
mapper,” is that you can refer to objects in your code and thus stay close to the way
Python likes to operate while still using a relational database.
We’ll define a Zoo class and hook it into the ORM. This time, we make SQLite use the
file zoo.db so that we can confirm that the ORM worked.
As in the previous two sections, the snippets that follow are actually one program
separated by explanations. Don’t worry if you don’t understand some if it. The SQL‐
Alchemy documentation has all the details—and this stuff can get complex. I just
want you to get an idea of how much work it is to do this, so that you can decide
which of the approaches discussed in this chapter suits you.
The initial import is the same, but this time we need another something also:
>>> import sqlalchemy as sa
>>> from sqlalchemy.ext.declarative import declarative_base
Here, we make the connection:
>>> conn = sa.create_engine('sqlite:///zoo.db')
Now, we get into SQLAlchemy’s ORM. We define the Zoo class and associate its
attributes with table columns:
>>> Base = declarative_base()
>>> class Zoo (Base):
... __tablename__ = 'zoo'
... critter = sa.Column('critter', sa.String, primary_key=True)
... count = sa.Column('count', sa.Integer)
... damages = sa.Column('damages', sa.Float)
... def __init__(self, critter, count, damages):
... self.critter = critter
... self.count = count
... self.damages = damages
... def __repr__(self):
... return "<Zoo({}, {}, {})>".format(self.critter, self.count,
... self.damages)
The following line magically creates the database and table:
>>> Base.metadata.create_all(conn)
You can then insert data by creating Python objects. The ORM manages these
internally:
>>> first = Zoo('duck', 10, 0.0)
>>> second = Zoo('bear', 2, 1000.0)
>>> third = Zoo('weasel', 1, 2000.0)
>>> first
<Zoo(duck, 10, 0.0)>
Next, we get the ORM to take us to SQL land. We create a session to talk to the
database:
>>> from sqlalchemy.orm import sessionmaker
>>> Session = sessionmaker(bind=conn)
>>> session = Session()
Within the session, we write the three objects that we created to the database. The
add() function adds one object, and add_all() adds a list:
>>> session.add(first)
>>> session.add_all([second, third])
Finally, we need to force everything to complete:
>>> session.commit()
Did it work? Well, it created a zoo.db file in the current directory. You can use the
command-line sqlite3 program to check it:
$ sqlite3 zoo.db
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
zoo
sqlite> select * from zoo;
duck|10|0.0
bear|2|1000.0
weasel|1|2000.0
The purpose of this section was to show what an ORM is and how it works at a high
level. The author of SQLAlchemy has written a full tutorial. After reading this, decide
which of the following levels would best fit your needs:
- Plain DB-API, as in the earlier SQLite section
- The SQLAlchemy engine
- The SQLAlchemy Expression Language
- The SQLAlchemy ORM
It seems like a natural choice to use an ORM to avoid the complexities of SQL. Should
you use one? Some people think ORMs should be avoided, but others think the criti‐
cism is overdone. Whoever’s right, an ORM is an abstraction, and all abstractions are
leaky and break down at some point. When your ORM doesn’t do what you want, you
must figure out both how it works and how to fix it in SQL. To borrow an internet
meme:
Some people, when confronted with a problem, think, “I know, I’ll use an ORM.” Now
they have two problems.
Relational Databases | 329
Use ORMs for simple applications or applications that map data pretty directly to
database tables. If the application is that simple, you may consider using straight SQL
or the SQL Expression Language.
Other Database Access Packages
If you’re looking for Python tools that will handle multiple databases, with more fea‐
tures than the bare db-api but less than SQLAlchemy, these are worth a look:
- dataset claims the goal “databases for lazy people”. It’s built on SQLAlchemy and provides a simple ORM for SQL, JSON, and CSV storage.
- records bills itself as “SQL for Humans.” It supports only SQL queries, using SQLAlchemy internally to handle SQL dialect issues, connection pooling, and other details. Its integration with tablib (mentioned at “Tablib” on page 314) lets you export data to CSV, JSON, and other formats.
NoSQL Data Stores
Relational tables are rectangular, but data come in many shapes and may be very dif‐
ficult to fit without significant effort and contortion. It’s a square peg/round hole
problem.
Some nonrelational databases have been written to allow more flexible data defini‐
tions as well as to process very large data sets or support custom data operations.
They’ve been collectively labeled NoSQL (formerly meaning no SQL, now the less
confrontational not only SQL).
The simplest type of NoSQL databases are key-value stores. One popularity ranking
shows some that I cover in the following sections.
The dbm Family
The dbm formats were around long before the NoSQL label was coined. They’re simple
key-value stores, often embedded in applications such as web browsers to maintain
various settings. A dbm database is like a Python dictionary in the following ways:
- You can assign a value to a key, and it’s automatically saved to the database on disk.
- You can query a key for its value.
The following is a quick example. The second argument to the following open()
method is ‘r’ to read, ‘w’ to write, and ‘c’ for both, creating the file if it doesn’t
exist:
>>> import dbm
>>> db = dbm.open('definitions', 'c')
To create key-value pairs, just assign a value to a key just as you would a dictionary:
>>> db['mustard'] = 'yellow'
>>> db['ketchup'] = 'red'
>>> db['pesto'] = 'green'
Let’s pause and check what we have so far:
>>> len(db)
3
>>> db['pesto']
b'green'
Now close, then reopen to see whether it actually saved what we gave it:
>>> db.close()
>>> db = dbm.open('definitions', 'r')
>>> db['mustard']
b'yellow'
Keys and values are stored as bytes. You cannot iterate over the database object db,
but you can get the number of keys by using len(). get() and setdefault() work as
they do for dictionaries.
Memcached
memcached is a fast in-memory key-value cache server. It’s often put in front of a data‐
base, or used to store web server session data.
You can download versions for Linux and macOS as well as Windows. If you want to
try out this section, you’ll need a running memcached server and Python driver.
There are many Python drivers; one that works with Python 3 is python3-memcached,
which you can install by using this command:
$ pip install python-memcached
To use it, connect to a memcached server, after which you can do the following:
- Set and get values for keys
- Increment or decrement a value
- Delete a key
Data keys and values are not persistent, and data that you wrote earlier might disap‐
pear. This is inherent in memcached—it’s a cache server, not a database, and it avoids
running out of memory by discarding old data.
NoSQL Data Stores | 331
You can connect to multiple memcached servers at the same time. In this next exam‐
ple, we’re just talking to one on the same computer:
>>> import memcache
>>> db = memcache.Client(['127.0.0.1:11211'])
>>> db.set('marco', 'polo')
True
>>> db.get('marco')
'polo'
>>> db.set('ducks', 0)
True
>>> db.get('ducks')
0
>>> db.incr('ducks', 2)
2
>>> db.get('ducks')
2
Redis
Redis is a data structure server. It handles keys and their values, but the values are
richer than those in other key-value stores. Like memcached, all of the data in a Redis
server should fit in memory. Unlike memcached, Redis can do the following:
- Save data to disk for reliability and restarts
- Keep old data
- Provide more data structures than simple strings
The Redis data types are a close match to Python’s, and a Redis server can be a useful
intermediary for one or more Python applications to share data. I’ve found it so use‐
ful that it’s worth a little extra coverage here.
The Python driver redis-py has its source code and tests on GitHub as well as docu‐
mentation. You can install it by using this command:
$ pip install redis
The Redis server has good documentation. If you install and start the Redis server on
your local computer (with the network nickname localhost), you can try the pro‐
grams in the following sections.
Strings
A key with a single value is a Redis string. Simple Python data types are automatically
converted. Connect to a Redis server at some host (default is localhost) and port
(default is 6379 ):
>>> import redis
>>> conn = redis.Redis()
Connecting to redis.Redis(‘localhost’) or redis.Redis(‘localhost’, 6379)
would have given the same result.
List all keys (none so far):
>>> conn.keys('*')
[]
Set a simple string (key ‘secret’), integer (key ‘carats’), and float (key ‘fever’):
>>> conn.set('secret', 'ni!')
True
>>> conn.set('carats', 24)
True
>>> conn.set('fever', '101.5')
True
Get the values back (as Python byte values) by key:
>>> conn.get('secret')
b'ni!'
>>> conn.get('carats')
b'24'
>>> conn.get('fever')
b'101.5'
Here, the setnx() method sets a value only if the key does not exist:
>>> conn.setnx('secret', 'icky-icky-icky-ptang-zoop-boing!')
False
It failed because we had already defined ‘secret’:
>>> conn.get('secret')
b'ni!'
The getset() method returns the old value and sets it to a new one at the same time:
>>> conn.getset('secret', 'icky-icky-icky-ptang-zoop-boing!')
b'ni!'
Let’s not get too far ahead of ourselves. Did it work?
>>> conn.get('secret')
b'icky-icky-icky-ptang-zoop-boing!'
Now, get a substring by using getrange() (as in Python, offset 0 means start, and -1
means end):
>>> conn.getrange('secret', -6, -1)
b'boing!'
Replace a substring by using setrange() (using a zero-based offset):
>>> conn.setrange('secret', 0, 'ICKY')
32
NoSQL Data Stores | 333
>>> conn.get('secret')
b'ICKY-icky-icky-ptang-zoop-boing!'
Next, set multiple keys at once by using mset():
>>> conn.mset({'pie': 'cherry', 'cordial': 'sherry'})
True
Get more than one value at once by using mget():
>>> conn.mget(['fever', 'carats'])
[b'101.5', b'24']
Delete a key by using delete():
>>> conn.delete('fever')
True
Increment by using the incr() or incrbyfloat() commands, and decrement with
decr():
>>> conn.incr('carats')
25
>>> conn.incr('carats', 10)
35
>>> conn.decr('carats')
34
>>> conn.decr('carats', 15)
19
>>> conn.set('fever', '101.5')
True
>>> conn.incrbyfloat('fever')
102.5
>>> conn.incrbyfloat('fever', 0.5)
103.0
There’s no decrbyfloat(). Use a negative increment to reduce the fever:
>>> conn.incrbyfloat('fever', -2.0)
101.0
Lists
Redis lists can contain only strings. The list is created when you do your first inser‐
tion. Insert at the beginning by using lpush():
>>> conn.lpush('zoo', 'bear')
1
Insert more than one item at the beginning:
>>> conn.lpush('zoo', 'alligator', 'duck')
3
Insert before or after a value by using linsert():
>>> conn.linsert('zoo', 'before', 'bear', 'beaver')
4
>>> conn.linsert('zoo', 'after', 'bear', 'cassowary')
5
Insert at an offset by using lset() (the list must exist already):
>>> conn.lset('zoo', 2, 'marmoset')
True
Insert at the end by using rpush():
>>> conn.rpush('zoo', 'yak')
6
Get the value at an offset by using lindex():
>>> conn.lindex('zoo', 3)
b'bear'
Get the values in an offset range by using lrange() (0 to -1 for all):
>>> conn.lrange('zoo', 0, 2)
[b'duck', b'alligator', b'marmoset']
Trim the list with ltrim(), keeping only those in a range of offsets:
>>> conn.ltrim('zoo', 1, 4)
True
Get a range of values (use 0 to -1 for all) by using lrange():
>>> conn.lrange('zoo', 0, -1)
[b'alligator', b'marmoset', b'bear', b'cassowary']
Chapter 15 shows you how you can use Redis lists and publish-subscribe to implement
job queues.
NoSQL Data Stores | 335
Hashes
Redis hashes are similar to Python dictionaries but can contain only strings. Also, you
can go only one level deep, not make deep-nested structures. Here are examples that
create and play with a Redis hash called song:
Set the fields do and re in hash song at once by using hmset():
>>> conn.hmset('song', {'do': 'a deer', 're': 'about a deer'})
True
Set a single field value in a hash by using hset():
>>> conn.hset('song', 'mi', 'a note to follow re')
1
Get one field’s value by using hget():
>>> conn.hget('song', 'mi')
b'a note to follow re'
Get multiple field values by using hmget():
>>> conn.hmget('song', 're', 'do')
[b'about a deer', b'a deer']
Get all field keys for the hash by using hkeys():
>>> conn.hkeys('song')
[b'do', b're', b'mi']
Get all field values for the hash by using hvals():
>>> conn.hvals('song')
[b'a deer', b'about a deer', b'a note to follow re']
Get the number of fields in the hash by using hlen():
>>> conn.hlen('song')
3
Get all field keys and values in the hash by using hgetall():
>>> conn.hgetall('song')
{b'do': b'a deer', b're': b'about a deer', b'mi': b'a note to follow re'}
Set a field if its key doesn’t exist by using hsetnx():
>>> conn.hsetnx('song', 'fa', 'a note that rhymes with la')
1
Sets
Redis sets are similar to Python sets, as you’ll see in the following examples.
Add one or more values to a set:
>>> conn.sadd('zoo', 'duck', 'goat', 'turkey')
3
Get the number of values from the set:
>>> conn.scard('zoo')
3
Get all of the set’s values:
>>> conn.smembers('zoo')
{b'duck', b'goat', b'turkey'}
Remove a value from the set:
>>> conn.srem('zoo', 'turkey')
True
Let’s make a second set to show some set operations:
>>> conn.sadd('better_zoo', 'tiger', 'wolf', 'duck')
0
Intersect (get the common members of ) the zoo and better_zoo sets:
>>> conn.sinter('zoo', 'better_zoo')
{b'duck'}
Get the intersection of zoo and better_zoo, and store the result in the set fowl_zoo:
>>> conn.sinterstore('fowl_zoo', 'zoo', 'better_zoo')
1
Who’s in there?
>>> conn.smembers('fowl_zoo')
{b'duck'}
Get the union (all members) of zoo and better_zoo:
>>> conn.sunion('zoo', 'better_zoo')
{b'duck', b'goat', b'wolf', b'tiger'}
Store that union result in the set fabulous_zoo:
>>> conn.sunionstore('fabulous_zoo', 'zoo', 'better_zoo')
4
>>> conn.smembers('fabulous_zoo')
{b'duck', b'goat', b'wolf', b'tiger'}
What does zoo have that better_zoo doesn’t? Use sdiff() to get the set difference,
and sdiffstore() to save it in the zoo_sale set:
>>> conn.sdiff('zoo', 'better_zoo')
{b'goat'}
>>> conn.sdiffstore('zoo_sale', 'zoo', 'better_zoo')
NoSQL Data Stores | 337
1
>>> conn.smembers('zoo_sale')
{b'goat'}
Sorted sets
One of the most versatile Redis data types is the sorted set, or zset. It’s a set of unique
values, but each value has an associated floating-point score. You can access each item
by its value or score. Sorted sets have many uses:
- Leader boards
- Secondary indexes
- Timeseries, using timestamps as scores
We show the last use case, tracking user logins via timestamps. We’re using the Unix
epoch value (more on this in Chapter 15) that’s returned by the Python time()
function:
>>> import time
>>> now = time.time()
>>> now
1361857057.576483
Let’s add our first guest, looking nervous:
>>> conn.zadd('logins', 'smeagol', now)
1
Five minutes later, another guest:
>>> conn.zadd('logins', 'sauron', now+(5*60))
1
Two hours later:
>>> conn.zadd('logins', 'bilbo', now+(2*60*60))
1
One day later, not hasty:
>>> conn.zadd('logins', 'treebeard', now+(24*60*60))
1
In what order did bilbo arrive?
>>> conn.zrank('logins', 'bilbo')
2
When was that?
>>> conn.zscore('logins', 'bilbo')
1361864257.576483
Let’s see everyone in login order:
>>> conn.zrange('logins', 0, -1)
[b'smeagol', b'sauron', b'bilbo', b'treebeard']
With their times, please:
>>> conn.zrange('logins', 0, -1, withscores=True)
[(b'smeagol', 1361857057.576483), (b'sauron', 1361857357.576483),
(b'bilbo', 1361864257.576483), (b'treebeard', 1361943457.576483)]
Caches and expiration
All Redis keys have a time-to-live, or expiration date. By default, this is forever. We
can use the expire() function to instruct Redis how long to keep the key. The value
is a number of seconds:
>>> import time
>>> key = 'now you see it'
>>> conn.set(key, 'but not for long')
True
>>> conn.expire(key, 5)
True
>>> conn.ttl(key)
5
>>> conn.get(key)
b'but not for long'
>>> time.sleep(6)
>>> conn.get(key)
>>>
The expireat() command expires a key at a given epoch time. Key expiration is use‐
ful to keep caches fresh and to limit login sessions. An analogy: in the refrigerated
room behind the milk racks at your grocery store, store employees yank those gallons
as they reach their freshness dates.
Document Databases
A document database is a NoSQL database that stores data with varying fields. Com‐
pared to a relational table (rectangular, with the same columns in every row) such
data is “ragged,” with varying fields (columns) per row, and even nested fields. You
could handle data like this in memory with Python dictionaries and lists, or store it as
JSON files. To store such data in a relational database table, you would need to define
every possible column and use nulls for missing data.
ODM can stand for Object Data Manager or Object Document Mapper (at least they
agree on the O part). An ODM is the document database counterpart of a relational
database ORM. Some popular document databases and tools (drivers and ODMs) are
listed in Table 16-6.
NoSQL Data Stores | 339
Table 16-6. Document databases
Database Python API
Mongo tools
DynamoDB boto3
CouchDB couchdb
PostgreSQL can do some of the things that document databases do.
Some of its extensions allow it to escape relational orthodoxy while
keeping features like transactions, data validation, and foreign keys:
1) multidimensional arrays—store more than one value in a table
cell; 2) jsonb—store JSON data in a cell, with full indexing and
querying.
Time Series Databases
Time series data may be collected at fixed intervals (such as computer performance
metrics) or at random times, which has led to many storage methods. Among many
of these, some with Python support are listed in Table 16-7.
Table 16-7. Temporal databases
Database Python API
InfluxDB influx-client
kdb+ PyQ
Prometheus prometheus_client
TimescaleDB (PostgreSQL clients)
OpenTSDB potsdb
PyStore PyStore
Graph Databases
For our last case of data that need its own database category, we have graphs: nodes
(data) connected by edges or vertices (relationships). An individual Twitter user could
be a node, with edges to other users like following and followed.
Graph data has become more visible with the growth of social media, where the value
is in the connections as much as the content. Some popular graph databases are out‐
lined in Table 16-8.
Table 16-8. Graph databases
Database Python API
Neo4J py2neo
Database Python API
OrientDB pyorient
ArangoDB pyArango
Other NoSQL
The NoSQL servers listed here handle data larger than memory, and many of them
use multiple computers. Table 16-9 presents notable servers and their Python
libraries.
Table 16-9. NoSQL databases
Database Python API
Cassandra pycassa
CouchDB couchdb-python
HBase happybase
Kyoto Cabinet kyotocabinet
MongoDB mongodb
Pilosa python-pilosa
Riak riak-python-client
Full-Text Databases
Finally, there’s a special category of databases for full-text search. They index every‐
thing, so you can find that poem that talks about windmills and giant wheels of
cheese. You can see some popular open source examples along with their Python
APIs in Table 16-10.
Table 16-10. Full-text databases
Site Python API
Lucene pylucene
Solr SolPython
ElasticSearch elasticsearch
Sphinx sphinxapi
Xapian xappy
Whoosh (written in Python, includes an API)
Full-Text Databases | 341
Coming Up
The previous chapter was about interleaving code in time (concurrency). The next one
is about moving data through space (networking), which can be used for concurrency
and other reasons.
Things to Do
16.1 Save the following text lines to a file called books.csv (notice that if the fields are
separated by commas, you need to surround a field with quotes if it contains a
comma):
author,book
J R R Tolkien,The Hobbit
Lynne Truss,"Eats, Shoots & Leaves"
16.2 Use the csv module and its DictReader method to read books.csv to the variable
books. Print the values in books. Did DictReader handle the quotes and commas in
the second book’s title?
16.3 Create a CSV file called books2.csv by using these lines:
title,author,year
The Weirdstone of Brisingamen,Alan Garner,1960
Perdido Street Station,China Miéville,2000
Thud!,Terry Pratchett,2005
The Spellman Files,Lisa Lutz,2007
Small Gods,Terry Pratchett,1992
16.4 Use the sqlite3 module to create a SQLite database called books.db and a table
called books with these fields: title (text), author (text), and year (integer).
16.5 Read books2.csv and insert its data into the book table.
16.6 Select and print the title column from the book table in alphabetical order.
16.7 Select and print all columns from the book table in order of publication.
16.8 Use the sqlalchemy module to connect to the sqlite3 database books.db that you
just made in exercise 16.4. As in 16.6, select and print the title column from the
book table in alphabetical order.
16.9 Install the Redis server and the Python redis library (pip install redis) on
your computer. Create a Redis hash called test with the fields count ( 1 ) and name
(‘Fester Bestertester’). Print all the fields for test.
16.10 Increment the count field of test and print it.