21장 Py at Work
Table of contents
- 467
- The Microsoft Office Suite
- Carrying Out Business Tasks
- Processing Business Data
- Open Source Python Business Packages
- Python in Finance
- Business Data Security
- Maps
- Coming Up
- Things to Do
- Py at Work.
- The Microsoft Office Suite
- Carrying Out Business Tasks
- Processing Business Data
- Extracting, Transforming, and Loading
- Data Validation
- Additional Sources of Information - Open Source Python Business Packages - Python in Finance - Business Data Security - Maps
- Formats
- Draw a Map from a Shapefile
- Geopandas
- Other Mapping Packages
- Applications and Data - Coming Up - Things to Do
- Py at Work.
CHAPTER 21 Py at Work
“Business!” cried the Ghost, wringing its hands again. “Mankind was my business...”
—Charles Dickens, A Christmas Carol
The businessman’s uniform is a suit and tie. But before he can get down to business, he
tosses his jacket over a chair, loosens his tie, rolls up his sleeves, and pours some cof‐
fee. Meanwhile, the business woman has already been getting work done. Maybe with
a latte.
In business and government, we use all of the technologies from the earlier chapters
—databases, the web, systems, and networks. Python’s productivity is making it more
popular in the enterprise and with startups.
Organizations have long fought incompatible file formats, arcane network protocols,
language lock-in, and the universal lack of accurate documentation. They can create
faster, cheaper, stretchier applications by using with tools such as these:
- Dynamic languages like Python
- The web as a universal graphical user interface
- RESTful APIs as language-independent service interfaces
- Relational and NoSQL databases
- “Big data” and analytics
- Clouds for deployment and capital savings
467
The Microsoft Office Suite
Business is heavily dependent on Microsoft Office applications and file formats.
Although they are not well known, and in some cases poorly documented, there are
some Python libraries that can help. Here are some that process Microsoft Office
documents:
docx
This library creates, reads, and writes Microsoft Office Word 2007 .docx files.
python-excel
This one discusses the xlrd, xlwt, and xlutils modules via a PDF tutorial. Excel
can also read and write comma-separated values (CSV) files, which you know
how to process by using the standard csv module.
oletools
This library extracts data from Office formats.
OpenOffice is an open source alternative to Office. It runs on Linux, Unix, Windows,
and macOS, and reads and writes Office file formats, It also installs a version of
Python 3 for its own use. You can program OpenOffice in Python with the PyUNO
library.
OpenOffice was owned by Sun Microsystems, and when Oracle acquired Sun, some
people feared for its future availability. LibreOffice was spun off as a result. Docu‐
mentHacker describes using the Python UNO library with LibreOffice.
OpenOffice and LibreOffice had to reverse engineer the Microsoft file formats, which
is not easy. The Universal Office Converter module depends on the UNO library in
OpenOffice or LibreOffice. It can convert many file formats: documents, spread‐
sheets, graphics, and presentations.
If you have a mystery file, python-magic can guess its format by analyzing specific
byte sequences.
The python open document library lets you provide Python code within templates to
create dynamic documents.
Although not a Microsoft format, Adobe’s PDF is very common in business.
ReportLab has open source and commercial versions of its Python-based PDF gener‐
ator. If you need to edit a PDF, you might find some help at StackOverflow.
Carrying Out Business Tasks
You can find a Python module for almost anything. Visit PyPI and type something
into the search box. Many modules are interfaces to the public APIs of various serv‐
ices. You might be interested in some examples related to business tasks:
**468 | Chapter 21: Py at Work** |
- Ship via Fedex or UPS.
- Mail with the stamps.com API.
- Read a discussion of Python for business intelligence.
- If Aeropresses are flying off the shelves in Anoka, was it customer activity or pol‐ tergeists? Cubes is an Online Analytical Processing (OLAP) web server and data browser.
- OpenERP is a large commercial Enterprise Resource Planning (ERP) system written in Python and JavaScript, with thousands of add-on modules.
Processing Business Data
Businesses have a particular fondness for data. Sadly, many of them conjure up per‐
verse ways of making data harder to use.
Spreadsheets were a good invention, and over time businesses became addicted to
them. Many nonprogrammers were tricked into programming because they were
called macros instead of programs. But the universe is expanding and data is trying to
keep up. Older versions of Excel were limited to 65,536 rows, and even newer ver‐
sions choke at a million or so. When an organization’s data outgrow the limits of a
single computer, it’s like headcount growing past a hundred people or so—suddenly
you need new layers, intermediaries, and communication.
Excessive data programs aren’t caused by the size of data on single desktops; rather,
they’re the result of the aggregate of data pouring into the business. Relational data‐
bases handle millions of rows without exploding, but only so many writes or updates
at a time. A plain old text or binary file can grow gigabytes in size, but if you need to
process it all at once, you need enough memory. Traditional desktop software isn’t
designed for all this. Companies such as Google and Amazon had to invent solutions
to handle so much data at scale. Netflix is an example built on Amazon’s AWS cloud,
using Python to glue together RESTful APIs, security, deployment, and databases.
Extracting, Transforming, and Loading
The underwater portions of the data icebergs include all the work to get the data in
the first place. If you speak enterprise, the common term is extract, transform, load,
or ETL. Synonyms such as data munging or data wrangling give the impression of
taming an unruly beast, which might be apt metaphors. This would seem to be a
solved engineering matter by now, but it remains largely an art. I talked a bit about
this in Chapter 12. We address data science more broadly in Chapter 22, because this
is where most developers spend a large part of their time.
Processing Business Data | 469
If you’ve seen The Wizard of Oz, you probably remember (besides the flying mon‐
keys) the part at the end—when the good witch told Dorothy that she could always go
home to Kansas just by clicking her ruby slippers. Even when I was young I thought,
“Now she tells her!” Although, in retrospect, I realize the movie would have been
much shorter if she’d shared that tip earlier.
But this isn’t a movie; we’re talking about the world of business here, where making
tasks shorter is a good thing. So, let me share some tips with you now. Most of the
tools that you need for day-to-day data work in business are those that you’ve already
read about here. Those include high-level data structures such as dictionaries and
objects, thousands of standard and third-party libraries, and an expert community
that’s just a google away.
If you’re a computer programmer working for some business, your workflow almost
always includes the following:
- Extracting data from weird file formats or databases
- “Cleaning up” the data, which covers a lot of ground, all strewn with pointy objects
- Converting things like dates, times, and character sets
- Actually doing something with the data
- Storing resulting data in a file or database
- Rolling back to step 1 again; lather, rinse, repeat
Here’s an example: you want to move data from a spreadsheet to a database. You can
save the spreadsheet in CSV format and use the Python libraries from Chapter 16. Or,
you can look for a module that reads the binary spreadsheet format directly. Your fin‐
gers know how to type python excel into Google and find sites such as Working
with Excel files in Python. You can install one of the packages by using pip, and
locate a Python database driver for the last part of the task. I mentioned SQLAlchemy
and the direct low-level database drivers in that same chapter. Now you need some
code in the middle, and that’s where Python’s data structures and libraries can save
you time.
Let’s try an example here, and then we’ll try again with a library that saves a few steps.
We’ll read a CSV file, aggregate the counts in one column by unique values in
another, and print the results. If we did this in SQL, we would use SELECT, JOIN,
and GROUP BY.
First, the file, zoo.csv, which has columns for the type of animal, how many times it
has bitten a visitor, the number of stitches required, and how much we’ve paid the
visitor not to tell local television stations:
**470 | Chapter 21: Py at Work** |
animal,bites,stitches,hush
bear,1,35,300
marmoset,1,2,250
bear,2,42,500
elk,1,30,100
weasel,4,7,50
duck,2,0,10
We want to see which animal is costing us the most, so we aggregate the total hush
money by the type of animal. (We’ll leave bites and stitches to an intern.) We use the
csv module from “CSV” on page 304 and Counter from “Count Items with
Counter()” on page 209. Save this code as zoo_counts.py:
import csv
from collections import Counter
counts = Counter()
with open('zoo.csv', 'rt') as fin:
cin = csv.reader(fin)
for num, row in enumerate(cin):
if num > 0:
counts[row[0]] += int(row[-1])
for animal, hush in counts.items():
print ("%10s %10s" % (animal, hush))
We skipped the first row because it contained only the column names. counts is a
Counter object, and takes care of initializing the sum for each animal to zero. We also
applied a little formatting to right-align the output. Let’s try it:
$ python zoo_counts.py
duck 10
elk 100
bear 800
weasel 50
marmoset 250
Hah! It was the bear. He was our prime suspect all along, but now we have the
numbers.
Next, let’s replicate this with a data processing toolkit called Bubbles. You can install it
by typing this command:
$ pip install bubbles
It requires SQLAlchemy; if you don’t have that, pip install sqlalchemy will do the
trick. Here’s the test program (call it bubbles1.py), adapted from the documentation:
import bubbles
p = bubbles.Pipeline()
p.source(bubbles.data_object('csv_source', 'zoo.csv', infer_fields=True))
p.aggregate('animal', 'hush')
p.pretty_print()
Processing Business Data | 471
And now, the moment of truth:
$ python bubbles1.py
2014-03-11 19:46:36,806 DEBUG calling aggregate(rows)
2014-03-11 19:46:36,807 INFO called aggregate(rows)
2014-03-11 19:46:36,807 DEBUG calling pretty_print(records)
+--------+--------+------------+
|animal |hush_sum|record_count|
+--------+--------+------------+
|duck | 10| 1|
|weasel | 50| 1|
|bear | 800| 2|
|elk | 100| 1|
|marmoset| 250| 1|
+--------+--------+------------+
2014-03-11 19:46:36,807 INFO called pretty_print(records)
If you read the documentation, you can avoid those debug print lines, and maybe
change the format of the table.
Looking at the two examples, we see that the bubbles example used a single function
call (aggregate) to replace our manual reading and counting of the CSV format.
Depending on your needs, data toolkits can save a lot of work.
In a more realistic example, our zoo file might have thousands of rows (it’s a danger‐
ous place), with misspellings such as bare, commas in numbers, and so on. For good
examples of practical data problems with Python code, I’d also recommend the
following:
- Data Crunching: Solve Everyday Problems Using Java, Python, and More—Greg Wilson (Pragmatic Bookshelf ).
- Automate the Boring Stuff—Al Sweigart (No Starch).
Data cleanup tools can save a lot of time, and Python has many of them. For another
example, PETL does row and column extraction and renaming. Its related work page
lists many useful modules and products. Chapter 22 has detailed discussions of some
especially useful data tools: Pandas, NumPy, and IPython. Although they’re currently
best known among scientists, they’re becoming popular among financial and data
developers. At the 2012 Pydata conference, AppData discussed how these three and
other Python tools help process 15 terabytes of data daily. Python handles very large
real-world data loads.
You may also look back at the data serialization and validation tools discussed in
“Data Serialization” on page 361.
Data Validation
When cleaning up data, you’ll often need to check:
**472 | Chapter 21: Py at Work** |
- Data type, such as integer, float, or string
- Range of values
- Correct values, such as a working phone number or email address
- Duplicates
- Missing data
This is especially common when processing web requests and responses.
Useful Python packages for particular data types include:
- validate_email
- phonenumber
Some useful general tools are:
- validators
- pydantic—For Python 3.6 and above; uses type hints
- marshmallow—Also serializes and deserializes
- cerberus
- many others
Additional Sources of Information
Sometimes, you need data that originates somewhere else. Some business and gov‐
ernment data sources include:
data.gov
A gateway to thousands of data sets and tools. Its APIs are built on CKAN, a
Python data management system.
Opening government with Python
See the video and slides.
python-sunlight
Libraries to access the Sunlight APIs.
froide
A Django-based platform for managing freedom of information requests.
30 places to find open data on the web
Some handy links.
Processing Business Data | 473
Open Source Python Business Packages
Odoo
Extensive ERP platform
Tryton
Another extensive business platform
Oscar
Ecommerce framework for Django
Grid Studio
Python-based spreadsheet, runs locally or in the cloud
Python in Finance
Recently, the financial industry has developed a great interest in Python. Adapting
software from Chapter 22 as well as some of their own, quants are building a new
generation of financial tools:
Quantitative economics
A tool for economic modeling, with lots of math and Python code
Python for finance
Features the book Derivatives Analytics with Python: Data Analytics, Models, Sim‐
ulation, Calibration, and Hedging by Yves Hilpisch (Wiley)
Quantopian
An interactive website on which you can write your own Python code and run it
against historic stock data to see how it would have done
PyAlgoTrade
Another that you can use for stock backtesting, but on your own computer
Quandl
Search millions of financial datasets
Ultra-finance
A real-time stock collection library
Python for Finance (O’Reilly)
A book by Yves Hilpisch with Python examples for financial modeling
Business Data Security
Security is a special concern for business. Entire books are devoted to this topic, so we
just mention a few Python-related tips here.
**474 | Chapter 21: Py at Work** |
- “Scapy” on page 349 discusses scapy, a Python-powered language for packet for‐ ensics. It has been used to explain some major network attacks.
- The Python Security site has discussions of security topics, details on some Python modules, and cheat sheets.
- The book Violent Python (subtitled A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers) by TJ O’Connor (Syngress) is an extensive review of Python and computer security.
Maps
Maps have become valuable to many businesses. Python is very good at making
maps, so we’re going to spend a little more time in this area. Managers love graphics,
and if you can quickly whip up a nice map for your organization’s website it wouldn’t
hurt.
In the early days of the web, I used to visit an experimental mapmaking website at
Xerox. When big sites such as Google Maps came along, they were a revelation (along
the lines of “why didn’t I think of that and make millions?”). Now mapping and
location-based services are everywhere, and are particularly useful in mobile devices.
Many terms overlap here: mapping, cartography, GIS (geographic information sys‐
tem), GPS (Global Positioning System), geospatial analysis, and many more. The blog
at Geospatial Python has an image of the “800-pound gorilla” systems—GDAL/OGR,
GEOS, and PROJ.4 (projections)—and surrounding systems, represented as mon‐
keys. Many of these have Python interfaces. Let’s talk about some of these, beginning
with the simplest formats.
Formats
The mapping world has lots of formats: vector (lines), raster (images), metadata
(words), and various combinations.
Esri, a pioneer of geographic systems, invented the shapefile format over 20 years ago.
A shapefile actually consists of multiple files, including at the very least the following:
.shp
The “shape” (vector) information
.shx
The shape index
.dbf
An attribute database
Maps | 475
Let’s grab a shapefile for our next example—visit the Natural Earth 1:110m Cultural
Vectors page. Under “Admin 1 - States and Provinces,” click the green download
states and provinces box to download a zip file. After it downloads to your computer,
unzip it; you should see these resulting files:
ne_110m_admin_1_states_provinces_shp.README.html
ne_110m_admin_1_states_provinces_shp.sbn
ne_110m_admin_1_states_provinces_shp.VERSION.txt
ne_110m_admin_1_states_provinces_shp.sbx
ne_110m_admin_1_states_provinces_shp.dbf
ne_110m_admin_1_states_provinces_shp.shp
ne_110m_admin_1_states_provinces_shp.prj
ne_110m_admin_1_states_provinces_shp.shx
We’ll use these for our examples.
Draw a Map from a Shapefile
This section is an overly simplified demonstration of reading and displaying a shape‐
file. You’ll see that the result has problems, and that you’d be better off working with a
higher-level mapping package, such as those in the sections that follow.
You’ll need this library to read a shapefile:
$ pip install pyshp
Now for the program, map1.py, which I’ve modified from a Geospatial Python blog
post:
def display_shapefile(name, iwidth=500, iheight=500):
import shapefile
from PIL import Image, ImageDraw
r = shapefile.Reader(name)
mleft, mbottom, mright, mtop = r.bbox
# map units
mwidth = mright - mleft
mheight = mtop - mbottom
# scale map units to image units
hscale = iwidth/mwidth
vscale = iheight/mheight
img = Image.new("RGB", (iwidth, iheight), "white")
draw = ImageDraw.Draw(img)
for shape in r.shapes():
pixels = [
(int(iwidth - ((mright - x) * hscale)), int((mtop - y) * vscale))
for x, y in shape.points]
if shape.shapeType == shapefile.POLYGON:
draw.polygon(pixels, outline='black')
elif shape.shapeType == shapefile.POLYLINE:
draw.line(pixels, fill='black')
img.show()
476 | Chapter 21: Py at Work
if __name__ == '__main__':
import sys
display_shapefile(sys.argv[1], 700, 700)
This reads the shapefile and iterates through its individual shapes. I’m checking for
only two shape types: a polygon, which connects the last point to the first, and a poly‐
line, which doesn’t. I’ve based my logic on the original post and a quick look at the
documentation for pyshp, so I’m not really sure how it will work. Sometimes, we just
need to make a start and deal with any problems as we find them.
So, let’s run it. The argument is the base name of the shapefile files, without any
extension:
$ python map1.py ne_110m_admin_1_states_provinces_shp
You should see something like Figure 21-1.
Well, it drew a map that resembles the United States, but:
- It looks like a cat dragged yarn across Alaska and Hawaii; this is a bug.
- The country is squished; I need a projection.
- The picture isn’t pretty; I need better style control.
Maps | 477
Figure 21-1. Preliminary map
To address the first point: I have a problem somewhere in my logic, but what should I
do? Chapter 19 discusses development tips, including debugging, but we can consider
other options here. I could write some tests and bear down until I fix this, or I could
just try some other mapping library. Maybe something at a higher level would solve
all three of my problems (the stray lines, squished appearance, and primitive style).
As far as I can tell, there is no bare-bones pure Python mapping package. Luckily,
there are some fancier ones, so let’s take a look.
**478 | Chapter 21: Py at Work** |
Geopandas
Geopandas integrates matplotlib, pandas, and other Python libraries into a geospa‐
tial data platform.
The base package is installed with the familiar pip install geopandas, but it relies
on other packages that you will also need to install with pip if you don’t have them
already:
- numpy
- pandas (version 0.23.4 or later)
- shapely (interface to GEOS)
- fiona (interface to GDAL)
- pyproj (interface to PROJ)
- six
Geopandas can read shapefiles (including those from the previous section), and
handily includes two from Natural Earth: country/continent outlines, and national
capital cities. Example 21-1 is a simple demo that uses both of them.
Example 21-1. geopandas.py
import geopandas import matplotlib.pyplot as plt
world_file = geopandas.datasets.get_path(‘naturalearth_lowres’) world = geopandas.read_file(world_file) cities_file = geopandas.datasets.get_path(‘naturalearth_cities’) cities = geopandas.read_file(cities_file) base = world.plot(color=’orchid’) cities.plot(ax=base, color=’black’, markersize=2) plt.show()
Run that, and you should get the map shown in Figure 21-2.
Maps | 479
Figure 21-2. Geopandas map
To me, geopandas currently looks like the best combination for geographic data man‐
agement and display. But there are many worthy contenders, which we look at in the
next section.
Other Mapping Packages
Here’s a grab bag of links to other Python mapping software; many cannot be
installed fully with pip, but some can with conda (the alternative Python package
installer that’s especially handy for scientific software):
pyshp
A pure-Python shapefile library, mentioned earlier in “Draw a Map from a
Shapefile” on page 476.
kartograph
Renders shapefiles into SVG maps on the server or client.
**480 | Chapter 21: Py at Work** |
shapely
Addresses geometric questions such as, “What buildings in this town are within
the 50-year flood contour?”
basemap
Based on matplotlib, draws maps and data overlays. Unfortunately, it has been
deprecated in favor of Cartopy.
cartopy
Succeeds Basemap, and does some of the things that geopandas does.
folium
Works with leaflet.js, used by geopandas.
plotly
Another plotting package that includes mapping features.
dash
Uses Plotly, Flask and JavaScript to create interactive visualizations, including
maps.
fiona
Wraps the OGR library, which handles shapefiles and other vector formats.
Open Street Map
Accesses the vast OpenStreetMap world maps.
mapnik
A C++ library with Python bindings, for vector (line) and raster (image) maps.
Vincent
Translates to Vega, a JavaScript visualization tool; see the tutorial: Mapping data
in Python with pandas and vincent.
Python for ArcGIS
Links to Python resources for Esri’s commercial ArcGIS product.
Using geospatial data with python
Video presentations.
So you’d like to make a map using Python
Uses pandas, matplotlib, shapely, and other Python modules to create maps of
historic plaque locations.
Python Geospatial Development (Packt)
A book by Eric Westra with examples using mapnik and other tools.
Maps | 481
Learning Geospatial Analysis with Python (Packt)
Another book by Joel Lawhead reviewing formats and libraries, with geospatial
algorithms.
geomancer
Geospatial engineering, such as the distance from a point to the nearest Irish bar.
If you’re interested in maps, try downloading and installing one of these packages and
see what you can do. Or, you can avoid installing software and try connecting to a
remote web service API yourself; Chapter 18 shows you how to connect to web
servers and decode JSON responses.
Applications and Data
I’ve been talking about drawing maps, but you can do a lot more with map data. Geo‐
coding converts between addresses and geographic coordinates. There are many geo‐
coding APIs (see ProgrammableWeb’s comparison) and Python libraries:
- geopy
- pygeocoder
- googlemaps
If you sign up with Google or another source to get an API key, you can access other
services such as step-by-step travel directions or local search.
Here are a few sources of mapping data:
http://www.census.gov/geo/maps-data
Overview of the US Census Bureau’s map files
http://www.census.gov/geo/maps-data/data/tiger.html
Heaps of geographic and demographic map data
http://wiki.openstreetmap.org/wiki/Potential_Datasources
Worldwide sources
http://www.naturalearthdata.com
Vector and raster map data at three scales
I should mention the Data Science Toolkit here. It includes free bidirectional geocod‐
ing, coordinates to political boundaries and statistics, and more. You can also down‐
load all the data and software as a virtual machine (VM) and run it self-contained on
your own computer.
**482 | Chapter 21: Py at Work** |
Coming Up
We go to a Science Fair and see all the Python exhibits.
Things to Do
21.1 Install geopandas and run Example 21-1. Try modifying things like colors and
marker sizes.
Coming Up | 483