This is a collection of scripts to generate machine learning training images using OpenStreetMap data and satellite imagery providers. The scripts scan through certain objects in a specified area (defined by a bounding box) and generate a .tar file with images, labeled by folder names. Upload that to your NVidia farm and train.
I tend to use the Maxar layer because it's recent, crisp, and shot at an almost vertical angle. Results at other locations may vary. You can add new imagery providers to layers.py.
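The details depend on how layers.py is structured, but conceptually a provider is just a slippy-map tile URL template. A hypothetical sketch (the `LAYERS` dict, the `tile_url()` helper, and the URLs below are illustrative placeholders, not the real file contents):

```python
# Hypothetical sketch of an imagery provider entry -- the actual layers.py
# may be structured differently. A provider is a {z}/{x}/{y} URL template.
LAYERS = {
    "maxar": "https://example.com/maxar/{z}/{x}/{y}.jpg",        # placeholder URL
    "my_provider": "https://tiles.example.org/{z}/{x}/{y}.png",  # placeholder URL
}

def tile_url(layer, z, x, y):
    """Build the request URL for a tile of the given layer."""
    return LAYERS[layer].format(z=z, x=x, y=y)
```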
The training uses the resnet34 architecture with the fast.ai library. Every table below lists the error_rate metric, which is simply the percentage of incorrect predictions on the validation set. If you get a high error rate, fast.ai provides a great debugging tool, top_losses(). It shows you the images that confuse the network the most; in practice these usually turn out to be unmapped areas or outdated imagery (or a bug in the data collection script).
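With the fast.ai v1 API used throughout this document, getting those images takes a couple of lines (a minimal sketch, assuming a trained `learn` object like the one created below):

```python
# Inspect the worst predictions after training; `learn` is the trained cnn_learner.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(11, 11))  # 9 highest-loss validation images
interp.plot_confusion_matrix()               # where the classes get mixed up
```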
Task: Classify imagery tiles with streetlamps vs. no streetlamps.
Satellite imagery providers serve data in 256x256 tiles. The first approach is to simply fetch a tile that contains a lamp and use it as a positive example. Every tile that does not contain a high-level road (highway=tertiary and up) is used as a negative example.
The problem is that some lamps sit at a tile edge and may cross the boundary. Sometimes imagery offset makes the object appear in a different tile than it should, which produces a false positive example.
What I did is drop all positive examples where the base of the street lamp is less than 16px away from the tile edge. Dealing with offset is tricky, as it depends both on imagery properties and on OSM mappers in the area. I eyeballed an average offset for my area, but for other cities it can be anything. You can use video.py to do that; it lets you look through lots of imagery quickly.
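For reference, here is a minimal sketch of that edge check, assuming standard Web Mercator (slippy map) tile math; this is an illustration, not the repo's actual script:

```python
import math

TILE_SIZE = 256
EDGE_MARGIN = 16  # px; positives closer than this to a tile edge are dropped

def lamp_pixel_in_tile(lat, lon, zoom):
    """Tile indices and the pixel position of a point inside its 256x256 tile."""
    lat_rad = math.radians(lat)
    n = 2 ** zoom
    fx = (lon + 180.0) / 360.0 * n
    fy = (1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n
    tile_x, tile_y = int(fx), int(fy)
    px = int((fx - tile_x) * TILE_SIZE)
    py = int((fy - tile_y) * TILE_SIZE)
    return tile_x, tile_y, px, py

def far_enough_from_edge(px, py, margin=EDGE_MARGIN):
    return margin <= px <= TILE_SIZE - margin and margin <= py <= TILE_SIZE - margin
```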
```python
from fastai.vision import *
from fastai.metrics import error_rate

path = '...'  # path to the folder produced by the data collection script
tfms = get_transforms(do_flip=False, max_warp=0, max_zoom=0, max_rotate=0)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.1, ds_tfms=tfms, size=224)
learn = cnn_learner(data, models.resnet34, metrics=error_rate)

# LR is the learning rate, use learn.lr_find() to estimate it
learn.fit_one_cycle(4, max_lr=7e-2)
learn.unfreeze()

# slice() spreads learning rates from the deepest layers (low) to the top layers (high)
learn.fit_one_cycle(4, max_lr=slice(2e-6,2e-4))
learn.fit_one_cycle(4, max_lr=slice(4e-7,4e-5))
```
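The comment above suggests estimating the learning rate with learn.lr_find(); the usual fast.ai v1 recipe is:

```python
# Run the LR finder and pick a rate a bit before the loss starts to explode.
learn.lr_find()
learn.recorder.plot()  # plots loss vs. learning rate
```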
This dataset converges to around 3-4% error. There must be another way to do this!
| Zoom | Frozen, 4 epoch train | Unfreeze, 4 epochs | 4 more |
|---|---|---|---|
| z18 | 3.6-4.0% | 3.4-4.0% | 3.2-3.6% |
| z19 | 4.6-5.3% | 4.2-4.7% | 4.1-4.7% |
The best thing about satellite imagery is that it's huge and can be scrolled in every direction almost indefinitely. If you need more information about a location, you can always look at adjacent tiles. This method fetches a larger square of N pixels around every known streetlamp; these get randomly cropped later in the training process. Negative examples are expanded too, for consistency.
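A rough sketch of what fetching such a square involves, assuming Web Mercator tiles; fetch_tile() is a hypothetical helper returning one tile as a PIL image, and lamp_pixel_in_tile() is the function from the sketch above:

```python
from PIL import Image

def fetch_square(lat, lon, zoom, n, fetch_tile):
    """Return an n x n pixel image centred on (lat, lon), stitched from tiles."""
    tx, ty, px, py = lamp_pixel_in_tile(lat, lon, zoom)
    gx, gy = tx * TILE_SIZE + px, ty * TILE_SIZE + py   # global pixel coordinates
    left, top = gx - n // 2, gy - n // 2
    x0, y0 = left // TILE_SIZE, top // TILE_SIZE        # tiles touched by the crop box
    x1, y1 = (left + n) // TILE_SIZE, (top + n) // TILE_SIZE
    canvas = Image.new("RGB", ((x1 - x0 + 1) * TILE_SIZE, (y1 - y0 + 1) * TILE_SIZE))
    for x in range(x0, x1 + 1):
        for y in range(y0, y1 + 1):
            canvas.paste(fetch_tile(zoom, x, y), ((x - x0) * TILE_SIZE, (y - y0) * TILE_SIZE))
    off_x, off_y = left - x0 * TILE_SIZE, top - y0 * TILE_SIZE
    return canvas.crop((off_x, off_y, off_x + n, off_y + n))
```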
```python
# warning! read below
tfms = get_transforms(do_flip=False, max_warp=0, max_zoom=0, max_rotate=0)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2, ds_tfms=tfms, size=256)
```
Results:
| Fetch size | Frozen, 1 epoch train | Unfreeze, 2 epochs | 2 more epochs |
|---|---|---|---|
| 356px | 3.7% | 3.9% | 3.7% |
| 432px | 3.7% | 3.6% | 3.3% |
The thing is, when I looked closer, all these runs had the input resized to 256px, not cropped, which means no augmentation at all. Both training and validation lamps sit in the middle of the image, so the convnet is only trained to detect a single dark line in the center, which is trivial. I would say 3% error is too much for that task.
Cropping the training set seems important for real-life applications, because input at inference time will contain streetlamps in any part of a tile, not just the center. The intuition is that the network should see as much variance in the training data as possible. The validation set still contains the original tiles. Random cropping turned out to be tricky in fast.ai; here's what I came up with:
```python
# crop a random 256x256 piece out of each training image;
# validation images are not transformed
tfms = [[crop(size=256, row_pct=(0,1), col_pct=(0,1))], []]
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.1, ds_tfms=tfms, size=None)
```
The problem is, the performance is now worse than before:
| Fetch size | Frozen, 1 epoch train | Unfreeze, 2 epochs | 2 more epochs |
|---|---|---|---|
| 256px | 4.8% | 3.9% | 3.8% |
| 356px | 5.7% | 4.2% | 3.9% |
| 432px | 7.3% | 5.5% | 4.7% |
| 512px | 8.3% | 6.4% | 7.1% |
Just some mysterious stuff I discovered by accident.
```python
tfms = get_transforms(do_flip=False, max_warp=0, max_zoom=0, max_rotate=0)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2, ds_tfms=tfms, size=None)
```
No resize and no crop leads to exceptional results:
| Fetch size | Frozen, 1 epoch train | Unfreeze, 2 epochs | 2 more epochs |
|---|---|---|---|
| 354px | 1.0-2.2% | 2.5% | 2.4% |
| 356px | 0.3-0.8% | 0.25-0.35% | 0.2-0.3% |
| 356px | 0.5-0.6% | 0.3% | 0.3% |
| 358px | 1.0-1.5% | 0.8% | 0.8% |
One specific input size is doing suspiciously well. 0.2% over 2000 validation images means only 4 incorrect guesses! But that is pretty much expected for a task of detecting a dark vertical line. I generated another batch of images of the same size, and it also works wonders. But why does every other input image size perform so much worse?
I have no idea what's going on, but the effect disappears with any other size parameter value. Even though the training time with size=None suggests it uses 299px images, size=299 does not behave the same way. It looks like the resize is handled in some special way behind the scenes; I need to dig into the fast.ai code to find the root cause.
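One way to see what actually reaches the model is to pull a batch out of the DataBunch and check its shape (fast.ai v1):

```python
# What does the model actually see? Check the tensor shape of one training batch.
x, y = data.one_batch()
print(x.shape)                      # e.g. torch.Size([64, 3, 299, 299])
print(data.train_ds[0][0].shape)    # shape of a single (transformed) training image
```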
Task: classify roof:shape=gabled, hipped, or flat.
Input: varies (see table). As expected, the more the better!
| Input size | Frozen, 8 epoch train | Unfreeze, 8 epochs | 8 more epochs |
|---|---|---|---|
| 712/712/712 | 6.3% | 5.6% | 5.1% |
| 715/1469/4262 | 5.6% | 4.9% | 4.6% |
| | 5.9% | 4.9% | 4.3% |
| 882/1403/4022 | 4.2% | 2.8% | 3.0% |
| | 4.3% | 3.6% | 3.5% |
| | 4.9% | 3.2% | 3.0% |
Task: classify tiles with any type of building(s) vs. tiles with no buildings at all.
For every node of every building, fetch the tile it belongs to. Don't use the tile if the node is closer than 16px to the edge (it would show only a tiny part of the building). The script also accepts an "excluded areas" input for places where satellite imagery is outdated, to minimize false training examples.
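A minimal sketch of such a filter, assuming excluded areas are given as (south, west, north, east) bounding boxes; the function name and box format are assumptions, not the script's actual interface:

```python
# Drop a training example if its coordinates fall inside any excluded bounding box.
def in_excluded_area(lat, lon, excluded_boxes):
    return any(south <= lat <= north and west <= lon <= east
               for (south, west, north, east) in excluded_boxes)

# hypothetical usage: one box around a new district the imagery hasn't caught up with
excluded = [(50.01, 14.30, 50.03, 14.35)]
keep_example = not in_excluded_area(50.02, 14.32, excluded)
```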
Input: 5700 images of both categories, 20% validation.
| Transforms | Frozen, 4 epoch train | Unfreeze, 4 epochs |
|---|---|---|
| size=256, no flip, max_rotate=0 | | |
| size=256, no flip, max_rotate=20 | | |
| size=256, no flip, max_rotate=40 | 2.1% | 1.9% |
Input: 10000 images of both categories, 20% validation.
| Transforms | Frozen, 4 epoch train | Unfreeze, 4 epochs |
|---|---|---|
| size=256, no flip, max_rotate=0 | | |
| size=256, no flip, max_rotate=20 | | |
| size=256, no flip, max_rotate=40 | 1.3% | 1.3% |
| | 2.0% | 2.0% |
| | 1.9% | 1.7% |
| | 2.0% | 1.9% |
| | 1.8% | 1.6% |