OpenStreetMap Foundation, Operations Meeting* - Agenda & Minutes
Monday May 4th 2020, 21:00 London time
Location: Video room at https://osmvideo.cloud68.co
* Please note that this was not strictly an OWG meeting.
Present:
- Grant Slater (OWG) (joined ~45 minutes after start)
- Tom Hughes (OWG) (joined ~60 minutes after start)
- Paul Norman (OWG, board)
- Emilie Laffray (OWG)
- Guillaume Rischard (board)
- Hrvoje Bogner (OWG)
- Ian Dees
- Michal Migurski
- Rory McCann (board)
Minutes by Dorothea.
<link to be added>
Commercial CDN for Bulk Tile Users
Fastly has expressed interest in supporting OSM.
(carried over from previous meeting)
- Ian talked to the CFO.
- The initial estimate, based on an NGINX traffic graph, assumed requests per minute rather than requests per second.
- Passed the corrected number along; they said they will look into it.
Open source grant support: Fastly sets an amount of USD credits that is applied to the account monthly; anything above that gets billed to the credit card on file. They then watch usage and try either to bring the billed overage down over time or to increase the credit.
Default offering: a shared SSL certificate. Grant would like to have our own certificate.
They support multiple back ends. We can tell them which region each server is in and they will do the backend balancing. We can also set up back-end health checks, so that if one rendering server goes down, traffic flips over to another one even if it has higher latency.
If we go with Fastly, that would render the question of choosing between Squid and NGINX moot.
- If Fastly is interested in writing the amount off on their taxes as a donation, OSMUS are OK with that option (Ed talked to OSMUS).
- They are interested in "co-branding" (blogpost).
Squid / nginx / GeoDNS? Is the software right? Issues and improvements.
(carried over from previous meeting)
Open Ops Tickets
Review open tickets: what needs a policy decision and what needs someone to help with.
API keys might be necessary to know who our users are and to be able to enforce our tile usage policies.
- One option, implemented at Mapzen: API-key-like checks based on logs. A process ran every 24 hours, grabbed the logs (using AWS Athena, because the logs were in S3), counted the number of requests per API key, and any key found over the daily limit was blocked the next day.
Checking for an API key in the request path would reduce the traffic going to the render servers by quite a lot, and the traffic to the caches somewhat.
Blacklisting is weak against clients that simply use random strings as API keys.
- If we had API keys, it would be nice to have them for Nominatim too.
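The Mapzen-style daily log check described above can be sketched as follows. The log format, field layout, and quota here are illustrative assumptions, not the actual Mapzen or OSMF setup:

```python
# Sketch of a daily per-key quota check over one day's access logs.
# Assumption: each log line ends with the API key as its last field.
from collections import Counter

DAILY_LIMIT = 100_000  # hypothetical per-key daily request quota

def keys_to_block(log_lines, limit=DAILY_LIMIT):
    """Count requests per API key and return the keys over quota."""
    counts = Counter()
    for line in log_lines:
        key = line.rsplit(" ", 1)[-1]
        counts[key] += 1
    return {key for key, n in counts.items() if n > limit}

# Toy example with a limit of 2 requests per day:
logs = [
    "GET /0/0/0.png key-a",
    "GET /1/0/0.png key-a",
    "GET /1/1/0.png key-a",
    "GET /0/0/0.png key-b",
]
print(keys_to_block(logs, limit=2))  # {'key-a'}
```

In the Mapzen version this ran once every 24 hours, with Athena doing the counting over the S3 logs; the blocked set then fed the next day's deny list.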
Other points mentioned during discussion
- Fastly would handle caching better.
- If we changed the TTL for requests coming from unknown hosts to something much longer, in theory we would get fewer requests.
- We still need to be able to block rogue apps that are just scraping the tiles.
- Currently we have no way to contact the users of our service.
- Paul has spent hundreds of hours over the years doing outreach to people by figuring out their referers.
Potential hidden costs of API keys
- Need people to implement them and run user accounts. Might not be possible on a voluntary basis.
- People asking us about quality of service (we already get these questions).
- Influx of emails for support.
- People starting to expect more support.
- Community freaking out that OSMF "is going commercial".
- Whitelisting will require a lot of time spent kicking out abusers.
- People tend to use non-contactable/throw-away email addresses if asked for one during registration.
- Combine blacklisting of people who do not have API keys and whitelisting people who contact us.
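A minimal sketch of that combined approach, with the policy names and outcomes assumed for illustration rather than taken from any agreed OSMF policy:

```python
# Illustrative decision for one tile request: whitelist keys of users who
# contacted us, blacklist known abusers, and rate-limit everything else.
def classify_request(api_key, whitelist, blacklist):
    """Return 'allow', 'block', or 'limit' for a single request."""
    if api_key in blacklist:
        return "block"   # known abuser
    if api_key in whitelist:
        return "allow"   # contactable, registered user
    return "limit"       # no key or unknown key: rate-limit / long TTL

print(classify_request("abc", whitelist={"abc"}, blacklist=set()))  # allow
print(classify_request(None, whitelist={"abc"}, blacklist=set()))   # limit
```

The point of the third bucket is that keyless traffic is neither fully trusted nor outright blocked, which avoids breaking casual users while still containing abusers.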
If we hear positively from Fastly
- Do a test service.
- Try referer-based TTLs.
- Then possibly a region, if we could segment the traffic to that level.
Johan is working on it.
We are hitting the limits of Munin. Prometheus was suggested as an alternative, but it is a lot of work to reach the same monitoring and alerting coverage we have right now.
Mentioned during discussion:
- You have to have a template for everything.
- A lot of work to get it to scale at a reasonable level.
- Google cloud.
Emilie has experience with Prometheus.
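As a rough illustration of the per-check configuration work mentioned above, a minimal Prometheus setup needs a scrape config plus explicit alert rules for each condition Munin currently covers. Hostnames, thresholds, and file layout below are placeholders, not our actual configuration:

```yaml
# prometheus.yml (placeholder targets)
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["render1.example.org:9100", "cache1.example.org:9100"]

# rules.yml: one alert, of the many needed to match current Munin coverage
groups:
  - name: basic
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
```

Each additional metric and alert needs its own rule (and usually a dashboard template), which is where the scaling effort comes from.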
Featured layers call
Some of those who initially expressed interest:
- didn't want to deal with the load.
- weren't confident that their systems would handle the load.
- didn't reply.
Any other business
On fixing the failing CI tests
- Guillaume got it to work by switching to macOS; Ubuntu doesn't work.
- Guillaume contacted GitHub people who were quite interested in the problem.
- Probably not due to TCP options.
Tom compared the TCP options between what we get from them and what he gets from his machine, and they were the same. However, it might be that Macs send different options that don't trigger the problem, or it might be a different data center.
- If we decide to move things to AMS, we have a former wiki server which has a direct upgrade path to more modern machines.
Decision: give GitHub a few days to respond.
- Paul and Tom to create a ticket so that Grant can point to it in his email to UCL.
- Grant to email UCL about the firewall issue. If we don't hear back in 1-2 days, get a server running in AMS [potentially the old wiki server] and move some stuff.