Board/Minutes/2020-03-F2F/Dialogue with OSM System Administrators

From OpenStreetMap Foundation

Draft.

Post-meeting notes

Goals of board for discussion

  • Show appreciation for sysadmins' work.
  • Find out how the board can support sysadmins.

Tiles

Current situation

  • We have 30-40 tile cache nodes, traditionally donated by third parties.
  • There's an order of magnitude difference between how well a good tile cache performs and how well a poor tile cache performs.
  • Different regions perpetually overloaded.
  • Events such as the London Marathon, where the organisers unexpectedly used the tiles, generate a huge amount of traffic.
  • tile.openstreetmap.org - peaked at 0.4 million requests/sec - in many companies that would mean an on-call rotation.
  • The architecture handles operational load well enough that we can frequently ignore things, but we probably do not have enough people.
  • Traffic has to be constantly monitored and rebalanced across the available resources. We have something in place to try to do this automatically but it doesn't work brilliantly.
  • There's perpetually work to do on it and there is always revision work to do on it.
  • Architecture is not well designed - it definitely could use more input - but the problem is that it's an exponential thing: more and more work to run better.

Traffic:

  • ~ 30%: osm.org and direct friends.
  • ~ 70%: other people all over the world. Estate agents, companies tracking their buses for anyone that visits their website (e.g. Greyhound buses).

On contacting people using our tiles

  • Time consuming and painful process. Each takes ~ 20 min to find them, draft the email, ask them to stop using, follow up a few days later and reply.

To overhaul the architecture would require additional bodies/competent people. That would be an additional workload but in the longer term it would alleviate the workload.

Challenge: making the nodes behave in a stable way and fail over cleanly when there are drop-outs.

Squid

  • Squid 4 is not really performing as well as we'd like, relative to the Squid 2 system that we had before.
  • Upstream is hopeless: we have opened several bugs, but we don't really get any kind of feedback.
  • There are alternatives (e.g. Varnish) but it's a significant amount of work to test some of these, prove that they work and potentially migrate across to them.
  • Some experimentation is needed to find a solution that is more scalable, and probably some additional monitoring software will have to be written.
  • There are a number of things that we could improve architecturally, but they require time and effort.

Suggestions for tile service

Are we able to continue to do this for everyone in the world?

Split tile service to two different services

  • One for osm.org and immediate friends, locked down and managed by sysadmins.
  • Move everyone else to the second, pick a group of people and ask to run it (commercial provider or volunteers). Give access and ask to find a solution rather than us finding a solution or managing it.

On implementing and managing API keys

  • API keys are a burden to manage, because you have to monitor constantly and have a whole system in place for counting API key usage.
  • That is beyond the amount of work that our sysadmins could handle. They would prefer another group to manage this.
  • There is the question of how the system would work in the first place.

On differentiating between OSM and friends and everybody else

  • Service would be denied by default and only certain URLs or applications would be allowed, on a list basis.
  • It will block random websites that embed an OSM map.

On spoofing

  • Offline use apps/non-browser programs are in a position to spoof and perhaps will. There's probably less of those than websites that embed maps.
  • Could be simple to spoof.
  • In the past we have blocked some applications and they just spoofed browsers or the wrong URL.
  • There is a management burden regardless of what we do.

Kicking abusers of policy and implementing restrictions

  • Tile usage policy created a few years ago. Possible to kick people off if they're using too much and put different restrictions about faking apps. Has helped to a degree but not massively.
  • Trying to block referrers and user agents is not going to be a plausible way of attacking the problem, even if we have a policy that allows us to do so.

Suggestions by sysadmins

Grant:
1) Decide who are our friends who are allowed to use tiles, then
2) Amend our usage policy over time to say that this is only for those friends, then
3) Get help from people like Paul to start enforcing the policy change.

Tom:

  • API keys would be more sensible.
  • Doesn't know on technical level how to implement API keys with our architecture.

Decision: Have regular video-meetings every 2 weeks open to people who want to help, especially for onboarding. The board to help organise and facilitate meetings. Meetings to have an agenda and volunteers on the call to start thinking about problems that we have, and then point them to testing framework. Board members cannot chair the meetings.

API keys related

Technical challenge of showing architecture and asking to make the API keys for it:

  • Working out how to scale out so that our network of 40 or so nodes can query the API key database at a sufficient rate without impacting tile serving, given the amount of usage.
  • Hard to find the right person that we can ask to solve this as a contractor.
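
One common way to avoid querying a central database on every tile request is for each node to count locally and report in batches. The sketch below assumes that approach; it was not discussed in the meeting, and the class `LocalKeyCounter`, its `flush` callback and the `batch_size` value are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Callable, Dict


class LocalKeyCounter:
    """Counts tile requests per API key in memory on one cache node."""

    def __init__(self, flush: Callable[[Dict[str, int]], None],
                 batch_size: int = 1000) -> None:
        self._flush = flush            # sends a batch of counts upstream
        self._batch_size = batch_size  # requests to accumulate before sending
        self._counts: Counter = Counter()
        self._pending = 0

    def record(self, api_key: str) -> None:
        """Record one request; contact the central store only once per batch."""
        self._counts[api_key] += 1
        self._pending += 1
        if self._pending >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send accumulated counts upstream and reset the local state."""
        if self._pending:
            self._flush(dict(self._counts))
            self._counts.clear()
            self._pending = 0
```

With a batch size of 1000, a node would contact the central store once per thousand requests instead of once per request. The cost is that usage figures lag behind and any counts still in memory are lost if the node crashes.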

Hiring

  • Neither of our sysadmins is particularly keen on spending time managing contractors.
  • One of them expressed that he doesn't know if hiring would work if we don't have a concrete idea of what we want to do. Contracting someone to do the management of API keys once you have them is a different matter, as it is a simple administrative task.

On hiring the sysadmins

The sysadmins were asked if they would be interested in working for OSMF.

Onboarding

  • Most volunteers are interested in sysadmin work.
  • Failure to engage.
  • We will not have great onboarding until we get more people.
  • Got a lot of responses to OWG call, but 1) they were not really OWG responses and 2) we didn't know how to deal with them.

On access

  • Access is not granular: anyone given access to a system gets access to everything, so the sysadmins are hesitant to hand over the keys to the castle.

Geocoding

  • Situation is worse than the tile service.
  • Only a tiny part of the traffic is ours. The vast majority is third party, e.g. vehicle tracking companies whose vehicles query their current location every second as they drive down the road. That's a service that gets completely overloaded.
  • We can continue throwing hardware at it. For that service we tend to buy the hardware, so we make a request to OSMF to purchase additional hardware.

New testing framework

as a way to engage more people

  • Tom and Michal Migurski put a lot of effort into it.
  • Allows people to get visibility and provide contributions we have more faith in. Encourages volunteers to contribute.
  • We get an immediate idea of whether things work.
  • Test framework is not working reliably yet - GitHub bug.

On GitHub bug

  • We use the free GitHub tool. We moved from Travis to GitHub because we got more parallelism and could run tests faster.
  • Guillaume has asked GitHub if they have a charity account, but never heard back from them.

  • Jobs running on GitHub Actions often fail to check things out from the OSM Git server. Nothing else seems to have a problem checking out from this server.
  • Don't want to build own test.
  • Probably network issue.

Suggestions

  • Get the paid version and then ask for support. Theoretically this should work, but they have big clients who might get priority.
  • Reach out to Harsh Govind (Microsoft) to find out why they have not responded.
  • Stop using OSM.org and use GitHub. Currently the website lives on both (on GitHub most of dev. work - where we deploy for production).
    • Migurski tried. One of the issues was that tags on osm.org are not on GitHub, and it was failing because of that.
  • Least preferred option: use AWS to run tests ourselves. Give up GitHub Actions and Travis.

How people can help

  • Long term planning. Paul spends 5-10 hrs a week for planning.
  • Responding to emails.
  • Account deletion/change of email address (which should be made self-service via website. Andy Allan has been asked but does not have time to help build the tools).

Action Item: Paul to find cases where things get done manually (Paul doesn't get support@ emails via otrs).

Next steps:
Items to be prioritised and then delegated. Have someone develop the tools. Delegating tasks via contact form to other team.

Policy for banners - it took a long time for OWG to reach a decision.

Servers in Europe

UK donated resources:

  • UCL: rack space, but we can't expect to be given certain things (e.g. full network resources).
  • Bytemark: might be asked to leave in a year.
  • Get 2nd data center in Europe? (past LWG suggestion).
  • Move: Cost: 15K/year - better service.

Redundancy in systems

  • We have some, but probably not sufficient.
  • If we get 2nd datacenter in Europe, synchronous databases.
  • Changes in Rails might be required (next version of Rails).

Ironbelly

  • Runs core stuff.
  • Not kept in sync all the time.
  • Manually synchronised.
  • Working on it: partially resolved issue.

Agenda item: The board has asked us to improve redundancy - what to do.

Contractor: could be involved with planet replication.

Other

Need solution for planet downloads (main thing on Ironbelly).

Summary of sysadmins talk

  • Open to paid contractors.
  • Don't want to do management.
  • Want help collaborating with newcomers on GitHub.

Suggestions:

  • Board member to be part of regular OWG meetings.
  • Double-check policy with the sysadmins.
  • Follow up - written email that summarises discussion.
  • J. offered to set up a system for contacting sites using tiles.
  • G. to set up a system similar to the French tile solution.

Decision: Have regular video-meetings every 2 weeks open to people who want to help, especially for onboarding. The board to help organise and facilitate meetings. Meetings to have an agenda and volunteers on the call to start thinking about problems that we have, and then point them to testing framework. Board members cannot chair the meetings.

Action item: Paul to find cases where things get done manually.