Operations/Minutes/2024-03-21

From OpenStreetMap Foundation

OpenStreetMap Foundation, Operations Meeting - Draft minutes

These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.

Thursday 21 March, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co

Participants

Absent

Minutes by Dorothea Kazazi


New action items from this meeting

  • Paul to document in the ticket https://github.com/openstreetmap/operations/issues/518 that we will go with Znuny [Topic: OTRS]
  • Grant to reply to Equinix, restating the issue in a brief form to them. [Topic: Equinix]
  • OPS to ask the community whether anyone who works for Equinix wants to put us forward to the Equinix foundation as a charitable organisation they'd like to support. [Topic: Equinix]
  • Paul to open an issue about large wiki pages. [Topic: Large wiki pages]

Reportage

Action item 2024-03-07: Grant to open tickets about not forwarding incoming spammy tickets to other email servers, where they get bounced

  • 2024-03-07 Initial wording: "Grant to open a ticket about running our own IMAP server, to describe the problem and sample cases" [Topic: Running our own IMAP server]

Consensus seemed to be that running our own IMAP server is not a good idea.

Issue: we're being blacklisted as spammers, and this is related to the following:

  • we're accepting and forwarding genuine spam,
  • we don't rewrite headers, so forwarded mail fails SPF (Sender Policy Framework) checks, and
  • our systems are seeing spam come from ourselves.
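
The SPF failure on forwarded mail can be illustrated with a simplified sketch. SPF compares the IP address of the connecting server with the addresses published by the sender's domain, so a forwarder that leaves the sender unchanged connects from an address that domain never listed. The domains, addresses and the `spf_pass` helper below are hypothetical, the sketch assumes the "headers" in question are the envelope sender, and it only models `ip4:` mechanisms rather than the full SPF specification.

```python
import ipaddress

# Hypothetical published SPF records (only ip4: mechanisms are modelled).
SPF_RECORDS = {
    "example.org": "v=spf1 ip4:192.0.2.0/24 -all",
    "forwarder.example": "v=spf1 ip4:198.51.100.0/24 -all",
}


def spf_pass(envelope_sender: str, connecting_ip: str) -> bool:
    """Return True if connecting_ip is listed by the envelope sender's domain."""
    domain = envelope_sender.rsplit("@", 1)[1]
    record = SPF_RECORDS.get(domain, "")
    ip = ipaddress.ip_address(connecting_ip)
    for term in record.split():
        if term.startswith("ip4:") and ip in ipaddress.ip_network(term[4:]):
            return True
    return False


# Direct delivery: the sender's own server connects.
direct = spf_pass("alice@example.org", "192.0.2.10")
# Forwarded without rewriting: the forwarder connects, sender unchanged.
forwarded = spf_pass("alice@example.org", "198.51.100.7")
# Forwarded with the envelope sender rewritten to the forwarder's domain.
rewritten = spf_pass("bounce@forwarder.example", "198.51.100.7")

print(direct, forwarded, rewritten)
```

In this model only the unrewritten forward fails the check, which matches the problem described in the list above.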

Suggestions

  • Give members of working groups and other Foundation bodies an @osmfoundation.org address.
  • OTRS: stop including the content of messages in the email notifications about new OTRS messages.
    • This will not work for DWG, as they sometimes reply by email.

Other points mentioned during discussion

  • SPF - has been addressed.
  • we shouldn't be forwarding certain messages via email, they should be injected into APIs.

Decision: Open multiple tickets for the different issues. There is already one for the forwarding aspect but not for the incoming spam and OTRS aspect.

Action item 2024-03-07: Grant and Guillaume to discuss further about the redundancy of gateway / IPv6 "private subnet". Benefit / "Cost"

  • 2024-03-07 Grant and Guillaume to discuss further about the redundancy of gateway / IPv6 "private subnet". Benefit / "Cost" [Topic: Redundancy of Gateway / IPv6 "private subnet"]

There was a long discussion.

  • UCL is the blocker.
  • A lot of the traffic that we pass over VPN could now be encrypted over TLS.
  • The gateway usually runs on the oldest machine in the rack.
    • Not high priority but could go wrong.

Suggestions

  • Create an issue and possibly park it.
  • Drop this issue.

In favour of dropping this issue

  • Work required: There are multiple changes required in Chef/monitoring/networking stack, for a marginal improvement.
  • Network management: We would move some of the network management out of software control into our network layer stack, which is not ideal.
  • Prioritisation: There are more important tasks.
  • We have data on the reliability of the current set-up and it has not failed in over a decade.
  • Having out of band access that works only via IPv6, with unknown reliability, is a blocker.
  • If we can't use IPv6 on our out of band network because of some of the out of band hardware that we have, then we'd have to run another parallel network.
  • Preference to spend time on Nginx and different backends rather than on designing IPv6, which may not work with some of our equipment.
  • Not convinced there is a problem.

In favour of keeping this issue

  • It is a single point of failure, even if it hasn't failed in the past.
  • Presenting the issues we have helps with fundraising.

Other points mentioned during discussion

  • If what we're trying to achieve cannot be described in a GitHub ticket, then the issue is not actionable.
  • A solution was presented and the way it should be implemented was dictated, causing annoyance.
  • Conflicting suggestions on addressing the issue: attempts to raise the issue for discussion were redirected to creating a GitHub ticket, while proposals to create a ticket were met with advice to discuss it first.

OTRS

https://github.com/openstreetmap/operations/issues/518

Grant will show results of his tests

Options:

  • OTOBO
  • Znuny

OTOBO

  • Docker-crazy, doesn't support podman.
  • Introduces new features, including search and elastic indexing.
  • Only supports their own installation method and their containers, like Discourse does.
  • Work:
    • Requires manual work, as it uses a different DSL.
    • We would have to convert their docker compose to Podman pods. Adding podman support to Chef will be helpful elsewhere.
    • We would need time investment every time we need an upgrade, like we do with Mediawiki (new and default plugins/configuration/variables).
  • Concern: There is danger of closing their repositories and making it commercial.
  • Is at version 10.1 and has patch releases.
  • You have to run Elasticsearch and probably Memcached.

Znuny

  • Features stay close to the original.
  • Migration:
    • it requires stepped migrations between versions.
    • is a bit more painful, one-off sunk cost, but packaged in Debian.
  • Implementation: Stays close to the original. Requires stepped migration. Upgrade is smooth. Looks and feels like OTRS.
  • Debian package doesn't seem to be updated much, but this might be because the maintainer hasn't bumped the version yet. Supports version 6.5 (7 is out).
  • Search: Doesn't use Elasticsearch - pretty close to OTRS.
    • New search plugin API, which relies on Elasticsearch. Has many options, not very well described, English documentation.
  • They have a long term supported version.
  • Wikimedia and a lot of open source groups moved from OTRS to Znuny.

OTRS

  • We have used it for over a decade.
  • Most OTRS functionality seems to be currently done in plugins.
  • We use it pretty much like any small ticketing system, with some templates for responses, groups and queues.

Consensus seemed to be: go with Znuny, as it seems better, with a straightforward upgrade path.

Other points mentioned during discussion

  • It's a one-way decision.
  • Anything to ask the users?

Action item: Paul to document in the ticket that we will go with Znuny.


DDOS attack

An individual emailed us saying that they had found a security vulnerability and requested Bitcoins, while DDOSing us from 5000-6000 IP addresses, probably exploited servers. A few other people received an identical email a month ago. A UK mobile number was listed in the email.

Measures taken for the DDOS attack

1. Tom put in mod_evasive for high request rates. It worked, but we accidentally blocked some mappers. It is now reasonably tuned, after raising the threshold of requests per second.
2. If a client continues with high request rates, fail2ban picks it up after a certain period and blocks them for a longer period.

Requests blocked

  • Most are abusive.
  • Minority: Mapping parties or people behind a single IP address, so further fine-tuning of mod_evasive might be needed.
    • Data: 2 such requests today (1 with Rapid).

mod_evasive

  • Third-party package, heavily unmaintained.
  • Is either on for all requests or for none.
  • Can specify number of requests per time frame.
  • Cannot set up multiple time periods.
  • Has an allow list feature ("DDOS white list"), which we don't use at the moment. We could add IP addresses to it; these are not counted, so their counter never overflows.
    • Unmaintainable long term.
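
The behaviour described above (a single request threshold per time frame, a temporary block, and an allow list that bypasses the counter) can be sketched roughly as follows. This is an illustrative model only, not mod_evasive's actual implementation; the class name, parameters and values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WindowLimiter:
    """Per-IP fixed-window request counter with a temporary block."""
    max_requests: int          # requests allowed per window
    window: float              # window length in seconds
    block_period: float        # how long an offender stays blocked
    allow_list: set = field(default_factory=set)
    _windows: dict = field(default_factory=dict)   # ip -> (window_start, count)
    _blocked: dict = field(default_factory=dict)   # ip -> blocked_until

    def allow(self, ip: str, now: float) -> bool:
        if ip in self.allow_list:
            return True                    # never counted, never blocked
        if now < self._blocked.get(ip, 0.0):
            return False                   # still serving a block
        start, count = self._windows.get(ip, (now, 0))
        if now - start >= self.window:
            start, count = now, 0          # start a fresh window
        count += 1
        self._windows[ip] = (start, count)
        if count > self.max_requests:
            self._blocked[ip] = now + self.block_period
            return False
        return True


limiter = WindowLimiter(max_requests=3, window=1.0, block_period=60.0,
                        allow_list={"203.0.113.5"})
# Five quick requests from an ordinary address and from an allow-listed one.
results = [limiter.allow("198.51.100.1", now=0.1 * i) for i in range(5)]
party = [limiter.allow("203.0.113.5", now=0.1 * i) for i in range(5)]
after_block = limiter.allow("198.51.100.1", now=100.0)
print(results, party, after_block)
```

In this model a mapping party behind one address would only avoid blocks if that address were added to the allow list by hand, which is the maintenance burden noted above.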

Alternatives to using mod_evasive

  • Mod-security, in which we can define policies for fine-grained access control.
    • Issue: it's complicated.
  • Nginx.
    • Has a fairly good, granular rate limiter built into it: can rate limit per any URL or a combination of headers and URLs.
    • Tom has been looking into it.
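
A limiter keyed on a combination of request attributes, as described for Nginx above, might look like the following sketch. It models a token bucket per (client IP, URL group) pair; the key choice, the rates and all names are hypothetical, and this is not a description of Nginx's own implementation or configuration.

```python
class KeyedTokenBucket:
    """Token bucket per key; the key can combine any request attributes."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate        # tokens refilled per second
        self.burst = burst      # bucket capacity
        self._state = {}        # key -> (tokens, last_seen)

    def allow(self, key, now: float) -> bool:
        tokens, last = self._state.get(key, (float(self.burst), now))
        # Refill according to elapsed time, capped at the bucket capacity.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._state[key] = (tokens, now)
        return allowed


def request_key(ip: str, path: str) -> tuple:
    """Hypothetical key: limit API calls separately from everything else."""
    zone = "api" if path.startswith("/api/") else "other"
    return (ip, zone)


bucket = KeyedTokenBucket(rate=1.0, burst=2)
api = [bucket.allow(request_key("198.51.100.1", "/api/0.6/map"), now=0.0)
       for _ in range(3)]
web = bucket.allow(request_key("198.51.100.1", "/about"), now=0.0)
later = bucket.allow(request_key("198.51.100.1", "/api/0.6/map"), now=1.0)
print(api, web, later)
```

Because each key has its own bucket, a burst against one group of URLs does not consume the allowance for other pages from the same address, which is the granularity that a single global threshold lacks.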

Suggestion: Replace mod_evasive.

Other points mentioned during discussion

  • iD: gives a retry button - if you click it immediately, you get blocked for another 60 sec.
  • Wondering if there is a bug in iD that causes many requests.
  • Hard to work out the limit.
  • mod_evasive is picking up genuine abuse.
  • If the high-requests rate continues, then fail2ban picks it up and blocks you for a longer period.

Equinix

Suggestions

  • Restate the issue to the salesperson in a brief form.

Other points mentioned during discussion

  • We were informed in November and the price increase was effective in January.
  • Equinix Foundation is managing donations, either monetary or providing staff time, to charitable organisations.

Other options

  • Grant emailed the Equinix Foundation and received a response with some pointers.
  • Contact the salesperson and ask them to approach Equinix Foundation.
  • Ask the community whether anyone who works for Equinix wants to put us forward to the Equinix foundation as a charitable organisation they'd like to support.
    • Probably the best option.

Action items:

  • Grant to reply to Equinix, restating the issue in a brief form to them.
  • OPS to ask the community whether anyone who works for Equinix wants to put us forward to the Equinix foundation as a charitable organisation they'd like to support.

Large wiki pages

Issues

  • The length of big OSM wiki pages pushes the limits of what Mediawiki can handle.
  • We occasionally get blocked from Wiki Commons, as sometimes our traffic is considered abusive, causing indexing scripts to fail.
  • People listing thousands of relations on a single page.

About Wiki Commons

We've enabled Wiki Commons, which allows us to easily embed images from Wiki Commons into our wiki.

  • Front end: Visiting the Map Features page generates 150+ requests to Wiki Commons, for downloading images stored there.
  • Back end: Certain indexing tasks call out to Wiki Commons.

Map features page
https://wiki.openstreetmap.org/wiki/Map_features

  • Calls a plethora of templated functions, ~ 1000.
  • Whenever there's a wiki problem, it seems to be related to map features.
  • It probably is not a good experience to be shown this big page, e.g. to new people at a mapping party.
  • Takes several seconds to load (20 during the meeting).
  • Pressing the "edit" button (not the "edit source" button) will crash your browser, and there is a hard-coded limit of 30 sec in the current version of Mediawiki.
  • ~ 5000 transcludes in.

Suggestions

  • Put limits on the page (e.g. requiring it to be broken into subcategories) or on the number of embeds or their size.
  • Put a front-end cache in front of all Mediawiki instances.
    • We had one, but it became complex to manage and we removed it.
    • Caching works best for logged-out users; as soon as you log in, you are cached per user.
  • Add plugins that do caching.
    • They are semi-supported.
  • Produce a static HTML page, instead of having the contents in a CMS system.
  • Open an issue.

Other points mentioned during discussion

  • Mediawiki caching is for an infinite amount of time. Mediawiki is sending purges to clear it. Works for their architecture but only for them.
  • There is a function/category to show long pages.
  • Not a good experience to be shown such a long page at a first mapping party.
    • Being shown a big list of things that can be mapped, helps new mappers understand that OSM is not only about mapping streets.
  • Map features: Opening "edit" (not "edit source") will crash the browser. 20 sec to load. Hard-coded limit is 30 sec.

There was a plan to move it to the Semantic wiki.

Action item: Paul to open an issue about large wiki pages. Post meeting addition: https://github.com/openstreetmap/operations/issues/1046


Faffy status

https://hardware.openstreetmap.org/servers/faffy.openstreetmap.org/

Development and “tool” server, HPE ProLiant DL360 Gen10.

Current status

  • querying any of the individual disks we get multiple gigs per second.
  • querying MDRAID we get less than a megabyte per second.
  • disk usage was ~ 99%.
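
A comparison like the one above could be reproduced with a small sequential-read timer. The sketch below is only an assumption about how such numbers might be gathered, not the method used in the meeting; the device paths named in the comments are hypothetical, and because it reads through the page cache, repeated runs on cached data will look faster than the hardware really is.

```python
import os
import tempfile
import time


def read_throughput(path: str, block_size: int = 1024 * 1024,
                    max_bytes: int = 256 * 1024 * 1024) -> float:
    """Read up to max_bytes sequentially from path; return bytes per second."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


# Demonstration on a temporary file. On a real system the paths would be
# block devices (hypothetical examples: /dev/nvme0n1 and /dev/md0), read as root.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name

rate = read_throughput(demo_path)
os.remove(demo_path)
print(f"{rate / 1e6:.1f} MB/s")
```

Running the same function against each member disk and then against the array would show whether the slowdown appears only at the MDRAID layer.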

Issue could be related to

  • hitting a kernel performance bug.
  • MDRAID NVMe issue.
  • the file system could have become massively fragmented, particularly in the directory indexes.

Suggestions

  • Remove a disk, wipe it, secure erase it and then re-add it and check performance.
  • Rewrite a file with zeros - Grant tried this.

Grant has tried

  • Upgraded BIOS.
  • Removed iLO card.
  • Removed power-cap.
  • Upgraded firmware.
  • Looked at kernel parameters related to tuning performance and schedulers.
  • e4fsck directory optimise.
  • Defragmented the file system.
  • Checked RAID6 system - no mismatches - check was very fast.

Machine is a bit faster now, but not near the performance it should have.

Other points mentioned during discussion

  • Haven't tested write performance per disk.
  • Bottleneck is hard to decipher from metrics in Prometheus.

Editor policy

Deferred.


Action items reviewed at the beginning of the meeting

  • 2024-03-07 Grant to open tickets about not forwarding incoming spammy tickets to other email servers, where they get bounced [Topic: email notifications getting marked as spam]
  • 2024-03-07 Grant and Guillaume to open a github issue about the redundancy of gateway / IPv6 "private subnet". Benefit / "Cost" [Topic: Redundancy of Gateway / IPv6 "private subnet"]
  • 2024-02-08 OWG to review the Editor policy during one of the next calls and possibly vote on it. [Editor Policy adding to OpenStreetMap.org]
  • 2023-11-30 Grant to revisit the "policy for purchasing" document, which currently is focused on specs, and add information such as the process for obtaining approval for purchases. [Reportage] Added info: Who Approves / Steps etc -> Grant to create GitHub ticket
  • 2023-11-30 OPS to review the issue of spam reports to ISPs in 6 months (May 2024) -> Grant to create GitHub ticket
  • 2023-05-18 Paul to start an open document listing goals for longer-term planning. [Topic: Longer-term planning]

Action items that have been stricken-through are either completed, or have been moved to GitHub tickets.