Operations/Minutes/2023-11-02

From OpenStreetMap Foundation

OpenStreetMap Foundation, Operations Meeting - Draft minutes

These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.

Thursday 2 November 2023, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co

Participants

Minutes by Dorothea Kazazi.

Absent

New action items from this meeting

Reportage

GitHub template for missing attribution

  • 2023-09-07 Paul to create a GitHub template for the new repository https://github.com/openstreetmap/tile-attribution/ which will be only for cases of missing attribution from sites using our tiles. [Topic: (With LWG) Issue template/checklist for blocking sites without attribution]

At what point do we have someone from the OSMF looking to make sure that the action is reasonable?

  • We decided that we needed to have some OSMF oversight before blocking sites that allegedly use OSMF tiles without attribution.
  • Mateusz Konieczny and Guillaume Rischard are taking care of it.

On the GitHub repository of domains which use OSMF tiles and lack attribution

  • Grant Slater has worked on the automation required to export, from the tile attribution repository, the list of domains that should be blocked from receiving OSMF tiles.
  • The process needs some documentation.
  • If someone with the right privileges adds the "accepted" tag to a ticket on that repository, the domain remains on the list of domains blocked from receiving OSMF tiles. Once the ticket is closed the domain is removed from the list.
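The tag-based rule above can be sketched as follows. This is only an illustration, not Grant's automation: the ticket dictionaries, the `domain` field and the function name `blocked_domains` are hypothetical. Only the "accepted" tag and the open/closed ticket states come from the discussion.

```python
def blocked_domains(tickets):
    """Return the sorted list of domains to block.

    A domain is listed while its ticket is open and carries the
    "accepted" tag; closing the ticket removes it from the list.
    """
    return sorted({
        ticket["domain"]
        for ticket in tickets
        if ticket["state"] == "open" and "accepted" in ticket["tags"]
    })


# Hypothetical tickets, for illustration only.
tickets = [
    {"domain": "example.com", "state": "open", "tags": ["accepted"]},
    {"domain": "example.org", "state": "open", "tags": []},
    {"domain": "example.net", "state": "closed", "tags": ["accepted"]},
]
print(blocked_domains(tickets))
```

Only the open ticket with the "accepted" tag contributes a domain; the untagged ticket and the closed ticket do not.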

Suggestion: Automatically add the list to the Fastly dictionary.

OPNVKarte featured layer

  • Had an SSL certificate issue which was fixed.
  • Concern: repeated issues, and we are unable to get a response from the single person who supports the featured layer.
  • Getting 404 on some of the non-cached tiles.

Suggestions

  • Remove OPNVKarte.
  • Grant offered to rebase.

Other point mentioned during discussion: Might not have time to rebase.

Action item: Paul Norman to update the pull request https://github.com/openstreetmap/openstreetmap-website/pull/4126

HE Network

The problematic link is up.

Dublin outage

  • The genuine downtime we had was minimal: we had 1 hour with 20-30% packet loss.
  • Our packet loss figure does not capture the full outage, as connectivity was lost to parts of Europe.
  • It depends where the traffic gets handed to HE.
  • The outage impacted traffic from Eastern Europe going to Dublin.
  • Various outages in the month.
  • They clearly had other capacity, it just wasn't sufficient.

On refund

  • Missed out on a refund for the Dublin outage, as we didn't raise the issue with them.
  • You can get a refund if:
    • you raise the issue and they don't fix it within a two-hour period.
    • full internet outage - but refund still complicated. Email might be needed to claim the refund.
  • The document doesn't make clear if there's a double dip option in a network outage scenario.

Suggestions

  • Explicitly say if both sites are out, preferably by separate tickets.
  • Paul to call the salesperson and ask what they plan to do to prevent future outages if a cable is lost.
    • US case: power in one of their own data centers, where they didn't have enough back-up power capacity.
  • Different ISPs would be ideal, long-term.
  • Anycast: announce the same /24 from both data centers, with the data center reached depending on the ASN that the person is coming from.
    • We wouldn't want to do that because each data center is going to be using its own database and then we would get a replication problem with databases.
    • The link would be the problem and how you run that tunnel - we can't run that tunnel across 4G.

On BGP

  • BGP tunnel - the tunnel would have gone down.
  • BGP, own subnet, 2 ISPs: we could change how we announce it.
  • We don't have the capacity to take on the additional workload of BGP.
  • Once BGP is set up it works, and we can get help from people like Clement.

Other points mentioned during discussion

  • Both links will be end of life next year.
  • Resiliency in AMS is better.

Tom disconnected 21'.

osm2pgsql-replication

25' Tom rejoined.

Related: https://switch2osm.org/serving-tiles/updating-as-people-edit-osm2pgsql-replication/

Paul did a setup with osm2pgsql for a client and it is easier than what we have.

On complaints

  • We've had complaints about the algorithm we use.
  • Most have been because of our version of the expirer. People change a relation and expect to see a change on the map.
  • Recent case: Name of an island (relation), related to recent vandalism.

On expiring tile algorithm

  • Custom, Ruby implementation, stored in Chef.
  • It looks at the location of the nodes and the way nodes that are touched.
  • It only takes an action once per invocation.
  • The script will have to read a list of tiles, but it doesn't de-duplicate the list.
    • it would de-duplicate, but probably not meta tiles.
  • It gives tile numbers which "renderd expire" (which marks the tiles as dirty) can read.
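As an illustration of the node-based approach described above, here is a minimal sketch in Python. It is not the Ruby implementation stored in Chef. The function names and the 8x8 meta tile size are assumptions, and it de-duplicates at meta tile level, which the discussion suggests the current script probably does not do.

```python
import math

METATILE = 8  # assumption: meta tiles of 8x8 tiles


def node_to_tile(lat, lon, zoom):
    """Standard slippy-map tile (x, y) containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or extreme latitudes stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def dirty_metatiles(nodes, zoom):
    """Map touched nodes to the set of meta tiles that contain them.

    Using a set de-duplicates: many nodes in one meta tile produce
    a single entry, identified by its top-left tile.
    """
    metatiles = set()
    for lat, lon in nodes:
        x, y = node_to_tile(lat, lon, zoom)
        metatiles.add((x - x % METATILE, y - y % METATILE))
    return metatiles


# Two hypothetical nodes a few metres apart fall in the same meta tile.
touched = [(51.5, -0.1), (51.5001, -0.1001)]
print(dirty_metatiles(touched, 13))
```

Collapsing to meta tiles before handing the list on means each meta tile is marked dirty once, however many touched nodes it contains.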

Suggestions

  • Have a PR to make it sleep for less than 1 minute between retries.
  • We ignore relations completely, so we only handle nodes and ways. If one changes a relation, that won't cause any dirtying, whether it's tags or members. A way which spans a tile without having a node in it won't dirty that tile; it will only dirty the tiles where the way has nodes.
  • We shouldn't push expiries all the way to the edge.
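The gap for long ways mentioned above can be shown with a small worked example. The way, its two node longitudes and the zoom level are hypothetical; the tile column formula is the standard slippy-map one.

```python
def lon_to_tile_x(lon, zoom):
    """Slippy-map tile column containing a longitude."""
    return int((lon + 180.0) / 360.0 * 2 ** zoom)


zoom = 12
# Hypothetical way: a straight segment between two nodes at the same latitude.
way_node_lons = [0.0, 1.0]

# Node-based expiry only dirties the columns that contain a node.
dirtied = {lon_to_tile_x(lon, zoom) for lon in way_node_lons}

# The segment itself crosses every column between its end nodes.
crossed = set(range(min(dirtied), max(dirtied) + 1))

missed = crossed - dirtied
print(len(crossed), len(dirtied), len(missed))
```

Every column in `missed` is crossed by the way but contains none of its nodes, so a node-based expiry would leave it stale.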

Other points mentioned during discussion

  • Servers are more reliable about updates, so we notice cases where they're not updating.
  • There's no point going to osm2pgsql replication if we don't also move the dirty algorithm.
  • With our load levels the issue is not how many tiles are being requested from the back end, it's how many renders it has to do.

Action item: Paul to sketch out the different components we would need to implement and how they relate.

Fastly purges

  • Grant tested a Fastly purge script and it was very slow.
  • Can't hit the 100,000 requests per hour limit sequentially, because their API responds so slowly.
    • the limit is per account, so probably across multiple distributions.
  • We can assume anything in the level one cache is also in the level two.
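The arithmetic behind the point about sequential requests, as a rough sketch. The 100,000 per hour limit is from the discussion; the 200 ms response time is a hypothetical value for illustration, not a measurement from Grant's test.

```python
LIMIT_PER_HOUR = 100_000   # API limit mentioned in the discussion
MS_PER_HOUR = 3_600_000


def max_sequential_per_hour(latency_ms):
    """Requests per hour when each request waits for the previous one."""
    return MS_PER_HOUR // latency_ms


# Time each request may take if sequential calls are to reach the limit.
budget_ms = MS_PER_HOUR / LIMIT_PER_HOUR

hypothetical_latency_ms = 200  # assumption, for illustration only
achievable = max_sequential_per_hour(hypothetical_latency_ms)

print(budget_ms, achievable, achievable < LIMIT_PER_HOUR)
```

Any response time above 36 ms per request keeps a purely sequential script below the limit.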

Suggestion: use fire headers.

Any other business

AMS data center visit

  • Paul to visit the AMS data center on Thursday.
  • Delivery expected soon, probably tomorrow - past customs.
  • Grant had some difficulty finding the correct tax code for customs.

Suggestion: shut down Karm (read-only database mirror for www.osm.org) completely, to switch both of the PSUs: going from 1600 watts to 1000 watts.

Action item: Dorothea to create accounts on the OSMF NextCloud server for the OWG members and add Grant as an admin.

Delivery of hard disks for Prometheus

Arrived, according to Lance.

  • Suggestion: Go for RAID 6.
  • From RAID 5 to RAID 6 we would have to add a disk.
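Why an extra disk is needed can be seen from the parity overhead: RAID 5 uses one disk's worth of capacity for parity and RAID 6 uses two. A minimal sketch follows; the disk count is hypothetical, as the actual number of disks was not stated.

```python
PARITY_DISKS = {5: 1, 6: 2}  # parity overhead per RAID level


def usable_disks(total_disks, level):
    """Number of disks' worth of usable capacity in a RAID 5 or RAID 6 array."""
    parity = PARITY_DISKS[level]
    if total_disks < parity + 2:
        raise ValueError(f"RAID {level} needs at least {parity + 2} disks")
    return total_disks - parity


disks = 6  # hypothetical disk count
print(usable_disks(disks, 5), usable_disks(disks, 6), usable_disks(disks + 1, 6))
```

With the same disks, RAID 6 offers one disk less usable capacity than RAID 5, so keeping the capacity means adding one disk.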

Question by NorthCrab about LVMs

  • We used to use LVMs but it's painful, because we have hardware RAID on many of the machines.
  • Used HP machines come standard with RAID.
  • We tend to allocate 100% of the storage.
  • The hardware RAID controller used to be a pre-installed component, but is optional on Gen9 and Gen10.
  • Every disk we get from now on is probably going to be NVMe.

Restore archive planet files to S3

Operations ticket 967: https://github.com/openstreetmap/operations/issues/967

Ongoing.

Grant's script is currently doing:

  • planet PBF files from 2021.
  • full history from 2020.
  • planet OSMs from 2022.

Other points mentioned during discussion

  • It takes nearly two days to restore a file, but it restores a full year batch in one go.
  • It may finish around the end of this week.

Open Ops Tickets

Review open tickets: what needs policy and what needs someone to help.

Action items

  • 2023-10-19 Grant to open an issue on the S3 script. [Topic: Proceed with moving replication Diffs to AWS? https://github.com/openstreetmap/operations/issues/971]
  • 2023-09-21 Grant to create a table on the cache headers that we send. [Topic: Surrogate key patch]
  • 2023-09-07 Paul to create a GitHub template for the new repository https://github.com/openstreetmap/tile-attribution/ which will be only for cases of missing attribution from sites using our tiles. [Topic: (With LWG) Issue template/checklist for blocking sites without attribution]
  • 2023-06-29 Grant to put Martijn van Exel's policy for addition of OSM editors to the osm.org menu out for feedback. [Topic: Draft policy by Martijn van Exel]
  • 2023-05-18 Paul to start an open document listing goals for longer-term planning. [Topic: Longer-term planning]
  • 2023-05-04 [WordPress] Grant to share list of WordPress users with Dorothea and their response to keeping an account. [Topic: WordPress security] - Shared, but additional work required