Operations/Minutes/2023-01-12

From OpenStreetMap Foundation

OpenStreetMap Foundation, Operations Meeting - Draft minutes

These minutes do not go through a formal acceptance process.
This is not strictly an Operations Working Group (OWG) meeting.

Thursday 12 January 2023, 19:00 London time
Location: Video room at https://osmvideo.cloud68.co

Participants

Guests

Minutes by Dorothea Kazazi.

New action items from this meeting

  • Guillaume Rischard to look at bandwidth stats for next OPS meeting to make predictions for future consumption. [Reportage: Network upgrades AM6]
  • Grant Slater to experiment with Netbox. [Topic: Asset management]
  • Grant Slater to scrap Grisu - ask Bytemark to strip the disks and post them. [Reportage: Decommissioning]
  • Grant Slater to email Slough to see if he can make any arrangements. [Reportage: Decommissioning]

Reportage

Network upgrades AM6

Status: Pending.

Guillaume looked at the quotes from HE.net.
The biggest consumer is the Planet: it takes a few days to generate and about 25 minutes to transfer over 1G, so switching to 10G would not save much time.
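The time-saving can be estimated with a minimal sketch. It uses only the 25-minute figure above and assumes that transfer time scales inversely with link speed and that the link is the only bottleneck (neither assumption was stated in the meeting). The function name is illustrative.

```python
# Figure from the minutes: the Planet transfer takes about 25 minutes at 1 Gbit/s.
MINUTES_AT_1G = 25.0


def transfer_minutes(link_gbits: float, minutes_at_1g: float = MINUTES_AT_1G) -> float:
    """Estimate transfer time, assuming it scales inversely with link speed."""
    if link_gbits <= 0:
        raise ValueError("link speed must be positive")
    return minutes_at_1g / link_gbits


# Minutes saved per transfer by moving from 1G to 10G under that assumption.
saved = transfer_minutes(1) - transfer_minutes(10)
print(f"10G transfer: {transfer_minutes(10):.1f} min, saving {saved:.1f} min")
# prints "10G transfer: 2.5 min, saving 22.5 min"
```

Under these assumptions the upgrade saves roughly 22 minutes on a file that takes days to generate.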

Dublin: 10G link with 10G commit, but we are limited by 2G per machine and inbound.

Dublin prices

  • 1 G - 375 USD
  • 10 G - 1200 USD
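
As a rough comparison of the two Dublin prices, the sketch below computes the price per Gbit/s of commit and the ratio between the two options. The billing period is not stated in these minutes, so the figures are treated as plain USD amounts. The names are illustrative.

```python
# Dublin prices from the minutes: commit in Gbit/s -> quoted price in USD.
# The billing period is not stated in the minutes.
DUBLIN_PRICES_USD = {1: 375, 10: 1200}


def price_per_gbit(prices: dict[int, float]) -> dict[int, float]:
    """Return the price per Gbit/s of commit for each option."""
    return {gbits: usd / gbits for gbits, usd in prices.items()}


per_gbit = price_per_gbit(DUBLIN_PRICES_USD)
ratio = DUBLIN_PRICES_USD[10] / DUBLIN_PRICES_USD[1]
print(per_gbit)  # {1: 375.0, 10: 120.0}
print(f"10G costs {ratio:.1f}x the 1G price")  # 10G costs 3.2x the 1G price
```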

It looks like we should not upgrade if we move the planet to S3.
We can monitor how bandwidth usage evolves.

HE.net quotes

  • A full 10G commit was cheaper than a commit of less than 10G.
  • We can negotiate for a better price.

Action item: Guillaume to look at bandwidth stats for next OPS meeting to make predictions for future consumption.

Imagery

  • Grant has been playing with imagery. He has not chosen the standard yet. He is going to use cloud-optimised GeoTIFFs.
  • A New Zealand governmental institution wrote a promising piece of open-source code, which does not have documentation yet.

Render update

See https://twitter.com/OSM_Tech/status/1613681130829406209

Discourse

Guillaume asked Grant to work mostly on the Discourse migration.

Upcoming downtime

Scheduled downtime is before the next meeting.

Suggestion: Add a banner on the site.

We no longer have a good way to show a text banner; image banners have been used in recent years.

Decommissioning

Michelle Heydon (OSMF accountant) is updating the asset registry (deprecated assets/etc).

Suggestion: Turn servers off at Bytemark.

Action item: Grant to scrap Grisu - ask Bytemark to strip the disks and post them.

UCL/Slough

Slough:

  • Sponsored data center which we had from very early in the project.
  • JANET owns the data-center and University College London (UCL) is a client.
  • We don't pay, but we have issues with accessing the site. We are not UCL employees, and all their policies are about authorised employees having access.
  • 3-4 machines there.
  • They have their own room for decommissioning, but Grant can't access that.

Action item: Grant to email Slough to see if he can make any arrangements.

Asset management

  • Current snapshot of assets on hardware.osm.org
  • We don't have anything for the history of assets.
  • Used the wiki, but it's not that useful.

Issue: Duplication of data. The idea of creating a service which removes duplication was rejected, because it would create a dependency tree and circular dependencies.

Suggestion: move some data into Chef, effectively as Chef annotations.

Netbox

  • IP address management, automatically resolves where the cables go based on the ports, it has external databases of hardware which can be imported, creates rack/cabling diagrams.
  • The biggest effort is getting everything into it and being smart about how the data is used.
  • It is easier to use Netbox now that most of our machines are standardised.
  • Paul and Grant have experimented with Netbox in the past.

Concern

  • Avoid creating circular dependencies with respect to documentation (e.g. a scenario where Netbox is unavailable).
    • Suggestion: Export data from the service, so that it is still available in a disaster scenario.
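
The export suggestion could look roughly like the sketch below. It was not discussed in this form at the meeting. It reads a paginated list endpoint of the Netbox REST API (for example `/api/dcim/devices/`, authenticated with an `Authorization: Token ...` header) and serialises all objects to JSON, which could then be stored somewhere that does not depend on Netbox. The host name, token and function names are placeholders.

```python
import json
import urllib.request
from typing import Callable, Iterator


def fetch_json(url: str, token: str) -> dict:
    """Fetch one page from the Netbox REST API using token authentication."""
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Token {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_objects(first_url: str, fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every object of a paginated list endpoint by following 'next' links."""
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page["results"]
        url = page.get("next")  # None on the last page


def snapshot_json(first_url: str, fetch_page: Callable[[str], dict]) -> str:
    """Serialise all objects of one endpoint as stable, diff-friendly JSON."""
    return json.dumps(list(iter_objects(first_url, fetch_page)), indent=2, sort_keys=True)


# Example wiring (not run here; host and token are placeholders):
#   url = "https://netbox.example.org/api/dcim/devices/?limit=200"
#   text = snapshot_json(url, lambda u: fetch_json(u, token="PLACEHOLDER"))
#   # ...then write `text` to storage that stays reachable when Netbox is down.
```

Passing the page-fetching function in as a parameter keeps the pagination logic testable without a running Netbox instance.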

Action item: Grant to experiment with Netbox.

Link shared: https://github.com/netbox-community/netbox-docker

AWS Sponsorship

Should we try to get all AWS costs sponsored?

  • All of our credits are in an AWS sub-account, which is used to pay.
  • Our primary account is taking credits from the sub-account.
  • Sub-account should be using 3K USD/month.
  • We will run out of credits in about 10 months.
  • A big part of the costs for the general account is the logging for the CDN and the planet backups.
  • The planet account can't be used for any other purpose.

Option: Credit sharing (which was enabled by default) could be disabled and we could pay normally.

On sponsorship for all AWS things we use

  • Could be done but would be a lot of admin work for AWS.

Decision: Get planet tasks done and then ask for AWS sponsorship.

Remote hands performance at AMS

  • Took 1.5 hr, the same as data centers in the Netherlands and Dublin to do the same job.
  • Still cheaper than visiting in person.
  • There was an issue with remote hands during the installation of the servers, which is considered to be a one-off.
  • Using remote hands also means that Grant has to document in detail what they have to do and confirm what has been done.

Network cards installation with remote hands
Related ticket: https://github.com/openstreetmap/operations/issues/821

  • 0.5 hr for remote hands.
  • We were billed for 2 network cards, 3 were installed and we had asked for 4.
  • Issue with finding the ports.

Suggestion for next installation with remote hands

  • Provide prior documentation regarding the machines and ports, with diagrams.

About supplier and hardware
Supplier: We used to go with a UK company, but since Brexit this is not an option.

Hardware buying

  • We currently buy 2-year-old hardware and use it for another 8-10 years.
  • We are moving to SSDs. Their least reliable part is probably the power supply.

On disk failures

  • It is rare to have unexpected failures. We had 1 case of SSD failure, which was a donated, 1st-generation drive.
  • We do not put appreciable wear on them and we put a lot of monitoring in place and regular tests (daily/weekly).

The cost of a disk failure is not just the disk

  • We have to order a new disk.
  • Get it shipped to the data center and unpacked.
  • Installed.

If we were a professional commercial enterprise

  • Would use just SSDs. We currently replace spinning disks with SSDs once they fail.
  • A support contract does not help in terms of reliability.
  • We could buy a pre-packaged remote hands bundle. It would increase the cost and there are months where we don't use remote hands at all.
  • Current status considered good
    • We buy reliable hardware.
    • Machines are built so that they have redundancy and can handle two disk failures.
    • As soon as we have a failing disk, we try to replace it.

Open Operations Tickets

Review open tickets: what needs policy and what needs someone to help with.

Action items

  • 2023-01-12 Guillaume to look at bandwidth stats for next OPS meeting to make predictions for future consumption. [Reportage: Network upgrades AM6]
  • 2023-01-12 Grant to experiment with Netbox. [Topic: Asset management]
  • 2023-01-12 Grant to scrap Grisu - ask Bytemark to strip the disks and post them. [Reportage: Decommissioning]
  • 2023-01-12 Grant to email Slough to see if he can make any arrangements. [Reportage: Decommissioning]
  • 2022-12-15 Guillaume to make sure Grant does the forum to Discourse migration. [Topic: Forum to Discourse https://github.com/openstreetmap/operations/issues/604 ] # 2022-12-29 In progress. # 2023-01-12 In progress
  • 2022-12-15 Grant to produce a PR and finish the OSQA one. [Topic: Containerisation of small services https://github.com/openstreetmap/operations/issues/807] # 2022-12-29 In progress.
  • 2022-11-03 [Network upgrades AM6] Guillaume to talk with Clement / Open Source ISP about reliability and get assurances that the virtualised Layer2 links to Paris will be reliable. # 2022-11-17 Pending # 2022-12-29 Changed from "OPS" to "Guillaume".
  • 2022-09-22 [Network upgrades @ AM6] Grant and Guillaume to talk to Clement about the options for network upgrades. # 2022-10-06 Guillaume talked with Clement
  • 2022-09-22 [Network upgrades @ AM6] Guillaume to provide a schema at next meeting.
  • 2022-09-22 [Network upgrades @ AM6] Grant and Guillaume to work out all the costs involved for option 1.
  • 2022-09-08 Grant to document Chef testing [Topic: How to get more people involved] # 2022-09-22 Chef kitchen tests running locally. # 2022-11-03 Pending. # 2022-11-17 Pending # 2022-12-29 In progress. # 2023-01-12 In progress - one of the tests fails randomly.
  • 2022-07-14 Guillaume to ask Brian Sperlongano about OpenMapTiles and YAML, to do a test run. [Topic: Vector tile status]. # 2022-09-22: Suspended for a while. Brian is back at school and has no time to work on this. Guillaume will try to move vector tiles forward with someone else. # 2022-12-29 Still suspended
  • 2020-12-02 [AWS] Grant to develop some thoughts on what is next for us using AWS. [Topic: AWS] # 2021-05-19 & 2021-06-02 & 2021-06-16 postponed for a few weeks. # 2022-12-29 In progress.
  • 2020-07-29 [AWS] Grant to enable background sync to AWS S3. [Topic: Ironbelly] #2020-08-12&26 & 2021-06-02 Manually run, automated scripting to be added. # 2021-05-19 Grant to run the script again. # 2022-04-09 Still manually run. # 2022-09-22 wants to make sure there are absolutely minimum permissions. # 2022-12-29 In progress. # 2023-01-12 Still manually run. Grant to look at the policies together with Paul.
  • 2020-07-01 Paul to create a ticket about solutions to reduce incoming comms. [Topic:Revision of acceptable use policy to reduce incoming comms] # 2021-05-19 decision to leave the action item open. # 2021-06-02 discussion about priority for account deletion. # 2022-04-09 Grant can show Paul how to do that with autoresponder which Tom built. Might be better to work on an online form (action item below).
  • 2020-07-01 Grant to work out some of the questions for an online form as a solution to reduce incoming comms. [Topic: Revision of acceptable use policy to reduce incoming comms] 2020-08-12 need to think about the reply # 2021-05-19 decision to leave the action item open. # 2022-04-09 Grant is thinking about examples. Suggestion to add what is considered large for tile usage.
  • 2020-04-10 Grant to work out a table of different data bits, work out how they are backed up and what can be potentially improved. [Topic: High Availability / redundancy of OpenStreetMap.org (and primary services)] # 2021-05-19 decision to leave the action item open. # 2021-06-02 pending

Meeting adjourned 67' after start.


Next meeting

Thursday 26 January 2023, 19:00 London time, unless rescheduled.

Operations meetings are currently being held every two Thursdays, at 19:00 London time.
Online calendar showing the OPS meetings.