Working Group Minutes/EWG 2013-04-01

Attendees

IRC nick	Real name
apmon	Kai Krueger
Firefishy	Grant Slater
iandees	Ian Dees
tmcw	Tom MacWright
zere	Matt Amos
Summary

Long lines rendering issue
- pnorman reported an issue where very long (world-spanning) lines were created in the osm2pgsql rendering database, causing strange visual artefacts and long rendering times.
- There was discussion of which component was likely to exhibit the problem, and work-arounds.
- ACTION: zere to try to reproduce long way issue and find out what component is affected and file a ticket as appropriate.
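One way to set up the reproduction suggested in the discussion (start from an empty database, then apply a diff containing a single long way) is sketched below. It only builds the osmChange document. The coordinates, IDs, timestamp, changeset number, `highway` tag and the function name `build_long_way_diff` are illustrative assumptions, not values from the meeting, and how the file is then applied to the rendering database is left open.

```python
import xml.etree.ElementTree as ET


def build_long_way_diff(start, end, first_id=1):
    """Return an osmChange document (as a string) that creates two nodes
    and one way joining them.

    start and end are (lat, lon) tuples. All IDs, the timestamp and the
    changeset number are placeholders chosen for this sketch.
    """
    root = ET.Element("osmChange", {"version": "0.6", "generator": "long-way-sketch"})
    create = ET.SubElement(root, "create")
    common = {"version": "1", "timestamp": "2013-04-01T00:00:00Z", "changeset": "1"}

    # One node per end point of the way.
    node_ids = []
    for offset, (lat, lon) in enumerate((start, end)):
        node_id = str(first_id + offset)
        attrs = {"id": node_id, "lat": f"{lat:.7f}", "lon": f"{lon:.7f}"}
        attrs.update(common)
        ET.SubElement(create, "node", attrs)
        node_ids.append(node_id)

    # The way itself, referencing both nodes in order.
    way_attrs = {"id": str(first_id)}
    way_attrs.update(common)
    way = ET.SubElement(create, "way", way_attrs)
    for node_id in node_ids:
        ET.SubElement(way, "nd", {"ref": node_id})
    # Assumed tag, so that a typical style would actually draw the way.
    ET.SubElement(way, "tag", {"k": "highway", "v": "residential"})

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Rough, illustrative positions: one end in Liechtenstein, one in Australia.
    print(build_long_way_diff((47.14, 9.52), (-25.0, 134.0)))
```

Keeping the test case down to two nodes and one way should make it easier to tell which component mishandles the geometry.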
API outages
- The question was raised over how editors can find out about API outages other than simply getting failed API requests.
- ACTION: zere to add something to the capabilities call to make this information available to editors.
- There was discussion of whether the pingdom monitoring was fine-grained enough to cover all the components of the API.
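As a rough illustration of how an editor might use such information once it is exposed, the sketch below parses a capabilities-style response and decides whether uploads should be attempted. The `status` element, its `database` and `api` attributes, and the sample document are assumptions made for this example; the minutes only record that the API has online, offline and read-only states and that the capabilities call did not report them at the time.

```python
import xml.etree.ElementTree as ET

# Hypothetical capabilities response; the <status> element is assumed.
SAMPLE = """<osm version="0.6">
  <api>
    <status database="online" api="readonly"/>
  </api>
</osm>"""


def api_write_status(capabilities_xml):
    """Return the reported API state, or "unknown" if none is present."""
    root = ET.fromstring(capabilities_xml)
    status = root.find("./api/status")
    if status is None:
        return "unknown"
    return status.get("api", "unknown")


def can_upload(capabilities_xml):
    """True only when the API explicitly reports itself as online."""
    return api_write_status(capabilities_xml) == "online"


if __name__ == "__main__":
    state = api_write_status(SAMPLE)
    if not can_upload(SAMPLE):
        print(f"API state is '{state}': warn the user and keep edits locally")
```

Treating a missing status as "unknown" rather than "online" is a deliberate choice here, so that an editor talking to an older server can fall back to its existing behaviour instead of assuming that writes will succeed.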
IRC Log

17:04:27 <zere> welcome, everyone. hope the new time hasn't been confusing
17:05:04 <zere> minutes of the last meeting: http://www.osmfoundation.org/wiki/Working_Group_Minutes/EWG_2013-03-25
17:05:13 <zere> please let me know if anything needs changing
17:06:06 <zere> pnorman mentioned an issue last meeting which we didn't have time to properly discuss, so let's start with that today.
17:06:51 <zere> the issue was that a large way was created spanning pretty much the whole world, and it had a bunch of unfortunate downstream effects in osm2pgsql, mapnik and friends
17:09:33 <zere> apmon, you're the most knowledgeable about that part of the stack - are there specific things we should be breaking out of this, or is it a straightforward bug we should file a ticket against (in osm2pgsql / renderd / mapnik)?
17:10:29 <apmon> that part is most likely a mapnik issue (which I know relatively little about)
17:11:38 <apmon> I haven't looked at the issue in any detail though
17:11:45 <zere> all i could think of for mapnik would be that a large way would cause precision issues when agg scaled to fixed point numbers.
17:11:53 <apmon> I do think it has happened in the past though as well
17:12:04 <zere> but i thought osm2pgsql chopped up long ways to avoid that sort of thing?
17:12:39 <apmon> I'd have to check if it does (and if that is still working)
17:15:30 <apmon> It does look like there is code in osm2pgsql that splits long geometries
17:15:40 <apmon> I'll need to investigate that further then
17:16:03 <zere> ok. worth filing a ticket so it doesn't get lost, or not?
17:16:31 <apmon> probably yes. But I guess it isn't clear yet which component exactly it applies to
17:16:49 <apmon> probably most helpful would be if someone can reproduce the issue in a small local extract
17:17:17 <zere> indeed. do we have any volunteers for that?
17:17:30 * zere looks in pnorman's general direction
17:17:32 <apmon> e.g. see if one can reproduce it by taking a liechtenstein extract, then moving one node in a way to australia and see what happens
17:18:19 <zere> not sure you'd even need the liechtenstein extract. i suspect it would be enough to start with an empty db, then add a single long way in a diff.
17:18:50 <zere> i suppose, having just said it was easy, that means i should volunteer ;-)
17:19:08 <apmon> :-)
17:19:18 <zere> #action zere to try to reproduce long way issue and find out what component is affected and file a ticket as appropriate
17:19:37 <zere> i'm guessing the easter weekend means we're underpopulated here...
17:19:48 <zere> but does anyone have anything else they'd like to discuss?
17:20:36 <iandees> anything we should share about the outage?
17:22:14 <zere> from a software perspective, not so much... my understanding is that it was a hardware failure and Firefishy went to heroic lengths to fix it on a holiday sunday.
17:22:51 <zere> is there anything we should share in addition to what's been shared already?
17:23:34 <apmon> I guess a question is can something be done to reduce the probability of Firefishy needing to go out to the DC on a holiday sunday
17:23:49 <apmon> although I suspect that is more of an issue for the OWG than EWG
17:24:12 * iandees nods.
17:24:20 <iandees> I was thinking mostly guidance for communication
17:24:28 <iandees> but i didn't scroll far enough down on the blog post
17:24:36 <iandees> which is fine
17:24:55 <iandees> (http://blog.osmfoundation.org/2013/03/31/database-maintenance/ is what i'm referring to)
17:24:57 <apmon> Perhaps one question for EWG, is how did the editors handle the API outage
17:24:58 <zere> the plan is there for an extra server capable of being failed-over-to. it's just that reality pre-empted the plan somewhat. should have moved quicker, with the benefit of hindsight
17:25:27 <apmon> Did many people lose a lot of data because they did edits and then couldn't upload them?
17:26:15 <apmon> Particularly the period of read-only rather than offline might have caused issues to josm and non osm.org potlatch users
17:26:46 <iandees> also, is there any software work that can be done to automatically fail over to "read-only" mode so the website is at least responsive?
17:26:48 <apmon> all editors should probably have a good warning system if the api is read-only
17:27:58 <zere> yes, i'm not sure what potlatch users could do about it. with josm there's always the option of saving the change for later, but i don't think potlatch or iD has that, do they?
17:28:22 <zere> in an ideal world they shouldn't have to, of course...
17:29:02 <apmon> my guess would be  a good proportion of non expert JOSM users might not know how to deal with a read-only API either
17:29:34 <tmcw> zere: iD stores changes in localStorage so they can be recovered
17:29:44 <tmcw> but doesn't have much in terms of conflict resolution
17:30:00 <tmcw> is there a call to check if the api is read-only?
17:30:11 <zere> tmcw: cool.
17:32:11 <apmon> I think someone said the api/capabilities call doesn't provide that info
17:32:30 <zere> "writing" API calls will return service unavailable if the api is read only
17:32:45 <zere> but not capabilities - and that information isn't available there
17:32:57 <zere> i'll quickly add that...
17:33:58 <tmcw> please update https://github.com/systemed/iD/issues/1224 when there's some info
17:34:07 <iandees> zere: capabilities will return a 200 when others are returning 500's (thus pingdom requests a node)
17:35:18 <zere> i think capabilities was also returning 500. it did for me at times. i think because something in there references an activerecord class which wasn't loaded
17:35:40 <apmon> looking at the log of OSM-HealthCheck, it looks like some calls returned 500 whereas others returned service unavailable during offline mode
17:36:31 <iandees> i was thinking about other downtimes, not this most recent one
17:36:39 <zere> i wonder if database unavailability is something that's unit-testable within rails...
17:37:57 <apmon> Not sure if database unavailability is testable, but the db-offline and db-readonly mode probably should be
17:38:57 <zere> indeed, but i think any tests for db-offline mode when the database isn't really unavailable are giving us false passes.
17:39:48 <zere> iandees: i think i've missed your point, then... could you elaborate, please?
17:39:57 <apmon> you might be able to set a broken db config, if you can change the config on a per test case
17:41:02 <iandees> zere: there were times when people were complaining about map calls or individual node/way endpoints not responding but pingdom didn't alert because it was only sending alerts for when capabilities was failing
17:42:50 <zere> yeah, should be possible to cover more of the API with the pingdom tests. i'll ask Firefishy.
17:43:09 <iandees> right now it's doing capabilities and a node
17:43:15 <iandees> by id
17:43:27 <apmon> the online / offline / readonly status is in application.yml. So if you can set that in a unit test case you should be able to set the config settings in database.yml to simulate a non existing db server
17:45:52 <Firefishy> We are running out of pingdom checks, but yes. Pingdom can do lookups and parse results for text eg to pass.
17:46:34 <zere> ah, we only have a certain number of endpoints before we need to upgrade to a more expensive account?
17:46:42 <Firefishy> Yip.
17:46:48 <Firefishy> Although I can likely trim some.
17:48:35 <apmon> how expensive is a more expensive account?
17:49:13 <apmon> if it would be helpful, then presumably that is something that osmf could pay for?
17:49:13 <Firefishy> 15 extra checks are $9.38/month
17:49:20 <Firefishy> Yip
17:50:20 <apmon> That is amazingly expensive for what it sounds like it is doing, but still affordable it seems
17:53:05 <zere> i dunno, just over $100/year doesn't sound excessive, given the complexities of pinging these URLs from a bunch of servers all over the world and making sure that it's reporting real failure rather than intermediate network flakiness.
17:53:27 <zere> i mean, on top of the several hundred that the existing package costs ;-)
17:54:01 <zere> but then we're only looking at doing something /full or /history and the map call to cover all the bits of software, right?
17:55:09 <zere> speaking of which, i had a random question - is C++11 acceptable nowadays? if i ported cgimap over to require that, do you think it would cause problems?
17:57:13 <apmon> as cgimap isn't directly general purpose software, I think as long as it works on the osm servers it is more or less acceptable
17:57:28 <Firefishy> zere: It would be good if the /api/capabilities call report API readonly status
17:58:12 <zere> yup, i'll get right on it
17:59:02 <zere> anything else anyone wanted to discuss?
18:06:08 <zere> thanks to everyone for coming. hope to see you next week - same time: 5pm UTC.