Working Group Minutes/EWG 2013-05-06

Attendees

IRC nick	Real name
apmon	Kai Krueger
pnorman	Paul Norman
TomH	Tom Hughes
zere	Matt Amos
Summary

help.osm.org i18n
- Recap of previous discussion: Shapado looks like a better option, but it isn't clear what it would take to migrate the existing data over.
i18n as a "first class" feature
- apmon brought concerns from the talk-de mailing list about the lack of translations of the new "notes" functionality when it was launched.
- The main issues are:
  1. translatewiki translates only master and, as far as anyone recalled, does not track branches.
  2. Translations take some time to be verified and merged back into master.
  3. There is no "window" in which to get help with translations before new functionality is pushed live.
- It was generally agreed that we would want translations of a service before it goes live.
- There was a difference of opinion over whether a lack of translations should block a branch from being merged or going live.
- ACTION apmon to talk to Siebrand Mazeland to see if there's anything that can be done to improve translatewiki turn-around without it becoming onerous.
Carto benchmarking
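- pnorman is having trouble with EC2 / EBS: a new instance created from an AMI based on a snapshot of the root EBS volume fails to start and gives no reason.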
Dev server performance
- There was some discussion about what can be done to make the dev server load more even.
- ACTION zere to see what SSDs might be available to help alleviate load from the dev server's osm2pgsql database.
Planets on dev server
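- pnorman asked for the most recent and second most recent planet.pbf files to be kept somewhere central on the dev server (errol).
- ACTION zere to see what can be scripted up.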
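As a rough illustration of what such a script might do, the sketch below keeps the two newest planet files in a directory and deletes the rest, matching the request in the log for the most recent and second most recent planet.pbf. The file-name pattern, the function name `prune_planets` and the idea of sorting by name are assumptions made for the example, not anything agreed in the meeting; downloading the files is left out.

```python
from pathlib import Path


def prune_planets(directory: Path, keep: int = 2,
                  pattern: str = "planet-*.osm.pbf") -> list[Path]:
    """Delete all but the `keep` newest planet files and return the ones kept.

    Assumption: file names embed a sortable date (e.g. planet-130501.osm.pbf),
    so sorting by name in reverse puts the newest file first.
    """
    planets = sorted(directory.glob(pattern), key=lambda p: p.name, reverse=True)
    for old in planets[keep:]:
        old.unlink()  # remove anything older than the newest `keep` files
    return planets[:keep]
```

Run from cron after each new planet arrives, something along these lines would leave exactly two files in place; the directory it points at is for whoever administers the machine to choose.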
Tile serving
- There was a general discussion about what could be done to distribute tile rendering load across multiple sites while still retaining some form of shared storage.
- The concern was that, without shared storage, any additional machines merely end up duplicating the work of their siblings.
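- Rough calculations (about 100 writes/sec at roughly 20 KB each, against a 100 Mbps link) suggest that a shared "write-through" cache might work, i.e. each machine broadcasting its writes to the other machines while serving reads from its local copy.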
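The rough calculation can be reproduced from the figures quoted in the log: a peak of 550 reads/sec and 100 writes/sec on yevaud's tile disks, with an average request size of about 20 KB. The sketch below is only back-of-the-envelope arithmetic; the 100 Mbps link figure is the one guessed at in the discussion, the even split of rendering between two machines is an assumption, and 1 KB is taken as 1000 bytes.

```python
# Figures quoted in the log for yevaud's tile disks (sdb + sdc).
READ_IOPS = 550    # peak reads per second, summed over both disks
WRITE_IOPS = 100   # writes per second
REQUEST_KB = 20    # approximate average request size, in KB
LINK_MBPS = 100    # link capacity assumed in the discussion


def mbps(iops: float, request_kb: float) -> float:
    """Convert an I/O rate into megabits per second (1 KB = 1000 bytes)."""
    return iops * request_kb * 8 / 1000


all_traffic = mbps(READ_IOPS + WRITE_IOPS, REQUEST_KB)  # every tile read and write on the network
write_only = mbps(WRITE_IOPS, REQUEST_KB)               # only mirrored writes on the network
per_machine = write_only / 2                            # assumes renders split evenly between two machines

print(f"reads + writes over the network: {all_traffic:.0f} Mbps (link: {LINK_MBPS} Mbps)")
print(f"mirrored writes only:            {write_only:.0f} Mbps")
print(f"outgoing per machine:            {per_machine:.0f} Mbps")
```

On these numbers, putting all tile traffic on the network would need slightly more than the link can carry, while mirroring only the writes uses a small fraction of it. That is why the discussion favoured mirrored writes with local reads.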
IRC Log

17:00:34 <zere> welcome, all.
17:00:52 <zere> previous minutes: http://www.osmfoundation.org/wiki/Working_Group_Minutes/EWG_2013-04-29 and please let me know if there are any inaccuracies.
17:01:23 <zere> doesn't look like gravitystorm is here, and i haven't seen anything new w.r.t the READMEs
17:02:08 <zere> so let's move on to apmon's request last week
17:02:32 <zere> #topic help.osm.org i18n
17:02:50 * pnorman waves
17:02:59 <zere> i think the last update on this was that there was nothing really upstream in django to help us out, and we were considering a move to something else...
17:04:28 <apmon> In that case shapado might be an option
17:04:57 <apmon> the alternative would possibly be to have a ru.help.osm.org, de.help.osm.org, ....
17:06:53 <zere> the problem with splitting into multiple languages is that there's no opportunity to merge questions or answers across the different sites.
17:07:54 <zere> i guess in an ideal world it would show translated results through google translate or something. just to avoid "ghettoizing" other languages.
17:08:34 <zere> but yes, shapado was the alternative we were discussing (http://www.osmfoundation.org/wiki/Working_Group_Minutes/EWG_2012-10-15)
17:08:34 <apmon> I am bringing this up, as one of the first comments about the merge of notes on talk-de and the forum were complaints about the lack of translation
17:09:03 <apmon> and help is the main major feature that isn't at all translated so far
17:09:16 <zere> i'm sure that was fixed pretty quickly? or were the complaints about help.osm.org, rather than the notes branch?
17:10:19 <apmon> The initial complaint was about the notes branch, (which isn't yet fixed, as translatewiki hasn't yet synched the changes back), but help came up in that discussion as well
17:11:15 <apmon> so it really is a wider discussion of how to ensure i18n is a first class feature in all of osm.org's "services"
17:11:23 <zere> where we left the discussion last time was that shapado looks like a good option, but we weren't sure whether migrating the data across would be feasible.
17:13:40 <zere> i18n is one of those tough ones - it slides into a bit of a grey area. i'm sure everyone wants to see german translations of any UI changes immediately, but where do you draw the line? french, spanish and russian are all popular. italian, dutch and japanese, too.
17:13:52 <zere> where does the "first class feature" stop?
17:14:02 <apmon> I could try and post a question on http://shapado.com/questions to see if anyone knows a way to import a osqa dataset
17:14:29 <pnorman> is there a way to push the phrases that need translating to translatewiki before merging to master?
17:14:34 <zere> because it would clearly be quite impractical to get translations for every possible language.
17:14:44 <apmon> it stops at the ability to have seamless translations, if there are enough translators for the language
17:15:42 <zere> pnorman: sure, one could push the "new" phrases before they're used as a "pre-merge". it's pretty nasty. a better way might be to ask for help translating in the branch, without translatewiki.
17:15:59 <zere> iirc, translatewiki isn't able to handle branches. but i might be mistaken
17:16:10 <apmon> imho it would be good to have translatewiki pull from a separate branch, where one can merge the feature with its strings, get translatewiki to pull from there and, after a week or two, once translators have had a chance, merge it back into master
17:16:34 <zere> so... delay all merges for at least a week?
17:16:43 <apmon> for major features yes
17:17:31 <zere> could we do it the other way around? have translatewiki push the changes back within a day or so?
17:17:55 <apmon> well, translators need a bit of time as well
17:17:58 <pnorman> were the phrases for notes pretty much settled for some time before it was merged?
17:18:07 <apmon> but having a faster turn around from translatewiki would be good
17:18:40 <apmon> I can try and contact siebrand to see if / and what possibilities there are to improve things.
17:19:00 <apmon> E.g. a way to notify him, once a merge is ready that affects many strings
17:19:03 <zere> i think there was some work on the UI just before the merge, so it's possible that some, but not all, phrases were settled. it would have been better than nothing, i guess.
17:19:08 * pnorman is grabbing water
17:19:55 <apmon> It took 3 years to merge the notes branch. ;-) An extra week or two for translations would not have harmed
17:20:51 <apmon> And if we can setup an appropriate process, the extra effort should also be minimal
17:20:55 <zere> right - notes branch was a bit extreme in that. i think there have been other, shorter branches which changed the UI which were merged in a far shorter time
17:21:21 <zere> and i'm wary of trying to push any process which makes working on rails_port any harder
17:21:31 <pnorman> so there are two causes for the delay of the translations: the time to translate the phrases and technical/administrative delays with getting phrases to/from translatewiki?
17:21:48 <apmon> yes
17:22:04 <apmon> the first, is an inherent delay and there is not much you can do about that.
17:22:20 <apmon> although with appropriate notifications, that can be improved as well
17:23:06 <apmon> Btw, that week or two needed for translations, could probably also be used by CWG to get its blog posts / press releases ready, so that they can be posted on the day the features go live
17:23:33 <zere> i think the remedy might be along the lines pnorman is talking about 1) see if we can set up translatewiki to track (long-lived) branches too, 2) see if we can't speed up the round-trips when big new features are going live.
17:24:17 <zere> rather than do a "stop the merge" translation phase, the translation could have been tracking the notes branch for several months.
17:24:24 <pnorman> I'm thinking we'd only want this for stuff like routing, owl, overpass, etc
17:25:23 <apmon> For things that affect a significant and visible portion of the strings
17:26:02 <zere> ideally, any large UI change. but the downside of adding process is that it's another hurdle to jump and we already have enough of those.
17:26:32 <apmon> Well, that is what I am saying with making it a first class feature
17:26:56 <apmon> e.g. no-one would propose to get rid of the process of "code-review" as we have too many hurdles to jump through
17:27:27 <zere> what other "first class" features do we have?
17:27:30 <apmon> i18n is an important part of the process
17:27:42 <TomH> we're talking about a couple of bloody weeks, not world war three
17:28:02 <apmon> one could say the same about code quality
17:28:12 <apmon> the bugs will get fixed in a couple of weeks, no drama
17:28:21 <TomH> indeed, they did
17:28:41 <zere> one could say the same about design, but we've never had that. do we want all UI changes to go through a two week design/UX review?
17:28:55 <apmon> nevertheless one has a process in place to try and ensure that as few bugs as possible get through in the initial merge
17:29:11 <pnorman> if we can push to translatewiki before final code review it shouldn't hold up the branch - I think iD showed that the popular languages get mostly translated absurdly quickly
17:29:25 <zere> anyway, my point isn't about whether or not to have translations or to get help translating branches before merge - that's a good idea. my problem is with adding mandatory processes.
17:29:50 <TomH> look if TW can do that without getting in my way then fine
17:30:39 <TomH> but somehow I doubt it
17:30:39 <apmon> zere: Imagine the default language is japanese or arabic. Two weeks of untranslated time is quite a while
17:30:39 <zere> apmon: right, and i'm not saying that translations are bad, or that we shouldn't have translations.
17:31:09 <apmon> but you are saying it is not worth any real effort
17:31:19 <zere> no, i'm not saying that either.
17:31:20 <apmon> it just gets in the way
17:31:37 <zere> nope, i've never said that.
17:33:08 <pnorman> he's saying that the difficulties to adding to the rails port are already high enough I think
17:33:44 <apmon> i.e. it is not worth the effort
17:34:07 <zere> afaik, the only mandatory process we go through is code review. i'm suggesting that we don't add to that. i'm suggesting that there are improvements we can make without needing to add new difficulties.
17:34:44 <zere> it's pretty clear from this that when we merge major branches, we need to pay more attention to translations
17:35:42 <zere> however, it would seem that there are improvements we can make by flagging branches and having people work on them pre-merge which won't add any extra hurdles, but will achieve the result we want
17:35:55 <apmon> I'll try and contact Siebrand to see if there are options to improve the situation and then we can discuss which things are too burdensome and which would be acceptable
17:36:28 <apmon> and how best to do it in order to minimize the work necessary for all
17:37:09 <pnorman> faster translate round-trips right after major feature release would be a plus too, because we can be certain that with the wider audience there's going to be a bunch of translation changes
17:37:53 <zere> is there anything we can learn from other projects in this? i know KDE has a translation team to work on their i18n. is that something that we could do, or is crowd-sourcing it via translatewiki the best way?
17:39:13 <pnorman> iD went with crowd-sourcing, but using a different system
17:39:39 <apmon> I haven't seen many complaints about translatewiki other than the time lag so far
17:43:01 <zere> what's the cause of the time lag - simply that pushing changes more rapidly would cause some overload?
17:45:07 <apmon> not sure, but it seems to be not fully automated
17:48:47 <zere> ok, we wouldn't want to push this work to siebrand either... perhaps if there's some way we could help out with automating / extending the system to make it easier?
17:51:12 <zere> do we have anything else we want to talk about?
17:51:18 <zere> pnorman: any news on the benchmarking?
17:51:42 <pnorman> ya. i can't get the ec2 instance restarted with what I had loaded up
17:52:18 <zere> on EBS?
17:53:08 <pnorman> I used an instance with everything on ebs, now when I try to make a new instance with an ami based on a snapshot based on the / ebs it errors and doesn't tell me why
17:54:09 <zere> wow, that sucks. i guess this is part of learning EC2's intricacies ;-)
17:54:16 <pnorman> yes. or, in this case, not learning
17:55:43 <pnorman> i might try a new instance, install postgresql+postgis, mount the ebs with the postgres data, and shut down postgres, point it there, start it up
17:55:52 <pnorman> I did come across something that might be useful
17:56:19 <zere> i wish i could help, but i know nothing about EC2. i would have thought someone else would have tried to do the same thing before and would have posted a solution / work-around, though. EC2 is pretty popular...
17:57:08 <pnorman> amazon has various PD and openly licensed data (including wikipedia dumps) as EBS snapshots. I sent a message asking about adding OSM data to that list
17:58:15 <zere> cool. i thought someone (stamen, mapbox?) had one of those already. but if not, or if it's out of date, definitely worth setting something like that up.
17:59:04 <zere> how do they generally do it - as a database dump, raw file, pre-imported database disk image?
17:59:19 <pnorman> ah, looks like someone is.
17:59:35 <pnorman> except they're from march.
17:59:42 <pnorman> I was searching in the wrong place
18:00:17 <pnorman> someone used to produce osm2pgsql EBS volumes, I'm not sure how
18:00:33 <zere> for example: http://stackoverflow.com/questions/2544118/openstreetmap-amazon-ebs
18:00:43 <zere> looks like they could do with some sort of automation ;-)
18:01:39 <pnorman> 3 years ago, likely when someone was providing the disk images
18:02:48 <apmon> Well, this kind of brings up the question again, if we can do something to improve the IO load on errol?
18:03:03 <apmon> e.g. get some SSDs... ;-)
18:03:43 <pnorman> hmm yes - my render times were amazingly spikey. not sure if its IO or CPU, but I think it was an order of magnitude between min and max, with std dev = mean
18:04:36 <apmon> I don't think errol is overly CPU bound atm, and the CPU schedulers usually handle overload much better than disk schedulers on rotating disks
18:05:01 <zere> this suggests it's some sort of semi-regular process: http://munin.openstreetmap.org/openstreetmap/errol.openstreetmap/diskstats_utilization/index.html
18:05:21 <zere> disk utilization never really gets to 100% for very long.
18:05:25 <apmon> that might well be the osm2pgsql update process
18:05:28 <pnorman> my guess would be bing imagery analyzer
18:06:10 <zere> did the osm2pgsql updates start mid-Jan?
18:06:37 <apmon> can't remember, but sounds about right.
18:06:46 <apmon> I have set the cron job to every 15 minutes.
18:06:54 <apmon> I can change it to every hour and see if it changes
18:07:19 <zere> that's probably it, then. i wouldn't worry about changing it, though. we'll see what's going around in terms of SSDs.
18:07:35 <zere> i've got one at home (256GB, sadly) that i'm not actually using, for example.
18:07:38 <pnorman> I know the db has been useful for running some analysis stuff and DWG stuff
18:08:14 <pnorman> I think 256 would be enough to put the flat nodes and slim tables on
18:08:43 <apmon> I'd rather suggest flat nodes and rendering tables, but either should probably fit in 256
18:08:44 <TomH> there are the two spare ones that came out of poldi last week
18:09:13 <TomH> but I'm not sure I want to be putting them in errol if that's what you're thinking
18:09:28 <zere> are they earmarked for somewhere else?
18:09:31 <TomH> that kind of implies that we're blessing one particular user on errol as more important
18:09:55 <TomH> errol is a take-it-as-it-comes everybody gets the same deal machine
18:10:08 <TomH> anything important enough to merit special hardware should be going somewhere else
18:10:14 <apmon> having an osm database on an osm dev server does kind of seem rather important... ;-)
18:10:40 <TomH> well I guess maybe if we define it as a shared resource that makes some sense
18:10:50 <zere> i'd be happy designating it as a "system service" with apmon as its admin, if that makes it conceptually easier to stomach ;-)
18:10:55 <TomH> what sort of "osm database" are we talking about?
18:10:58 <apmon> one can debate which db schema is most appropriate, but having some access to the OSM data on a dev server is imho highly desirable
18:11:13 <zere> at the moment, i believe it's a vanilla osm2pgsql database?
18:11:19 <apmon> yes
18:11:21 <pnorman> it's osm2pgsql with hstore and optional data imported
18:11:31 <TomH> well yes, but you've hit the problem on the head there - whatever schema you choose will only suit a minority of possible users
18:11:31 <apmon> but if more people would use another schema, that would be fine too
18:11:50 <apmon> but with slim mode tables and hstore, the osm2pgsql db is actually fairly flexible and complete
18:12:25 <zere> i don't see it that way. osm2pgsql + hstore + slim + -x pretty much covers most use-cases.
18:12:50 <zere> i guess the only thing missing is the node positions, if it's using flat nodes.
18:13:09 <apmon> for the devserver we could not use flat nodes
18:13:20 <apmon> which is what the wikipedia toolserver does for exactly that reason
18:14:02 <zere> sure, but it's a judgement call on whether the unused (uninteresting) node positions are worth the increased load & storage space. imho, for the large majority of users they're not.
18:14:30 <zere> anyway, having an osm2pgsql database there doesn't preclude us adding other databases (e.g: apidb) at a later date, and if the system can handle it.
18:14:39 <pnorman> to apidb: hah
18:16:13 <zere> ok. i'll see what the deal is with poldi's old SSDs and whether i need to dust off my old 256GB loaner ;-)
18:16:22 <zere> is there anything else anyone wanted to discuss?
18:16:25 <zere> #topic AoB
18:16:27 <pnorman> how about getting the most recent and 2nd most recent planet.pbf files somewhere central on errol?
18:18:12 <zere> could do. i can have a look at scripting that up.
18:18:47 <pnorman> do we have any idea of a timeline for the switch to carto?
18:19:45 <zere> i think there were some upgrades to go into orm, but after that we could switch.
18:20:07 <zere> i'd really like to know what we're getting into, though.
18:20:40 <zere> if it's still the case that it's a 30% slowdown, then even with a second machine, we would only expect to be able to render 20% more tiles than we used to.
18:20:55 <pnorman> but we'd actually be able to edit the style
18:21:20 <zere> taking into account that they won't share a disk cache, and will duplicate effort, it's probably a real terms slowdown from where we are now.
18:22:16 <apmon> Can we not set it up in a way to share disk cache?
18:22:30 <zere> yup. i'm aware of the positives - i'm just also aware that no-one really knows the negatives ;-)
18:22:56 <zere> apmon: possibly. what were you thinking?
18:23:28 <apmon> I guess one issue is that the two machines are in different datacenters?
18:23:40 <zere> bearing in mind that orm & yevaud are at different sites, but JANET is pretty damn good.
18:24:11 <apmon> It depends a bit on where the bottle neck is. Rendering CPU performance, DB access, or simply IO performance to read cached tiles from disk
18:24:26 <zere> i've used NFS as a tile store before. that was a bad idea ;-)
18:24:49 <apmon> I'd like to find out how well a rados store works.
18:25:11 <apmon> but network access only adds about 1ms of latency, compared to 10ms for a single local disk access
18:25:34 <apmon> main issue is I guess to lose local ram buffer cache
18:27:00 <apmon> I think you can use e.g. glusterfs in a way to write to it through glusterfs, which then mirrors tiles to both servers, and then read it directly through local FS
18:27:26 <zere> re latency: depends on how choked the network is with tile traffic...
18:28:16 <apmon> I guess UCL still only has 100Mbs, which might be an issue for read traffic
18:28:31 <apmon> for write traffic it would be fine
18:28:43 <apmon> So if one can set it up in a way to mirror all tiles and then always use the local mirror, things should be fine
18:29:22 <pnorman> tiles are sdb and sdc on yevaud, they sum a max of 550 read IO/sec
18:29:35 <zere> you'd need to mirror meta-information too, wouldn't you? or risk different expiry information?
18:31:43 <pnorman> write is 100 IO/sec. average request size is something like 20 KB for both, so we'd basically saturate a 100 Mbps connection if you put all the /tiles and /tiles-low traffic on the network
18:32:08 <zere> and the throughput sums to 10MByte/s over sdb & sdc
18:32:41 <apmon> The metadata is in the file times, so if all writes are mirrored the metadata would also be mirrored
18:32:50 <pnorman> would moving orm to IC help? is it an option?
18:33:06 <zere> as long as "touch" counts as a write, and all writes are totally ordered ;-)
18:33:59 <zere> moving orm to UCL or yevaud to IC would be possible, but would cut our total network throughput.
18:35:03 <zere> my hunch is that the vast majority of renders are re-renders due to updates. i might be wrong.
18:35:23 <zere> duplicating that at both sites seems like a waste, but the alternative is to use up bandwidth doing that...
18:35:55 <zere> maybe it's best just to try it and see what happens.
18:36:20 <pnorman> over the last week it's about 1.8 priority request, 1.0 dirty, .7 request
18:37:05 <pnorman> those correspond to rendered and not in cache, manual /dirty, rendered and in cache, right?
18:39:30 <zere> all those manual /dirtys? wow.
18:39:56 <zere> i don't actually know what those queues mean. apmon, has pnorman got the interpretation right?
18:42:33 <apmon> dirty also is the overflow queue for render requests
18:42:51 <apmon> priority and request are both limited to 32 slots
18:43:13 <apmon> those are the queues that attempt on the fly rendering
18:43:43 <apmon> for dirty queue mod_tile doesn't try and wait for the results, as it likely won't happen in an acceptable time
18:44:16 <pnorman> anyways, what does this mean for disk bandwidth?
18:44:19 <apmon> so when ever the server can't keep up with requests, the majority will go into the dirty queue
18:45:12 <pnorman> also, if we had a shared FS then we'd have a case where one machine going down takes out both renderers, right?
18:45:41 <apmon> The combined write traffic is at < 2MB (~20Mbit)
18:45:53 <zere> but it looks like at least half the requests go to "priority" which means that either the tile has never been rendered before, or it got removed in one of the cron cleanups?
18:46:07 <apmon> not if they are raid 0 style with all reads going to the local mirror
18:46:31 <apmon> yes, loads of pacific z18 tiles that get viewed once
18:47:15 <pnorman> wouldn't that be raid1?
18:47:32 <zere> yeah, but i think we both understood the point ;-)
18:48:07 <apmon> sorry, yes raid 1
18:48:28 <pnorman> in that case it makes sense to me, and the write traffic isn't nearly as bad
18:49:50 <pnorman> 8 Mbps outgoing from each machine to the other, assuming its evenly split
18:50:28 <zere> would be interesting to see if that worked. i guess it would depend on timing to some extent: assuming the load balancing between them is random, then two requests for tiles in the same view needing render, but in the same metatile, may well go one to each machine
18:50:39 <zere> they'd render, then try to write at approximately the same time.
18:52:16 <apmon> you can use a central rendering queue
18:52:24 <pnorman> then about 35 Mbps going out from each. so from a network perspective it makes sense, but man, we use a lot of bandwidth
18:52:31 <apmon> then the de-duplication of rendering requests is done in that central queue
18:53:16 <apmon> one could possibly also simply go to a multi level caching hierarchy
18:53:30 <apmon> i.e. keep all of the rendering and master store at UCL
18:53:39 <apmon> then use orm as a mid level cache
18:53:44 <zere> yeah. i wonder if it makes sense to move rendering + db off to completely separate machine(s), and have tile storage as a "write-through" cache.
18:53:55 <apmon> i.e. all other caches use orm as the parent and orm then uses yevaud as a parent
18:54:17 <zere> i think that's the way it was set up for a long time
18:54:34 <apmon> don't all caches hit yevaud directly?
18:55:03 <pnorman> I tried to explain the levels of caching to someone in #osm once with the CDN, on disk, multiple render queues, etc
18:55:16 <zere> Firefishy is the one who'd know, but i think it switched to multi-level via orm for a while, then to multi-parent.
18:55:35 <apmon> ah, nice
18:55:57 <pnorman> and what's getting upgraded on orm before its ready?
18:55:59 <apmon> although cache expiry gets complicated, as I don't think any of renderd's feature for htcp are used
18:56:48 <apmon> Is Firefishy going to SotM-US?
18:56:54 <zere> pnorman: disks + ram
18:57:02 <zere> apmon: i believe so, yes.
18:57:04 <apmon> are you?
18:57:15 <zere> it looks unlikely at this point
18:57:19 <apmon> Perhaps we could sit together and try and hash out a workable plan
18:58:50 <zere> there's certainly a lot to try and get done.
18:59:00 <zere> and i notice that we're way over time...
18:59:12 <pnorman> so does my stomach, haven't had breakfast yet.
18:59:32 <pnorman> maybe take this discussion to tile-serving@?
18:59:39 <zere> ditto mine, but for dinner.
19:00:44 <zere> cool. thanks to everyone for coming, and see you next week!