Archive for February, 2023

Grooming the backlog, planning, and starting March sprint

February 28, 2023

Today I groomed the backlog a bit for the sandbox project, closed the last sprint, and started a new one.

Previous Goals

The goals for the February sprint for the sandbox project were:

  • Integrate Payments end to end
  • Secure Coturn
  • Make inroads on dealing with comms memory leaks

I made partial progress on two of these.

New goals

The goals for the March sprint for the sandbox project will be:

  • Integrate Payments end to end.
  • Switch coturn for pion/turn.
  • Start prototyping comms golang rewrite.

February retrospective

February 27, 2023

General morale

Generally optimistic. As I write this on the 24th of September 2022, things have been progressing well.

All the ways of a man are pure in his own eyes,
    but the Lord weighs the spirit.
Commit your work to the Lord,
    and your plans will be established.

Proverbs 16:2-3, ESV UK

What went well?

This month I:

  • Implemented a way to test memory leaks in comms,
  • Set up datadog with comms,
  • Looked into coturn, built it locally, and attempted to resolve a security issue with it, then opted to switch to github.com/pion/turn instead.

Progress was slightly slow this month, largely due to wrestling with third-party codebases and seeking to understand them.

What didn’t go so well?

Things were a bit slow this month, but I did do a fair amount of careful thinking about how to deal with memory issues in comms, as well as some basic security concerns.

I’ve realised that I likely face a significant setback: for comms to do what I want, I would need to either

1) plug a large number of antipatterns + memory leaks,

2) convert it to docker & deploy via kubernetes, and

3) likely spend a small fortune on k8s replicas to service a relatively tiny number of requests,

or else strongly consider rewriting it in Golang. I may still try to do the above, but I think it is rational at this point to start investigating a rewrite and implementing some basic functionality in one, so I will start looking into doing these things in parallel.

Another matter with a mixed outlook was the sunk cost on coturn. I did, however, make some inroads that should inform some issues I have with compiling protongraph, and at least now I know that I should switch to a different, more modern turn server.

What’s the outlook?

Reasonable.

My basic plan is:

  • Start a prototype Golang comms rewrite while chipping away at the technical debt with the existing Nodejs comms,
  • Plumb payments from the client through to the organisation service (which has a link to Stripe),
  • Swap out Coturn with Pion/Turn.

The road ahead

Per the roadmap, and expanding slightly upon the points from there:

Near term (from now through to the end of the April 2023 sprint).

  • Integrate payments with the UI. Although I’ve established a connection between the backend and the Stripe payment processor gateway service, I still need to integrate this with the UI. This should be moderately straightforward to do now: I basically just need to ensure that I hit the correct routes on the Organisation service from comms, invoked via the UI – and facilitate a semi-reasonable UX for it using control nodes. (March)
  • Plug memory leaks in comms. It would be good to make a start on this. (February -> April)
  • Secure turn server. Swap out coturn with pion/turn. (March)
  • Deployment of services. Terraform code for Tokengraph and the Protongraph-Provider. (April)
  • Implement basic limits. Make a start on implementing basic limits. (May -> June)
  • Improve development environment. The dev environment is starting to become a hassle; I’ll look towards containerising a few things and improving my bash scripts for local development. (April -> May)
  • Create several basic generators. Sketch several tpgns for basic testing (forest, city, road) and add these to the palette. (For “city” this won’t use the city generator service, which is a much more ambitious planned undertaking for 2024; one should hopefully be able to get away with a more primitive approach for now, using native Tokengraph features.) (April)

Not so near term (from the May 2023 sprint onwards through to December 2023, i.e. likely Q1 ’23 -> Q2 ’23 in real time).

  • Procgen CRUD actions. Can move / rotate / delete procedurally generated object groups. (May -> July)
  • Deployment of services. Deployment of everything that remains, leveraging terraform and other techniques. (May -> June)
  • Improve avatar / shadow functionality. There are a few bugs associated with this logic; increase test coverage in the client around this feature, improve the readability and correctness of the code, and fix any identified issues. (May -> June)
  • Resolve memory leaks in comms. Close out all identified leaks in comms so that one can be 95% confident that all the key ones are plugged. (May -> August)
  • Enforce basic limits. Complete the work on implementing basic limits (with different limits depending on free or paid organisation account status). (May -> August)
  • UI / UX polish. Continue to chip away at improving the UI within the client. (May -> August)
  • Basic test coverage for client. Test coverage for the client in place, and client coverage at 20% overall. n.b. the tooling to measure gdscript test coverage leveraging GUT doesn’t currently exist, but hopefully it will by the time I get around to looking into it again. (August -> October)
  • Basic test coverage for book-keeping subsystem services. Rudimentary test harnesses in place to focus on controllers within the instance service, the user service, the organisation service, and the payment service respectively. Test coverage hovering at about 10% for each of these services, and ideally pushing past 20%. (July -> September)
  • Basic content available for use. Basic content available to users in palette (users should have a reasonably varied range of options to choose from in order to populate their instances). In particular, manually add, wire up and configure a basic selection of canned assets, and try to architect things with an eye for future generalisation and extensibility. (June -> August)

Deferred

  • Fix previews in the procgen engine UI. The original alpha implementation of Protongraph had a feature wherein one could preview procedurally generated results, i.e. the output of the datagraph (tpgn file). To fix this without transmitting information over the wire, I will need to figure out how to work around the gatekeeper limitations in macos if applicable, fix compilation of the third party mesh_optimizer library (https://github.com/zeux/meshoptimizer), and introduce a workflow wherein, if the standalone application accesses a tpgn file, it makes default assumptions about the location of assets; alternatively, there might be a metadata attribute in the tpgn that describes, as relative paths, where to find the relevant assets. (Deferred, 2024)
  • Instance configurability. (Not complex, but not absolutely necessary for a prototype, so kicking this down the road.) Try to improve the user experience in an instance, e.g. by making the size of it configurable (and have this tied to limits for an organisation, gated by subscription level). See if a terrain mesh can be set for an instance too, maybe? Probably not too complicated, but having a degree of basic configurability before creating an instance would be good. (Deferred, 2024)
  • Parametrised procgen + improved procgen experience. Probably not a top priority for the initial prototype, so I’ll punt this further down the road. Certainly, though, having configurable generation within the UI of things like cities is something I’m very keen to implement. Under this milestone is introducing the abstraction of a ServiceNode type in tpgn graphs for calling out to separate services (like the City Generator Service, based loosely perhaps on this). (Deferred, 2024).
  • Recomputation of procgen object group. Can recompute a procedurally generated object group with different parameters (maybe out of scope, to be decided). (Deferred, 2024).
  • Procgen placement previews. Mouse over preview of where things will be moved / rotated to. This should be possible by judicious use of collision masks (because I don’t want avatared tokens to be blocked by pending clipboard pastes, if I end up networking the preview view), as well as using ray tracing to detect a collision as to where the token or object group will go – the same way I place these things currently, just not “locked in”. (Deferred, 2024).
  • Marketplace. Allow users to upload their own assets / procgen algorithms etc. Maybe support a marketplace. (Deferred, 2025).

Summary

The setback I’m facing should hopefully not slow things down too much, and I should still be able to get to alpha by June 2023 in actual time.

Narrowing things down along the lines of this post, the main pieces of complexity remaining to solve for are these:

  • Procgen CRUD actions (~ 20 points)
  • Rewriting comms in Golang to make it a robust production service (~ 100 points)
  • Payment integration with the UI (~ 10 points)
  • Limits (~ 20 points)

One of these (payments) should be more or less done by the end of the next sprint.

The 150 points should be actionable in theory. Assuming that I end up needing to do another 200 points, and that I get through about 20 points per sprint on the lower side of things, I will need 10 more sprints to get things to an “alpha ready” state. That basically takes me through to the end of the December 2023 sprint.

Taking into account an additional margin of 2 months on top of that, I should conservatively get to alpha by the end of the February 2024 sprint, i.e. another 12 sprints. If I can increase my lead by another 3 months (from the existing lead of 5 months to 8) by June 2023 in actual time, which is potentially doable, I should be alpha prototype ready by the end of June 2023 in actual time.

Anyway we’ll see how it all pans out in the wash.

Pion turn instead of coturn?

February 26, 2023

I’ve continued looking at coturn and, although I managed to get unauthenticated stun requests working locally again, I continue to be perplexed by how to do security with it. Looking on the web I found this alternative written in golang: https://github.com/pion/turn . It’s a bit more modern and, although fairly low-touch at present and less battle-tested, it potentially has the features I am after – namely, security for a turn server.

With regards to coturn, --secure-stun, a flag that is apparently supposed to secure coturn, doesn’t work, and is apparently “blocked” on this mozilla bug. I’m not sure that I buy that, though.

The coturn code is very complex, very old, and very messy, i.e. legacy code. Which makes sense, as it is an old project. It is old on github, and existed for many years before that – additionally, as I understand it, the current maintainers are not the original developers who wrote the project. I’ve worked on legacy systems myself before. I know how difficult they can be to refactor and change, to tidy up and to modernise. Legacy systems are often battle-tested and widely used, and often the existing userbase just wants basic maintenance patches; they don’t want any major improvements – or any form of significant change at all, really. So I empathise with the pain faced by the current maintainers.

Certainly though, services written in legacy code are not necessarily a good choice for a greenfield system, such as what I am building.

Just because something is legacy, however, should not necessarily be a show-stopper to using it in a new project. And I did have a good go at it. But disabling the stuns and stun lines in “get_default_protocol_port” within ns_turn_utils for coturn doesn’t block unauthenticated stun as I wished – and there seems to be no obvious documented or supported path to making authenticated requests, or even an established pattern for securing coturn at all, despite multiple questions over many years from developers in the coturn github issues, which have been more-or-less unilaterally dismissed by the maintainers. Very confusing and a tad perplexing.

I’m put in mind of one of the seasoned developers I know, who told me once: “If a tool doesn’t do what you need, it is not incumbent on you to fix it.” I guess my slightly softer take on this is: if a tool doesn’t do what I need, and it is not moderately straightforward to fork and fix, then maybe it is worthwhile looking into using a different tool.

The pion/turn docs on secured requests do look very straightforward. I might try giving this project a spin and see how I “go”.
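
For reference, here’s roughly what a secured setup looks like with pion/turn, following the long-term credential example in its docs – a minimal sketch only, with a placeholder realm, username, password and relay IP rather than anything from my actual configuration:

package main

import (
    "log"
    "net"

    "github.com/pion/turn/v2"
)

func main() {
    // UDP listener for STUN/TURN traffic on the standard port.
    udpListener, err := net.ListenPacket("udp4", "0.0.0.0:3478")
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }

    // Placeholder long-term credentials; real ones would come from config.
    users := map[string][]byte{
        "exampleuser": turn.GenerateAuthKey("exampleuser", "example.org", "examplepass"),
    }

    server, err := turn.NewServer(turn.ServerConfig{
        Realm: "example.org",
        // Every request must authenticate; unknown users are rejected.
        AuthHandler: func(username, realm string, srcAddr net.Addr) ([]byte, bool) {
            key, ok := users[username]
            return key, ok
        },
        PacketConnConfigs: []turn.PacketConnConfig{
            {
                PacketConn: udpListener,
                RelayAddressGenerator: &turn.RelayAddressGeneratorStatic{
                    RelayAddress: net.ParseIP("127.0.0.1"), // public IP in a real deployment
                    Address:      "0.0.0.0",
                },
            },
        },
    })
    if err != nil {
        log.Fatalf("failed to start turn server: %v", err)
    }
    defer server.Close()

    select {} // run until killed
}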

Resolved a number of issues

February 25, 2023

Today I:

  • Fixed the compilation issue. Turns out that I was running ./configure with the gcc-build-coturn image mounted, but then was running make without it! Rookie mistake, easily fixed. There is still an ar -a complaint; maybe an autoreconf or autoconf could fix that.
  • Fixed the certificate issue. This was moderately straightforward: I just needed to install a few more libraries in the coturn image, run an openssl command, and set a couple of lines of configuration. Easy peasy.
  • Removed the -u turnserver part in the CMD docker command in the Dockerfile to fix an error.
  • Resolved another issue regarding a default setting to facilitate telnet testing.

However, the WebRTC connection from the client is not working for stun or turn, on either port 5349 or 3478. Puzzling. I suspect some sort of caching issue, because the test node process can connect easily enough via stun on port 3478. Confusingly, the remote coturn process doesn’t seem to work for turn connections now either, so I think it is definitely some form of caching issue. Whether that is in the WebRTC module within the client or elsewhere will require investigation and further research.

Maybe I can test turn connections with username and a credential by leveraging telnet?
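
As an alternative to telnet, I could exercise the credential handshake directly from a small Go program using the pion/turn client (the library mentioned in the post above) – a rough sketch only, with placeholder credentials standing in for whatever ends up in turnserver.conf:

package main

import (
    "log"
    "net"

    "github.com/pion/logging"
    "github.com/pion/turn/v2"
)

func main() {
    // Local UDP socket for the TURN client to use.
    conn, err := net.ListenPacket("udp4", "0.0.0.0:0")
    if err != nil {
        log.Fatalf("failed to open socket: %v", err)
    }
    defer conn.Close()

    client, err := turn.NewClient(&turn.ClientConfig{
        STUNServerAddr: "127.0.0.1:3478", // the local coturn container
        TURNServerAddr: "127.0.0.1:3478",
        Conn:           conn,
        Username:       "exampleuser", // placeholder credentials
        Password:       "examplepass",
        Realm:          "example.org",
        LoggerFactory:  logging.NewDefaultLoggerFactory(),
    })
    if err != nil {
        log.Fatalf("failed to create client: %v", err)
    }
    defer client.Close()

    if err := client.Listen(); err != nil {
        log.Fatalf("failed to listen: %v", err)
    }

    // Allocate requires authentication, so a relayed address coming back
    // means the username/password handshake with the server worked.
    relayConn, err := client.Allocate()
    if err != nil {
        log.Fatalf("allocation failed: %v", err)
    }
    defer relayConn.Close()

    log.Printf("relayed address: %s", relayConn.LocalAddr())
}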

So still to resolve:

With regards to the malformed archive problem, it appears to be due to the fact that I have one or two packages that are “thin” libraries, and some are in ELF 32 vs ELF 64 format when being linked together with the ar tool (see here). One workaround is to run the ar tool on each .a file individually (see here and also here).

For the record, these commands convert thin libraries into normal ones (https://stackoverflow.com/questions/25554621/turn-thin-archive-into-normal-one):

for lib in `find . -name '*.a'`;
do ar -t $lib | xargs ar rvs $lib.new && mv -v $lib.new $lib;
done

https://bugs.chromium.org/p/webrtc/issues/detail?id=5022#c8

This probably doesn’t need to be fixed; maybe I can give it a miss, as something similar cropped up when building protongraph.

Fixing coturn shenanigans

February 24, 2023

Today I:

  • Tested whether logs are being written. They are not. Requires more investigation, evidently.
  • Added start, stop and debug scripts.
  • Looked into an old PR https://github.com/coturn/coturn/pull/67/files for adding configuration. Still didn’t work! Puzzling!
  • After just commenting out stuns and stun (which was the whole point of turn-only), rebuilding, and then testing with coturn-test (the small node service), stun didn’t work. However, turn:0.0.0.0:3478 with username:password did – which does (more or less) what I set out to do.

So this is probably good enough, though I’d still like to sort out logs. And I am a bit nerd-sniped by the fact that turn-only is “bad configuration” – why is that? After renaming it to use-turn-only, still the same problem. A bit stumped!

Setting “verbose” in turnserver.conf gives logs, but now stun and turn don’t seem to be working in my client … confusing. Probably a cache thing; I’ll look into this again tomorrow.

So, still to-do:

  • First and foremost, prototype a turn-only connection
  • Fix certificates in docker container
  • Figure out bad configuration + hook up options properly
  • Proper compilation of the service, maybe leveraging SConstruct. For some reason parts of my macos filesystem are being leaned on for dependencies like OpenSSL, and things are falling over when looking for linux-only headers like systemd/sd-daemon.h. I shouldn’t be encountering these issues; everything on build should be relative to my gcc build image, which is a linux container.

A slight setback: planning a comms rewrite

February 23, 2023

I’ve been reflecting on the comms memory leak issue, and, although it is possible that I could address some of the anti-patterns that have led to this matter, I think the underlying issue would remain – comms needs to be a highly performant service, and it needs to be robust from a garbage collection and memory usage point of view.

For this reason, I think it might be necessary to rewrite it in a language that lends itself to better performance and robust garbage collection. Choices for this could be C++, Rust, or Golang.

I haven’t programmed in Rust, and C++, although very modern these days, has downsides of its own, particularly around memory leaks. Although one can resolve such issues in C++ with libraries such as Oilpan, I think the better choice is Golang. Besides, many of the systems I’m deploying (a couple of the consumers, coturn, the client) are already written in C++, so some variety would be useful from a modernisation point of view.

However, I probably won’t go “all-in” on a rewrite just yet. But I think starting to explore what such a service could look like, to the point where it can facilitate Coturn OAC handshakes, is probably a good idea. I’ll aim to look into this over the next couple of sprints.
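
To make that exploration concrete, the sort of skeleton I have in mind is a goroutine-per-connection listener on top of the Go standard library – purely a hypothetical starting point for the prototype (the port and echo behaviour are placeholders), not the actual comms protocol, which would still need to carry the signalling messages the node.js service handles today:

package main

import (
    "bufio"
    "log"
    "net"
)

func main() {
    // Listen on an arbitrary local port for the prototype.
    ln, err := net.Listen("tcp", ":9000")
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    log.Println("comms prototype listening on :9000")

    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Printf("accept: %v", err)
            continue
        }
        // One lightweight goroutine per connection; its memory is reclaimed
        // by the garbage collector once the connection closes.
        go handle(conn)
    }
}

func handle(conn net.Conn) {
    defer conn.Close()
    scanner := bufio.NewScanner(conn)
    for scanner.Scan() {
        // Echo each line back for now; a real service would parse and
        // route signalling messages here.
        if _, err := conn.Write(append(scanner.Bytes(), '\n')); err != nil {
            return
        }
    }
}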

I guess the learning here is that node.js is a good option for quickly prototyping a system, but perhaps slightly suboptimal when aiming to build a robust, high-throughput production service.

Backtracking a bit

February 22, 2023

So I found the docker images shipped with coturn not entirely helpful, as I couldn’t build them straight out of the box. It seems that docker-compose is the name of the game with coturn, but I don’t want to run 5 or 6 containers; I just want to run one.

I had another look and started playing around with a small node service in order to check the connection. Things work with the following code, leveraging this package: https://www.npmjs.com/package/stun.

const stun = require('stun');
// stun.request('stun.l.google.com:19302', (err, res) => {
//     if (err) {
//       console.error(err);
//     } else {
//       const { address } = res.getXorAddress();
//       console.log('your ip', address);
//     }
// });

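// Request a STUN binding from the locally running coturn on port 3478
// and log the reflexive (XOR-mapped) address it reports.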
stun.request('0.0.0.0:3478', (err, res) => {
    if (err) {
      console.error(err);
    } else {
      const { address } = res.getXorAddress();
      console.log('your ip', address);
    }
});

Sure enough, with stun:0.0.0.0:3478 the client could connect too, so evidently coturn is doing what it should. Port 5349 doesn’t work though, i.e. it doesn’t facilitate networking properly; I think that’s because I have an issue with certificates. Of course I should double-check turn:0.0.0.0:3478 with username:password before proceeding.

I did need to expose the ports via docker run -p 3478:3478 -p 3478:3478/udp -p 5349:5349 -p 5349:5349/udp -p 49152-50000:49152-50000/udp coturn; I might want to script that properly. I should also script jumping into the running container and/or stopping it.

I’m not getting logs generated on connection though, and turn-only is still “bad configuration” despite my best efforts.

And I will want to front the whole thing with nginx eventually when I deploy it via terraform, as it will be running in docker on the target VM.

So: certificates, logs, new configuration not working, helper scripts. And the question of modifying the existing terraform deploy script so as to leverage nginx and docker, but that can wait for now.

Regarding logs, probably what is happening is that the user running the coturn process doesn’t have permission to write to the log file. I am now creating a user called ‘turnserver’ and giving it those permissions, but I’m not sure I’m running the executable as that user. It should be easily fixed by inserting -u turnserver in the appropriate spot:

CMD ./usr/coturn/bin/turnserver -c /etc/turnserver.conf -u turnserver --pidfile /run/turnserver/turnserver.pid

Certificates are not critical, as the whole thing will be fronted by nginx eventually.

The new configuration matter is a bit more serious; it would be helpful to unpack why things are not working properly there.

RTFC, a docker coturn discovery

February 21, 2023

Today I discovered that, well, there is already code written to do more or less what I want: https://github.com/coturn/coturn/tree/7486e503748b8d8ca45b4e555db7ff62ec850ffd/docker/coturn .

All I need to do is to hack on top of that.

Tomorrow I’ll pick up the pieces of what I’m working on and switch to leveraging either the alpine or debian docker builds.

Various findings including Godot 4.0 beta1

February 20, 2023

Godot 4.0 beta milestone reached

  • Godot 4 master was bumped to beta
  • Some coverage of the first Godot 4.0 beta from Game from Scratch: “The wait for Godot 4 just got shorter, with the first official beta released and a feature freeze. What you see in this beta should be pretty much the same as what ships with Godot 4. Of course if you encounter any bugs or problems, be sure to report them. The all new Vulkan based renderer is a key feature of Godot 4, although there are literally hundreds of other new features.”
  • The WebRTC plugin is being updated for Godot 4.0, which is good news for developers who happen to be building something (like a sandbox project) that relies on this functionality.
  • It appears that custom resources are going to be in Godot 4.0 beta 2.

Wendelstein news, DeepRoute.ai, jszmq, + other miscellaneous

  • Level 4 autonomous vehicles apparently are now a thing (via DeepRoute.ai). (Possibly more impressively, in the following video it looks like an explainer AI system may have been running on top of the underlying vehicle management system in real time.)

Continued attempts at getting coturn to work locally

February 19, 2023

Today I continued plugging away at trying to get coturn to work locally. I discovered in particular that seemingly any port works for stun or turn locally, so evidently my initial success was perhaps not quite what it seemed!

I managed to fix a couple of problems, such as connecting to sqlite within the docker process. I also spent a large amount of time trying to fix an “openssl/md5.h” not found error. I discovered in testing that ‘turn-only’ is treated as bad configuration:

Bad configuration format: turn-only

Overall results are as yet inconclusive.

Update: having continued to wrestle with compilation a bit more, I’ve started to conclude that maybe I should look into introducing scons tooling to compile, following my experiences with protongraph. In particular, I shouldn’t have to mount paths to libraries in osx if I’m compiling for linux; doing so results in nonsense such as needing to get systemd/sd-daemon.h from osx, which is of course a linux-only thing.