Archive for January, 2023

Grooming the backlog, planning, and starting February sprint

January 31, 2023

Today I groomed the backlog a bit for the sandbox project, closed the last sprint, and started a new one.

Previous Goals

The goals for the January sprint for the sandbox project were:

  • Add redis cache to Tokengraph
  • Implement temporary join codes for instances
  • Implement basic account limits

I succeeded on the first two objectives but not the third. In retrospect I believe it is too early to implement account limits; I’d really like to get payments working end to end first.

New goals

The goals for the February sprint for the sandbox project will be:

  • Integrate Payments end to end
  • Secure Coturn
  • Make inroads on dealing with comms memory leaks

January retrospective

January 30, 2023

General morale

A Parrot of Spring Dawning*

A bit morose but generally optimistic. As I write this on the 27th of August 2022 things have been progressing well.

What went well?

This month I:

  • Implemented a redis cache for Tokengraph,
  • Removed the use of in-memory objects from Tokengraph,
  • Fixed streaming of data throughout the procedural generation pipeline,
  • Added the ability to create a shareable link to an instance from within the instance as well as from the instance selection screen.

It has been a fairly productive sprint; I am certainly satisfied with the amount of progress coming out of it.

What didn’t go so well?

Things went pretty well actually.

What’s the outlook?

Reasonable.

My basic plan is:

  • Start to action the “plug comms memory leaks” epic,
  • Plumb payments from the client through to the organisation service (which has a link to Stripe),
  • Front the Coturn server with AWS API Gateway. This is important to secure the service.

After I have a rudimentary payments workflow in place I can then start to think about implementing limits: overall CCU (across all instances), instance CCU (for a single instance), rate limits on calls to the creative subsystem, limits on instance size (if / when configurable), and maybe organisation seats.

The road ahead

Per the roadmap, to expand slightly upon the points from here:

Near term (from now through to the end of the April 2023 sprint).

  • Integrate payments with the UI. Although I’ve established a connection between the backend and the Stripe payment processor gateway service, I still need to integrate this with the UI. This should be moderately straightforward to do now: basically I just need to ensure that I hit the correct routes on the Organisation service from comms invoked via the UI – and facilitate a semi-reasonable UX using control nodes for it. (February)
  • Plug memory leaks in comms. It would be good to make a start on this. (February -> April)
  • Secure coturn. Front coturn with AWS API Gateway. (February)
  • Deployment of services. Terraform code for Tokengraph and the Protongraph-Provider. (March)
  • Implement basic limits. Make a start on implementing basic limits. (March -> April)
  • Improve development environment. The local setup is starting to become a hassle, so I’ll look towards containerising a few things and improving my bash scripts for local development. (March -> April)
  • Create several basic generators. Sketch several tpgns for basic testing (forest, city, road) and add these to the palette. (For “city” this won’t use the city generator service, which is a much more ambitious planned undertaking for 2024; one should hopefully be able to get away with a more primitive approach for now, using native Tokengraph features.) (April)

Not so near term (from the May 2023 sprint onwards through to December 2023, i.e. likely Q1 ’23 -> Q2 ’23 in real time).

  • Procgen CRUD actions. Can move / rotate / delete procedurally generated object groups. (May -> July)
  • Deployment of services. Deployment of everything that remains, leveraging terraform and other techniques. (May -> June)
  • Improve avatar / shadow functionality. There are a few bugs associated with this logic; I’ll increase test coverage in the client around this feature, improve the readability and correctness of the code, and fix any identified issues. (May -> June)
  • Resolve memory leaks in comms. Close out all identified leaks in comms so that one can be 95% confident that all the key ones are plugged. (May -> August)
  • Enforce basic limits. Complete the work on implementing basic limits (with different limits depending on free or paid organisation account status). (May -> August)
  • UI / UX polish. Continue to chip away at improving the UI within the client. (May -> August)
  • Basic test coverage for client. Test coverage for the client in place, and client coverage at 20% overall. n.b. the tooling to measure gdscript test coverage leveraging GUT doesn’t currently exist, but hopefully it will by the time I get around to looking into it again. (August -> October)
  • Basic test coverage for book-keeping subsystem services. Rudimentary test harnesses in place to focus on controllers within the instance service, the user service, the organisation service, and the payment service respectively. Test coverage hovering at about 10% for each of these services, and ideally pushing past 20%. (July -> September)
  • Basic content available for use. Basic content available to users in palette (users should have a reasonably varied range of options to choose from in order to populate their instances). In particular, manually add, wire up and configure a basic selection of canned assets, and try to architect things with an eye for future generalisation and extensibility. (June -> August)

Deferred

  • Fix previews in the procgen engine UI. The original alpha implementation of Protongraph had a feature wherein one could preview procedurally generated results, i.e. the output of the datagraph (tpgn file). To fix this without transmitting information over the wire, I will need to figure out how to work around the gatekeeper limitations in macOS if applicable, fix compilation of the third party mesh_optimizer library (https://github.com/zeux/meshoptimizer), and introduce a workflow wherein, if the standalone application accesses a tpgn file, it makes default assumptions about the location of assets; alternatively there might be a metadata attribute in the tpgn that describes the locations of relevant assets as relative paths. (Deferred, 2024)
  • Instance configurability. (Not complex but not absolutely necessary for a prototype, so kicking this down the road). Try to improve the user experience in an instance, e.g. by making the size of it configurable (and have this tied to limits for an organisation gated by subscription level). Maybe see if a terrain mesh can be set for an instance too? Probably not too complicated, but having a degree of basic configurability before creating an instance would be good. (Deferred, 2024)
  • Parametrised procgen + improved procgen experience. Probably not a top priority for the initial prototype, so I’ll punt this further down the road. Certainly, having configurable generation of things like cities within the UI is something I’m keen to implement. This milestone also covers introducing the abstraction of a ServiceNode type in tpgn graphs for calling out to separate services (like the City Generator Service, perhaps based loosely on this). (Deferred, 2024).
  • Recomputation of procgen object group. Can recompute a procedurally generated object group with different parameters (maybe out of scope, to be decided). (Deferred, 2024).
  • Procgen placement previews. Mouse-over preview of where things will be moved / rotated to. This should be possible by judicious use of collision masks (because I don’t want avatared tokens to be blocked by pending clipboard pastes, if I end up networking the preview view), as well as ray tracing to detect where the token or object group will go – the same way I place these things currently, just not “locked in”. (Deferred, 2024).
  • Marketplace. Allow users to upload their own assets / procgen algorithms etc. Maybe support a marketplace. (Deferred, 2025).

Summary

In short, I think I’m getting close to the end of the beginning. Evidently the above is a slightly lengthy list, but to narrow things down along the lines of this post, the main pieces of complexity remaining to solve for are these:

  • Procgen CRUD actions
  • Memory leaks in Comms
  • Payment integration with the UI
  • Limits

One of these (payments) should be more or less done by the end of the next sprint.

* Not to be confused with a Parrot of Summer Flame

Creating + breaking down comms memory leak epic

January 29, 2023

Today I had a look at the current sprint and the next couple of planned sprints, and realised there were a few things there that I didn’t want to prioritise at present. So I moved a few things out, and a few things in.

To speak indirectly regarding what I moved in, one of the things I’d like to focus on next from a technical debt perspective is finding and plugging the memory leaks in comms. As mentioned earlier here and here, there are a few in the system that ideally shouldn’t be there.

Certainly having this emitted by comms while testing:

<--- Last few GCs --->

[88283:0x7fcce8008000]   665654 ms: Scavenge 1888.0 (1924.0) -> 1888.0 (1924.0) MB, 39.2 / 0.0 ms  (average mu = 0.976, current mu = 0.999) allocation failure
[88283:0x7fcce8008000]   665725 ms: Mark-sweep 1899.8 (1935.9) -> 1003.3 (1045.1) MB, 44.8 / 0.0 ms  (+ 0.2 ms in 5 steps since start of marking, biggest step 0.1 ms, walltime since start of marking 71 ms) (average mu = 0.983, current mu = 0.991) finalize

<--- JS stacktrace --->

FATAL ERROR: invalid table size Allocation failed - JavaScript heap out of memory

is not ideal.

I had a task or two around this general theme, so I consolidated things a bit and decided to focus on a few key areas:

  • Unit tests for the classes and class methods, increasing overall test coverage in increments of 5%,
  • Unit tests for a separate “leakage” coverage test harness, with a focus on test coverage for class methods only,
  • Work to pinpoint and remove any in-memory dictionaries, arrays, or objects that are prone to making comms leak.

Breaking things down is helpful, and I feel that this theme of work for the next few sprints is now much better defined and has a more concrete plan of attack. Indeed, this is one of the primary remaining bugbears from a technical debt perspective. Once this is out of the way, I guess there are questions regarding test coverage in the client, but bugs in comms that can lead to termination of the process are more serious and should be dealt with first.

Removed peers in remote manager using redis calls

January 28, 2023

Today I succeeded in removing the _peers dictionary in remote manager using redis calls.

A good day!

Fixed parse of metadata

January 27, 2023

In typical anti-climactic fashion, I found that I wasn’t dealing with the parsed stringifiedMetadata information properly. Once I fixed that, things worked!

So that concludes the work with Redis; now Tokengraph is not using an in-memory dictionary to keep track of job information – which is a good thing.

There might be other dictionaries in Tokengraph that bear scrutiny, but I think this is a good starting point. Yes, there is a place where I set clients to a dictionary in remote_manager.gd which could bear fixing … probably one or two other places too.

But I’ve set out the core of what I originally wished to achieve with this, so I’ll create a follow-up task for a later sprint and declare victory for now.

Fixed read from Redis

January 26, 2023

Today I fixed the read from Redis; it turns out that I was using the wrong implementation of gdscript range.

However, it looks like the structure of the data emitted from Tokengraph to Kafka is malformed; I will need to look into that and ensure that I’m sending the sort of packet that is expected by comms and also by the ProtongraphConsumer. Certainly the returned response is not being rendered, and the instance service is complaining about a route not existing: the former indicates that the response has the incorrect structure, and the latter indicates that there are params in the response that are not accepted by the route.

I’ll attempt to debug and fix those things tomorrow.

Wrapped executable in Valgrind + defer

January 25, 2023

Today I:

  • Used shc to compile my original shell command for running the headless binary in the container, reducing it to a wrapped binary,
  • Installed valgrind in the application container and invoked the wrapped binary using it with the default memcheck option,
  • Used a custom Defer implementation from here: https://gist.github.com/skaslev/4b6dd4166e5cc88a6721 in order to simplify garbage collection in my hiredis gdnative methods (a rough sketch of the pattern is below).
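
For my own notes, the shape of that pattern is roughly the following. This is a minimal sketch assuming a hiredis-style GET wrapper; the Defer type and the function here are illustrative rather than the exact gist code or the actual Tokengraph module.

#include <hiredis/hiredis.h>
#include <functional>
#include <string>

// Minimal defer / scope-guard: runs the supplied callable when it goes out of scope.
struct Defer {
    std::function<void()> fn;
    explicit Defer(std::function<void()> f) : fn(std::move(f)) {}
    ~Defer() { if (fn) fn(); }
    Defer(const Defer &) = delete;
    Defer &operator=(const Defer &) = delete;
};

// Illustrative wrapper body: previously an early return could skip freeReplyObject();
// with the guard in place the reply is released on every exit path.
std::string redis_get(redisContext *ctx, const std::string &key) {
    redisReply *reply = (redisReply *)redisCommand(ctx, "GET %s", key.c_str());
    Defer cleanup([reply]() { if (reply) freeReplyObject(reply); });

    if (!reply || reply->type != REDIS_REPLY_STRING) {
        return "";  // the guard still frees the reply here
    }
    return std::string(reply->str, reply->len);  // copied out before the reply is freed
}

The main attraction is that the cleanup sits right next to the allocation, which makes the gdnative methods easier to audit for leaks.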

The memory issues appear to now have been resolved, and Valgrind is not flagging anything – although that might be due to the fact that the binary I’m interested in is wrapped within the binary it is checking, so maybe my use of it is incorrect? Regardless, no more complaints about memory, at least for now. So now I just have the JSON.parse issue remaining.

Certainly all the information is being stored in Redis now – indeed the get and incr commands seem to be working well.

However maybe my gdnative wrapper around hiredis get is not up to snuff?

The long trudge to compilation

January 24, 2023

Today I removed the submodule dependencies and made the code native. It was a fair bit of work massaging SConstruct files in order to get things to compile properly. I also needed to rename a string.h file in godot-cpp to cstring.h to prevent a collision with g++’s string.h header.

What I’d like to do next is to run Valgrind in a docker container (e.g. using https://github.com/karekoho/valgrind-container) and see if I can start debugging memory leaks within the application that way. Maybe macgrind (https://github.com/kokkonisd/macgrind) would be easier?

And then I need to find the memory leaks. This could be a good reference: https://www.cprogramming.com/debugging/valgrind.html, or I could always just read the docs: https://valgrind.org/docs/manual/quick-start.html.
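
For reference, the kind of leak memcheck is good at pinpointing in this codebase would be a hiredis reply that never gets freed. A contrived illustration (not actual Tokengraph code):

#include <hiredis/hiredis.h>

int main() {
    redisContext *ctx = redisConnect("127.0.0.1", 6379);
    redisReply *reply = (redisReply *)redisCommand(ctx, "GET some_key");
    (void)reply;  // never freed: valgrind --leak-check=full reports the block as
                  // "definitely lost" together with the allocating stack trace
    redisFree(ctx);
    return 0;
}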

Memory leaks in gdextension

January 23, 2023

It looks like what I am facing is a memory leak in gdextension: https://github.com/godotengine/godot/issues/40957#issuecomment-907511765. There are a few open issues of this ilk in godot-cpp; possibly a partial answer is to update the version of godot-cpp used in Tokengraph to something a bit more up to date.

For

corrupted size vs. prev_size

it seems that there is an interim workaround, which is to call

godot::api->godot_free(utf8_string);

on impacted strings. I might try this and see if that helps me progress.
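
In context, my understanding of the workaround is something like the following rough sketch against godot-cpp 3.x / GDNative. The function is hypothetical, and alloc_c_string() is my assumption about where the offending allocation comes from.

#include <Godot.hpp>

// Hypothetical helper: convert a godot::String to UTF-8 for hiredis, then release the
// engine-allocated buffer explicitly rather than relying on it being cleaned up later.
void use_key(godot::String key) {
    char *utf8_string = key.alloc_c_string();  // engine-allocated UTF-8 copy
    // ... hand utf8_string to hiredis, e.g. as the key for a GET / INCR ...
    godot::api->godot_free(utf8_string);       // the interim workaround: free explicitly
}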

Different error now!

free(): invalid next size (fast)

… and another one! Seems like I’m collecting them all.

realloc(): invalid next size

Apparently Valgrind is supposed to be useful for debugging these things? I might look into that, but first I might try one more thing.

Anyway I’ve opted to update to the latest godot-cpp and godot-headers per the 3.5 branch in each repo accordingly, and also remove the git submodules for them. I’ll keep trying to resolve this issue; if I still have issues I will upgrade to the godot-cpp and godot-headers master branches, i.e. gdextension.

… updated to godot-cpp master. Header includes were a bit too tricky to work out otherwise.

Incremental steps towards redis in tokengraph

January 22, 2023

Today I:

  • Altered the type signature of hiredis.get so that it would return a String,
  • Added an incr (increment) function (a rough sketch of both wrapper methods is below),
  • Removed the use of a dictionary to store job state in the Tokengraph frontend,
  • Started prototyping the use of the hiredis module methods therein.
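
For the record, a rough sketch of what I mean by those two wrapper methods, in godot-cpp 3.x style. The class name, method names, and the way the connection is held are illustrative rather than the actual module code.

#include <Godot.hpp>
#include <Reference.hpp>
#include <hiredis/hiredis.h>

using namespace godot;

class HiRedis : public Reference {
    GODOT_CLASS(HiRedis, Reference);

    redisContext *ctx = nullptr;  // assumed to be connected elsewhere (connect method omitted)

public:
    static void _register_methods() {
        register_method("redis_get", &HiRedis::redis_get);
        register_method("redis_incr", &HiRedis::redis_incr);
    }

    void _init() {}

    // GET now returns a godot::String to the calling GDScript.
    String redis_get(String key) {
        char *k = key.alloc_c_string();
        redisReply *reply = (redisReply *)redisCommand(ctx, "GET %s", k);
        api->godot_free(k);

        String value;
        if (reply && reply->type == REDIS_REPLY_STRING) {
            value = String(reply->str);
        }
        if (reply) {
            freeReplyObject(reply);
        }
        return value;
    }

    // INCR returns the post-increment counter value, or -1 if the command failed.
    int64_t redis_incr(String key) {
        char *k = key.alloc_c_string();
        redisReply *reply = (redisReply *)redisCommand(ctx, "INCR %s", k);
        api->godot_free(k);

        int64_t result = -1;
        if (reply && reply->type == REDIS_REPLY_INTEGER) {
            result = reply->integer;
        }
        if (reply) {
            freeReplyObject(reply);
        }
        return result;
    }
};

From GDScript, the idea would then be to call redis_get / redis_incr wherever the in-memory dictionary lookups used to be.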

I made a reasonable amount of progress but ended up getting a bit stuck on a couple of things.

Hopefully tomorrow I’ll be able to make some inroads in resolving these remaining matters.