
Configuring Gradle publishing with AWS S3

Using S3 as a remote maven repository can be a cost-effective alternative to some of the hosted solutions available - admittedly you won’t get tooling like Artifactory sitting on top of the repository, but if you are used to running your own privately hosted remote maven repository, then S3 is a good option for moving it off premises.

Thankfully, more modern versions of Gradle come with good support for AWS built in - if you are using Gradle pre-version 3 or so, then some of these features may not be available, but given we are approaching Gradle 7, it’s probably a good time to upgrade your tooling.

Out of the box, publishing to an S3 bucket is incredibly easy, but of course you will want to secure your bucket so it's not open for the world to push their artifacts to - and Gradle combined with the AWS SDK makes this pretty simple.

If you want Gradle to push to S3 using the normal profile credential chain (checking env vars before using default credentials etc) then it's really easy:
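(A minimal sketch using the maven-publish plugin - the bucket URL is a placeholder and the exact DSL may vary slightly by Gradle version:)

plugins {
    id 'java'
    id 'maven-publish'
}

publishing {
    publications {
        mavenJava(MavenPublication) {
            from components.java
        }
    }
    repositories {
        maven {
            url "s3://your-maven-bucket/releases"
            authentication {
                awsIm(AwsImAuthentication) // use the default AWS credential provider chain
            }
        }
    }
}
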
With the above approach, running gradle publish will attempt to publish the artifact to a maven repo in the named S3 bucket, and will look locally on the host machine for AWS credentials - this requires that you have default credentials set.

Now this might not always be ideal - for example, you might be running on a machine where the default credentials are set to some other account (not the dev tooling account where you are hosting the maven repo) and you don’t want to have to change the default credentials just to push stuff to maven. As you will have seen in the AWS docs around the profile credential chain, you can override the default profile using the environment variable AWS_PROFILE - which is a possible solution, but not ideal for a couple of reasons:

  1. Users have to remember to set it if they want to avoid using default credentials

  2. It isn’t very granular - setting the env var at gradle runtime (by exporting/setting the variable on the command line before running the gradle command) sets the variable for the entire gradle task(s) being run - there may be situations where you want finer control and need to use different AWS credentials for different jobs (I have seen this requirement in real world code)

  3. Gradle doesn't have great support for dealing with environment variables - some tasks can set them easily, others not so much (publish, for example, doesn’t support it)

Thankfully, there is still a trivial mechanism to override this - although it took me a while to stumble upon the solution, as most stackoverflow questions and gradle issue discussions only have examples using the environment variables.

The following will use the named credentials “buildtool” on the host machine to publish. Of course, this is a simple hardcoded string, but it could be passed in using other custom arguments, sidestepping the need for env vars if you wanted to override it (create a custom -D JVM argument, for example, to override it - which is much more granular, specific to your usage and much better supported in Gradle).

Note you also need to import the AWS SDK to use this one:
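(A sketch of the idea - the SDK version and bucket URL are placeholders; the key part is resolving the named profile via the AWS SDK and handing the keys to Gradle's AwsCredentials:)

import com.amazonaws.auth.profile.ProfileCredentialsProvider

buildscript {
    repositories { mavenCentral() }
    dependencies {
        // pulls the AWS SDK onto the build script classpath
        classpath 'com.amazonaws:aws-java-sdk-core:1.11.700'
    }
}

publishing {
    repositories {
        maven {
            url "s3://your-maven-bucket/releases"
            credentials(AwsCredentials) {
                def awsCreds = new ProfileCredentialsProvider("buildtool").getCredentials()
                accessKey = awsCreds.getAWSAccessKeyId()
                secretKey = awsCreds.getAWSSecretKey()
            }
        }
    }
}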

Re-thinking the Visitor pattern with Scala, Shapeless & polymorphic functions

In this article I will look at a relatively boilerplate free way to traverse tree structures in Scala, using polymorphic functions along with Shapeless' everything function.

Over the course of my career, a problem that I have had to face fairly repeatedly is dealing with nested tree like structures with arbitrary depth. From XML to directory structures to building data models, nested trees or documents are a common and pretty useful way to model data.

Early in my career (classic Java/J2EE/Spring days) I tackled them using the classic Visitor pattern from the Gang of Four, and I have probably had more than my fair share of implementing that pattern. Whilst working in Groovy I re-imagined the pattern to make it a little more idiomatic (dealing with mostly Maps and Lists), and now I am working in Scala the problem has arisen once again.

There are lots of things that Scala handles well - I do generally like its type system, and everyone always raves about the pattern matching (which is undeniably useful) - but it has always irked me a bit that when dealing with child classes I have to match on every implementation to do something. I always feel like it's something I should be able to do with type classes, and inevitably end up a little sad every time I remember I can't. Let me explain with a quick example: let's imagine we are modelling a structure like XML (I will assume we all know XML, but the format essentially allows you to define nested tree structures of elements - an element can be a complex type, e.g. like a directory/object that holds further children elements, or a simple type, e.g. a string element that holds a string).
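The model might look something like this (a sketch - the names are illustrative, but the shape is what matters):

sealed trait Element

case class ComplexElement(value: List[Element]) extends Element
case class StringElement(value: String) extends Element
case class BooleanElement(value: Boolean) extends Element
case class DoubleElement(value: Double) extends Element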

Above is a basic setup to model a tree structure - we have our sealed trait for the generic element, we then have a class for the complex element (that is, an element that can have a further list of element children) and then a couple of basic classes for the simple elements (String/Boolean/Double).

Now, when we have a ComplexElement and we want to process its children, a List[Element], ideally type classes would come to our rescue, like this:
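(A sketch of the type class and its instances - the validation rules themselves are just for illustration:)

trait ValidatorTypeClass[A] {
  def validate(a: A): Boolean
}

object ValidatorTypeClass {
  implicit val stringValidator: ValidatorTypeClass[StringElement] =
    (e: StringElement) => e.value.nonEmpty

  implicit val booleanValidator: ValidatorTypeClass[BooleanElement] =
    (_: BooleanElement) => true

  implicit val doubleValidator: ValidatorTypeClass[DoubleElement] =
    (e: DoubleElement) => !e.value.isNaN

  // the complex element recursively hands each child off to the type class for Element..
  implicit val complexValidator: ValidatorTypeClass[ComplexElement] =
    (e: ComplexElement) => e.value.forall(child => implicitly[ValidatorTypeClass[Element]].validate(child))
}

def validate[A](a: A)(implicit v: ValidatorTypeClass[A]): Boolean = v.validate(a)

val complex = ComplexElement(List(StringElement("hello"), BooleanElement(true)))
validate(complex)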

Above we have a simple ValidatorTypeClass, for which we define implementations for all the different types we care about, and from there it looks relatively simple to traverse a nested structure - the type class for the ComplexElement simply iterates through the children and recursively passes each one to the child element's type class to handle the logic. (Note: I will use validation as an example throughout this article, but that is just for the sake of a simple illustration - there are many better ways to perform simple attribute validation in Scala - it just provides an example context for the problem.)

However, if you run the above code, you will get an error like this:
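(The exact wording depends on the compiler version and how the instance is summoned, but it is the standard missing-implicit error, something along the lines of:)

could not find implicit value for parameter e: ValidatorTypeClass[Element]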

The reason is that it's looking for an implicit type class to handle the parent type Element (ComplexElement's value attribute is defined as List[Element]), which we haven't defined. Sure, we could define that type class, ValidatorTypeClass[Element], and simply pattern match the input across all the implemented types, but at that point there's little benefit in having type classes at all - you just end up with a big old pattern matching block. Which is fine, but it feels kind of verbose, especially when the block has to be repeated throughout the code, as you inevitably have to handle the tree structure in several different places/ways.

So I wanted to find a better way, and having written about Shapeless a couple of times before, once again I turned there.

Enter Shapeless

The good news is, Shapeless has some tools that can help improve this - the bad news is, there isn't really any documentation on some of the features (beyond reading the source code and unit tests), and some of it just doesn't seem to be mentioned anywhere at all! I had previously used a function that Shapeless provides called everywhere - even that function isn't really explicitly called out in the docs, but I stumbled upon it in an article about what was new in Shapeless 2.0, where it was used in an example piece of code without any mention or explanation. everywhere allows in-place editing of tree-like structures (or any structures, really) and is based on the ideas laid out in the Scrap Your Boilerplate (SYB) paper that large parts of the Shapeless library were based on.

As well as everywhere, Shapeless also provides a function called everything, which is also from the SYB paper - instead of editing, it lets you simply traverse, or visit, generic data structures. It's pretty simple conceptually, but finding any mention of it in docs or footnotes was hard (I found it reading the source code), so let's go through it.

everything takes three arguments:
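With the two polymorphic functions we define below (validates and combine) and our root element (complex), the call ends up looking something like this (everything comes in with import shapeless._):

val allValid: Boolean = everything(validates)(combine)(complex)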

The first one is a polymorphic function that we want to apply to every step of the data structure, combine is a polymorphic function that combines the results, and complex (the third argument above) is our input - in this case the root of our nested data model.

So let's start with our polymorphic function for validating every step - this will be applied to every attribute on each class, including lists, maps and other classes, which will then get traversed as well (you can find out more about polymorphic functions and how they are implemented with Shapeless here):
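(A sketch of the two functions - the low-priority default case in the parent trait is what lets the traversal pass over types we don't care about:)

import shapeless._

trait LowPriorityValidates extends Poly1 {
  // default case: anything without a validator is simply passed over
  implicit def caseDefault[T]: Case.Aux[T, Boolean] = at[T](_ => true)
}

object validates extends LowPriorityValidates {
  // any type that has a ValidatorTypeClass in scope gets validated
  implicit def caseValidated[T](implicit v: ValidatorTypeClass[T]): Case.Aux[T, Boolean] =
    at[T](t => v.validate(t))
}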

So what is happening here? And why do we have two polymorphic functions? Well, let's start with the second one, validates, which is going to be handling the validation. Remember our lovely and simple type class we defined earlier? We are going to use it here: in this polymorphic function we simply define an implicit case that will match on any attribute it finds that has an implicit ValidatorTypeClass in scope, and run the validation (in our simple example, returning a boolean result for whether it passes or fails).

Now, there are also going to be other types in our structure that we essentially want to ignore - they might be simple attributes (Strings, etc) or they might be Lists that we want to continue to traverse but, as types in themselves, can just pass over. For this we need a polymorphic function that is essentially a no-op and returns true. As the cases in the polymorphic function are implicits, we need to have the default case in the parent class so it is resolved with a lower priority than our validating implicit.

So, everything is going to handle the generic traversal of our data structure, whatever that might look like, and this polymorphic function is going to return a boolean to indicate whether each element in the structure is ok - now, as mentioned, we need to combine all these results from our structure.

To do that, we just define another polymorphic function, this time with arity 2, to define how we combine the results - which in the case of booleans is really very simple:
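(A minimal sketch:)

import shapeless._

object combine extends Poly2 {
  implicit val caseBooleans: Case.Aux[Boolean, Boolean, Boolean] =
    at[Boolean, Boolean](_ && _)
}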

This combinator will simply combine the booleans, and as soon as one element fails, the overall answer will be false.

And that's it! Shapeless' everything handles the boilerplate, and with the addition of those minimal polymorphic functions we don't need to worry about traversing anything or pattern matching on parent types - so it ends up really quite nice. Nine extra lines of code, and our type class approach works after all!


Footnote 1: Further removing boilerplate

If you found yourself writing code like this a lot, you could further simplify it, by changing our implicit ValidatorTypeClass to a more broad VisitorTypeClass and provide a common set of combinators for the combine polymorphic function, and then all you would need to do each time is provide the specific type class implementation of VisitorTypeClass and it would just work as if by magic.

Footnote 2: A better validation

As mentioned, the validation example was purely illustrative, as it's a simple domain to understand, and there are other, better ways to perform simple validation (at time of construction, other libraries etc). But if we did want this to perform validation, rather than returning booleans we could look to use something like Validated from Cats - this would allow us to accumulate meaningful failures throughout the traversal. This is really simple to drop in, and all we would need to do is implement the combine polymorphic function for the ValidatedNel class:
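Something like the following sketch would do it, assuming our validators now return a ValidatedNel rather than a Boolean:

import cats.Semigroup
import cats.data.ValidatedNel
import shapeless._

object combine extends Poly2 {
  implicit def caseValidated[E, A](implicit s: Semigroup[ValidatedNel[E, A]]): Case.Aux[ValidatedNel[E, A], ValidatedNel[E, A], ValidatedNel[E, A]] =
    at[ValidatedNel[E, A], ValidatedNel[E, A]]((a, b) => s.combine(a, b))
}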

Thankfully, Cats' ValidatedNel is a Semigroup implementation, so it already provides the combine method itself - all we need to do is call that! (Note: you will need to provide a Semigroup implementation for whatever right hand side you choose to use for Validated, but that's trivial for most types.)



Kubernetes & Prometheus: Getting started

I have recently started working on a migration process to move our company deployments over to Kubernetes (from Fleet, if you were interested, which was a pretty cutting edge technology at the time of deployment, but it is pretty low level and you have to provide stuff like load balancing and DNS yourself).

A colleague of mine had already done the hard work of actually spinning up a Kubernetes cluster on AWS (using EKS) and generally most of the boilerplate around service deployment, so having had a general intro and deployed my first service (a single microservice deployed as a Kubernetes “service” running inside a “pod”), which mostly just involved copy and pasting from my colleague's examples, my next goal was to deploy our monitoring setup. We currently use Prometheus & Grafana, and those still seem to be best in class monitoring systems, especially with Kubernetes.

The setup process is actually pretty simple to get up and running (at least if you are using Helm) but it did catch me out a couple times, so here are some notes.

Pre-requisites:
  1. A cluster running Kubernetes (as mentioned, we are using an AWS cluster on EKS)
  2. Kubectl & Helm running locally and connecting correctly to your kubernetes cluster (kubectl version should display both the client and server version details ok)

Let’s get started by getting our cluster ready to use Helm (Helm is a Kubernetes package manager that can be used to install pre-packaged "charts"), to do this we need to install the server side element of Helm, called Tiller, onto our Kubernetes cluster:

$ kubectl create serviceaccount tiller --namespace kube-system
$ kubectl apply -f tiller-role-binding.yml
$ helm init --service-account tiller


The above does three things:
  1. It creates a Service Account in your cluster called “tiller” in the standard kubernetes namespace “kube-system” - this will be used as the account for all the tiller operations
  2. Next we apply the role binding for the cluster - here we define a ClusterRoleBinding that grants the new Service Account the permissions it needs
  3. Finally we initialise Helm/Tiller, referencing our new Service Account. This step effectively installs Tiller on our kubernetes cluster

Straightforward enough. For Prometheus we will be using the Helm packaged Prometheus Operator. A Kubernetes Operator is an approach to packaging an application so that it can be deployed on Kubernetes and also managed by the Kubernetes API - you can read more about operators here, and there are lots of operators already created for a range of applications.

As I found myself repeatedly updating the config for the install, I preferred to use the Helm “upgrade” method rather than install (upgrade works even if it has never been installed):

helm upgrade -f prometheus-config.yml \
      prometheus-operator stable/prometheus-operator \
      --namespace monitoring --install


The above command upgrades/installs the stable/prometheus-operator package (provided by CoreOS) into the “monitoring” namespace and names the install release as “prometheus-operator”.

At the start, the config was simply:
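It looked roughly like this - a sketch from memory, so treat the exact keys as assumptions and check the chart's values.yaml for your version:

# prometheus-config.yml
prometheusOperator:
  # (assumed key name) namespaces the operator/Prometheus should watch,
  # including our application namespace
  namespaces:
    - kube-system
    - monitoring
    - dev-apps

kubelet:
  serviceMonitor:
    # scrape the kubelet over https on 10250 (the read-only 10255 port is disabled - see below)
    https: true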


This config (which could have been passed as CLI arguments using “--set”, but was moved to a dedicated file simply because later on we will add a bunch more config) addresses two challenges we faced:

Adding Namespaces:

We use a dedicated namespace for our applications (in the case above, “dev-apps”), and as well as the core kubernetes health we wanted to monitor the applications themselves, so we had to add that namespace to the config so Prometheus could monitor it as well.

Monitoring the right port:

The next one was more of a head-scratcher, and used up a lot more time figuring it out. With the stable/prometheus-operator Helm install, we noticed that on the targets page of Prometheus that the monitoring/prometheus-operator-kubelet/0 and monitoring/prometheus-operator-kubelet/1 were both showing as 0/3 up.

Our targets were showing 
monitoring/prometheus-operator-kubelet/0 (0/3 up)
monitoring/prometheus-operator-kubelet/1 (0/3 up)



They all looked correct, much like the other targets that were reported as being up, and were hitting the endpoints http://127.0.0.1:10255/metrics and /metrics/cadvisor, but were all showing errors registered as "connect: connection refused".

Initial googling revealed it was a fairly common symptom; however, rather misleadingly, all the issues documented were around particular flags that needed to be set and issues with auth (the errors listed were 401/403 rather than our “connect: connection refused”) - and it is also covered in the Prometheus-Operator troubleshooting section.

After much digging, what actually seemed to have caught us out was some conflicting default behaviour.

The Prometheus Operator defines three different ports to monitor the kubelet on:
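From memory these are roughly as follows (the cadvisor port in particular may differ between versions):

https-metrics -> 10250
http-metrics  -> 10255
cadvisor      -> 4194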


And of the three ports defined by the source Prometheus Operator code, the Helm chart is currently set to default to “http-metrics”, i.e. to use port 10255.


However, more recently that read-only port, 10255, has been disabled and is no longer open to monitor against. This meant we had conflicting default behaviour across the software - so we had to explicitly override the default behaviour on the prometheus operator by setting the kubelet.serviceMonitor.https flag to true.
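The relevant part of the chart's kubelet ServiceMonitor template looks roughly like this (paraphrased from memory, not copied verbatim):

{{- if .Values.kubelet.serviceMonitor.https }}
  port: https-metrics
  scheme: https
{{- else }}
  port: http-metrics
  scheme: http
{{- end }}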


As you can see in the above defaulting, it switches between the http-metrics port and the https-metrics port based on the kubelet.serviceMonitor.https flag. Explicitly including that in our config overrode the default value, it switched to monitoring on 10250 and all was fine then.


I expect the default behaviour will be changed soon, so this gotcha will hopefully be short lived, but in case it helps anyone else I will leave it here.

Next up I will attempt to explain some of the magic behind the Prometheus configuration and how it can be setup to easily monitor all (or any) of your Kubernetes services.


Machine Learning with AWS & Scala

Recently, in an attempt to start learning React, I began building an akka-http backend API as a starting point. I quickly got distracted building the backend and ended up integrating with both the Twitter streaming API and AWS' Comprehend sentiment analysis API - which is what this post will be about.

Similar to an old idea, where I built an app consuming tweets about the 2015 Rugby world cup, this time my app was consuming tweets about the FIFA world cup in Russia - splitting tweets by country and recording sentiment for each one (and so a rolling average sentiment for each team).


Overview

The premise was simple:

  1. Connect to the Twitter streaming API (aka the firehose) filtering on world cup related key words
  2. Pass the body of the tweet to AWS Comprehend to get the sentiment score
  3. Update the in memory store of stats (count and average sentiment) for each country

In terms of technology used:
  1. Scala & Akka-Http
  2. Twitter4s Scala client
  3. AWS Java SDK

As always, all the code is on Github - to run it locally, you will need a Twitter dev API key (add an application.conf as per the readme on the Twitter4s github) and you will also need an AWS key/secret - the code will look for credentials stored locally but you can also just set them in environment variables before starting. The free tier supports up to 50,000 Comprehend API requests in the first 12 months - and as you can imagine, plugging this directly into twitter can result in lots of calls, so make sure you restrict it (or at least monitor it) before you leave it running!


Consuming Tweets

Consuming tweets is really simple with the Twitter4s client - we just define a partial function that will handle the incoming tweet. 
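Roughly like this (a sketch - the twitter4s package and method names are from memory, and the processing body is simplified):

import com.danielasfregola.twitter4s.TwitterStreamingClient
import com.danielasfregola.twitter4s.entities.Tweet
import com.danielasfregola.twitter4s.entities.streaming.StreamingMessage

val streamingClient = TwitterStreamingClient()

def processTweet: PartialFunction[StreamingMessage, Unit] = {
  case tweet: Tweet =>
    // in the real app this is where the sentiment service and the stats actor get called
    println(tweet.text)
}

streamingClient.filterStatuses(tracks = Seq("worldcup", "WorldCup2018"))(processTweet)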

The other functions for parsing countries/teams are excluded for brevity - and you can see it's quite simple: for each inbound tweet we make a call to the Sentiment Service (we will look at that later), then pass it with the additional data to our update service, which stores it in memory. You will also see it is ridiculously easy to start the Twitter streaming client filtering by key words.


Detecting Sentiment

Because I wanted to be able to stub out the sentiment analysis without being tied to AWS, you will notice I am using the self-type annotation on my twitter class above, which requires a SentimentModule to be passed in at construction - I am using a simple cake pattern to manage all my dependencies here. In the Github repo, there is also a Dummy implementation, that will just pick a random number for the score, so you can still see the rest of the API working - but the interesting part is the AWS integration:
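It looks something like this (a sketch against the v1 Java SDK - the region and the mapping back to my simplified rating are illustrative):

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.comprehend.AmazonComprehendClientBuilder
import com.amazonaws.services.comprehend.model.DetectSentimentRequest

val comprehendClient = AmazonComprehendClientBuilder.standard()
  .withCredentials(new DefaultAWSCredentialsProviderChain())
  .withRegion("eu-west-1")
  .build()

def sentiment(text: String): String = {
  val request = new DetectSentimentRequest().withText(text).withLanguageCode("en")
  val result = comprehendClient.detectSentiment(request)
  // result.getSentimentScore has the detailed numbers; here we just take the overall label
  result.getSentiment // "POSITIVE", "NEGATIVE", "NEUTRAL" or "MIXED"
}
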
Once again, the SDK makes the integration really painless - in my code I am simplifying the actual results a lot to a much cruder Positive/Neutral/Negative rating (plus a numeric score -100..100).

The credentials provider chain is the bit that is going to look in the normal places for an AWS key.


Storing and updating our stats

So now we have our inbound tweets and a way to assess their sentiment score - I then set up a very simple akka actor to manage the state and just stored the API data in memory (if you restart the app, the store gets reset and the API stops serving data).
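Something along these lines (the message and state shapes here are assumptions - the real code is in the GitHub repo):

import akka.actor.{Actor, ActorSystem, Props}

case class CountryStats(count: Int, averageSentiment: Double)
case class TweetScored(countries: Set[String], score: Double)
case object GetStats

class StatsActor extends Actor {
  private var stats = Map.empty[String, CountryStats]

  def receive: Receive = {
    case TweetScored(countries, score) =>
      countries.foreach { country =>
        val current = stats.getOrElse(country, CountryStats(0, 0.0))
        val newAverage = (current.averageSentiment * current.count + score) / (current.count + 1)
        stats += country -> CountryStats(current.count + 1, newAverage)
      }
    case GetStats =>
      sender() ! stats
  }
}

val system = ActorSystem("world-cup")
val statsActor = system.actorOf(Props[StatsActor](), "stats")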

Again, very simple out of the box stuff for akka, but it allows easy and thread safe management of the in-memory data store. I also track a rolling list of the last twenty tweets processed, which is managed by a second, almost identical, actor.


The results

I ran the app during several games; below are some sample outputs from the API. The response from the stats API is fairly boring reading (just numbers), but the example tweets show a positive and a neutral tweet correctly identified (apologies for the expletives in the tweet about Poland - I guess that fan wasn't too happy about being beaten by the Senegalese!). You will also notice the app captures the countries being mentioned, which exposes one flaw of the design: in the negative tweet from the Polish fan losing two goals to Senegal, it correctly identifies the sentiment as negative, but we have no way to determine the subject - as both teams are mentioned, the app naively assigns it as a negative tweet to both of the teams, whereas on reading, it is clearly negative with regards to Poland (I wasn't too concerned for my experiment, of course, just an observation worth noting).

Sample tweet from the latest API:

Sample response from the stats API:

When I finally did get around to starting to learn React, I just plugged in the APIs and paid no attention to styling, which is a roundabout way of apologising for the horrible appearance of the screenshot below (I'm really sorry about the css gradient)!





Serverless with AWS Lambda & Scala

About a year ago, I started looking at AWS's serverless offering, AWS Lambda. The premise is relatively simple: rather than a full server that you manage and deploy your docker/web servers to, you just define a single function endpoint, map that to the API gateway, and you have an infinitely* scalable endpoint.

The appeal is fairly obvious - no maintaining or upgrading servers, fully scalable and pay per second of usage (so no cost for AWS Lambda functions while they are not being called). I haven't looked into the performance of using the JVM based Lambda functions, but my assumption is that there will be potential performance costs if your function isn't frequently used, as AWS will have to start up the function, load its dependencies etc, so depending on your use case it would be advisable to do some performance benchmarking before putting it into production use.

When I first looked into AWS Lambda a year ago, it was less mature than it is today, and dealing with input/output JSON objects required annotating POJOs, so I decided to start putting together a small library to make it easier to work with AWS Lambda in a more idiomatic Scala way - using Circe and its automatic encoder/decoder generation with Shapeless. The code is all available on github.

Getting Started

To deploy on AWS I used a framework called Serverless - this is a really easy framework to setup serverless functions on a range of cloud providers. Once you have followed the pre-requisite install steps, you can simply run:

serverless create --template aws-java-gradle 

This will generate a Java (JVM) based gradle project template, with a YML configuration file in the root that defines your endpoints and function call. If you look in the src folder you will also see the classes for a very simple function that you can deploy to check your Lambda works as expected (you should also take the time at this point to login to your AWS console and have a look at what has been created in the Lambda and API Gateway sections). You should now be able to curl your API endpoint (or use the serverless cli with a command like: serverless invoke -f YOUR_FUNCTION_NAME -l).

ScaLambda - AWS Lambda with idiomatic Scala

Ok, so we have a nice simple Java based AWS Lambda function deployed and working - let's look at moving it to Scala. As you try to build an API in this way you will need to be able to define endpoints that can receive inbound JSON being posted as well as return fixed JSON structures. AWS provides its inbuilt de/serialisation support, but inevitably you will have a type that needs further customisation of how it is de/serialized (UUIDs maybe, custom date formats etc) - there are a few nice libraries that can handle this stuff, and Scala has some nice ways to simplify it.

We can simply upgrade our new Java project to a Scala one (either convert the build.gradle to an sbt file, or just add Scala dependency/plugins to the build file as is) and then add the dependency:



We can now update the input/output classes so they are just normal Scala case classes:
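For example (the fields here are illustrative - TestOutput is referenced again below):

case class TestInput(name: String)
case class TestOutput(message: String)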


Not a huge change from the POJOs we had, but it is both more idiomatic and also means you can re-use case classes that you already have in other Scala projects/libraries elsewhere in your tech stack.

Next we can update the request handler - this will also result in quite similar looking code to the original generated Java code, but it will be in Scala and will be backed by Circe and its automatic JSON encoder/decoder derivation.
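A sketch of what the handler ends up looking like - Controller, ApiResponse and handleRequest follow the names used in this post, but the exact signatures and imports live in the library itself:

class TestHandler extends Controller[TestInput, TestOutput] {
  // (the default exception handling/serialisation components described below also get mixed in)

  override def handleRequest(input: TestInput): ApiResponse[TestOutput] =
    ApiResponse.success(TestOutput(s"Hello, ${input.name}"))
}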



You will see that similar to the AWS Java class we define generic parameter types for the class that represents the input case class and the output case class and then you simply implement the handleRequest method which expects the input class and returns the output response.

You might notice the return type is wrapped in the ApiResponse class - this is simply an alias for a Scala Either[Exception, T] - which means if you need to respond with an error from your function you can just return an exception rather than the TestOutput. To simplify this, there is an ApiResponse companion object that provides a success and failure method:
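For illustration (assuming failure takes the exception to return):

ApiResponse.success(TestOutput("all good"))
ApiResponse.failure(new IllegalStateException("something went wrong"))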

All the JSON serialisation/de-serialisation will use Circe's auto derived code which relies on Shapeless - if you use custom types that cannot be automatically derived, then you can just define implicit encoder/decoders for your type and they will be used.

Error handling

The library also has support for error handling - as the ApiResponse class supports returning exceptions, we need to map those exceptions back to something that can be returned by our API. To support this, the Controller class that we have implemented for our Lambda function expects (via self type annotations) to be provided an implementation of the ExceptionHandlerComponent trait and of the ResponseSerializerComponent trait.

Out of the box, the library provides a default implementation of each of these that can be used, but they can easily be replaced with custom implementations to handle any custom exception handling required:

Custom response envelopes

We mentioned above that we also need to provide an implementation of the ResponseSerializerComponent trait. A common pattern in building APIs is the need to wrap all response messages in a custom envelope or response wrapper - we might want to include status codes or additional metadata (paging, rate limiting etc) - and this is the job of the ResponseSerializerComponent. The default implementation simply wraps the response inside a basic response message with a status code included, but this could easily be extended/changed as needed.

Conclusion

The project is still in early stages of exploring the AWS Lambda stuff, but hopefully is starting to provide a useful approach to idiomatic Scala with AWS Lambda functions, allowing re-use of error handling and serialisation so you can just focus on the business logic required for the function.


An opinionated guide to building APIs with Akka-Http

Akka-Http is my preferred framework for building APIs, but there are some things I have picked up along the way. For one thing, Akka-Http is very un-opinionated in its approach: there are often lots of ways to do the same thing, and there isn't a lot of opinionated guidance about how to do things.

I have been writing Akka-Http APIs for about 18 months now (not long, I know), having previously worked predominantly with libraries like Spring, and I have seen some pretty nasty code result from this (by which I mean I have written nasty code - not intentionally, of course, but from good intentions, starting off trying to write clean, idiomatic Akka-Http code and ending up with huge sprawling routing classes which are unreadable and generally not very nice).

The routing DSL in Akka-Http is pretty nice, but can quickly become unwieldy. For example, let's imagine you start off with something like this:
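(A representative hello-world style route:)

import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

val routes: Route =
  pathPrefix("hello") {
    pathEndOrSingleSlash {
      get {
        complete("Hello, world")
      }
    }
  }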


This looks nice, right? A simple nested approach to the routing structure that reflects the URL hierarchy and the HTTP method etc. However, as you can probably imagine, try to scale this up to a full application and it can very easily become fairly messy. The nested directives make it nice to group routes under similar root URLs, but as you do that you end up with very long, arrow-shaped code that actually isn’t that easy to follow - if you have several endpoints nested within the structure it becomes quite hard to work out what endpoints there are and what is handling what.

Another problem is that with the first one or two endpoints you might put the handling code directly in the routing structure, which is ok for very small numbers, but this needs to be managed sensibly as the endpoints grow and your routing structure starts to look more and more sprawling.

It is of course personal preference, but even with the simple example above, I don’t like the level of nesting that already exists there to simply define the mapping of the GET HTTP method and a given URL - and if you add more endpoints and start to break down the URL with additional directives per URL section then the nesting increases.


To simplify the code, and keep it clean from the start I go for the following approach:

  1. Make sure your Routing classes are sensibly separated - probably by the URL root (e.g. have a single UserRoutes class that handles all URLs under /users) to avoid them growing too much
  2. Hand off all business logic (well, within reason) to a service class - I use Scala’s Self-Type notation to handle this and keep it nicely de-coupled
  3. Use custom directives & non-nested routings to make the DSL more concise

Most of these steps are simple and self explanatory, so it's probably just step 3 that needs some more explanation. To start with, here is a simple example:
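(A sketch of the shape this takes - UserRoutes/UserService and the endpoints are illustrative, and getPath/respond/Response are the custom pieces described in the following sections:)

import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

case class User(id: String, name: String)

trait UserService {
  def getAllUsers: Response[List[User]]
  def getAdminUsers: Response[List[User]]
}

class UserRoutes { this: UserService =>

  val routes: Route =
    getPath("users") {
      respond(getAllUsers)
    } ~
    getPath("users" / "admins") {
      respond(getAdminUsers)
    }
}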

You can see points 1 and 2 simply enough, but you will also notice that my endpoints are simple functions, without multiple levels of nesting (we may need some additional nesting at some point, as some endpoints will likely need other akka-http directives, but we can strive to keep it minimal). 

You might notice I have duplicated the URL section “users” rather than nesting it - some people might not like this duplication (and I guess risk of error/divergence of URLs - but that can be mitigated with having predefined constants instead of explicit strings), but I prefer the readability and simplicity of this over extensive nesting.


Custom Directives

First off, I have simply combined a couple of existing directives to make things more concise. Normally, you might have several levels of nested directives, such as one or more pathPrefix(“path”) sections, the HTTP method such as get{}, and another one to match pathEndOrSingleSlash{} - to avoid this I have concatenated some of these into convenient single directives.


getPath, postPath, putPath, etc simply combine the HTTP method with the URL path-matcher, and also include the existing Akka-Http directive “redirectToTrailingSlashIfMissing”, which avoids having to specify matching on either a slash or path end and instead allows you to always match exact paths - it basically squashes the three directives in the original HelloWorld example above down to one simple, readable directive.
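A minimal sketch of the idea, for paths that capture no values (the real directives also support path matchers that extract values, and fold in redirectToTrailingSlashIfMissing):

import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.{Directive0, PathMatcher0}

def getPath(pm: PathMatcher0): Directive0 =
  get & pathPrefix(pm) & pathEndOrSingleSlash

def postPath(pm: PathMatcher0): Directive0 =
  post & pathPrefix(pm) & pathEndOrSingleSlash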


Custom Serialisation

You may also notice I have implemented a custom method called “respond” - I use this to handle the serialisation of the response to a common JSON shape and to handle errors. Using this approach, I define a custom Response wrapper type that is essentially an Either of our internal custom error type and a valid response type T (implementation details below) - this means that in all our code we have a consistent type that can be used to handle errors and ensure consistent responses.

This respond method simply expects a Response type to be passed to it (along with an optional success status code - defaulting to 200 OK, but can be provided to support alternative success codes). The method then uses Circe and Shapeless to convert the Response to a common JSON object. 

Let’s have a look at some of the details, first the custom types I have defined for errors and custom Response type:
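Roughly like this (a sketch - the error ADT here is illustrative):

import akka.http.scaladsl.model.{StatusCode, StatusCodes}

object ApiModel {

  sealed trait ApiError { def message: String; def status: StatusCode }
  case class NotFound(message: String) extends ApiError { val status = StatusCodes.NotFound }
  case class UnexpectedError(message: String) extends ApiError { val status = StatusCodes.InternalServerError }

  // either an internal error, or a valid response of type T
  type Response[T] = Either[ApiError, T]
}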


Simple. Now let's take a look at the implementation of the respond method:
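A rough reconstruction of the idea (the envelope shape and the use of akka-http-circe here are assumptions - the point is that the implicit Encoders give the compile-time guarantee that the payload can be serialised):

import akka.http.scaladsl.model.{StatusCode, StatusCodes}
import akka.http.scaladsl.server.Directives.complete
import akka.http.scaladsl.server.Route
import de.heikoseeberger.akkahttpcirce.FailFastCirceSupport._
import io.circe.{Encoder, Json}
import ApiModel._

def wrap(payload: Json, status: StatusCode): Json =
  Json.obj("status" -> Json.fromInt(status.intValue), "response" -> payload)

def respond[A](response: Response[A], successCode: StatusCode = StatusCodes.OK)
              (implicit encoder: Encoder[A], errorEncoder: Encoder[ApiError]): Route =
  response match {
    case Right(value) => complete(successCode -> wrap(encoder(value), successCode))
    case Left(error)  => complete(error.status -> wrap(errorEncoder(error), error.status))
  }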

It might look daunting (or not, depending on your familiarity with Scala and Shapeless), but it's relatively simple. The two implicit Encoder arguments included on the method signature simply ensure that whatever type A is in the provided Response[A], Circe & Shapeless are able to serialise it - if you try to pass some response to this method that can’t be serialised, you get a compile error. After that, all it does is wrap the response A in a common message and return that along with an appropriate (or provided) HTTP status code.

You might also notice the final result is built using the wrap method in the ResponseWrapperEncoder trait - this allows easy extension/overriding of what the common response message looks like.


Conclusion

All of this machinery is of course abstracted away to a common library that can be used across different projects, and so, in reality it means we have a consistent, clean API with simple routing classes as simple and neat as below, whilst also handing off our business logic to neater, testable services.

All the code for my opinionated library and an example API is all on GitHub, and it is currently in progress with more ideas underway!

Generic programming in Scala with Shapeless

In my last post about evolutionary computing, I mentioned I started building the project partly just so I could have a play with Shapeless. Shapeless is a generic programming library for Scala, which is growing in popularity - but can be fairly complicated to get started with.

I had used Shapeless from a distance up until this point - using libraries like Circe or Spray-Json-Shapeless that use Shapeless under the hood to do stuff like JSON de/serialisation without the boilerplate overhead of having to define de/serialisation for all the different case classes/sealed traits that we want to serialise.

You can read the previous post for more details (and the code) if you want to understand exactly what the goal was, but to simplify, the first thing I needed to achieve was for a given case class, generate a brand new instance.

Now I wanted the client of the library to be able to pass in any case class they so wished, so I needed to be able to generically inspect all attributes of any given case class* and then decide how to generate a new instance for it.

Thinking about this problem in traditional JVM terms, you may be thinking you could use reflection - which is certainly one option, although some people still have a strong dislike for reflection on the JVM, and it also has the downside that it's not type safe - that is, if someone passes in some attribute that you don't want to support (a DateTime, UUID, etc) then it has to be handled at runtime (which also means more exception handling code, as well as losing the compile time safety). Likewise, you could just ask clients of the code to pass in a Map of the attributes, which dodges the need for reflection but has no type safety either.



Enter Shapeless

Shapeless allows us to represent case classes and sealed traits in a generic structure, that can be easily inspected, transformed and put into new classes.

A case class, product and HList walk into a bar

A case class can be thought of as a product (in the functional programming, algebraic data type sense), that is, case class Person(name: String, age: Int, living: Boolean) is the product of name AND age AND living - that is, every time you have an instance of that case class you will always have an instance of name AND age AND living. Great. So, as well as being a product, the case class in this example also carries semantic meaning in the type itself - that is, for this instance as well as having these three attributes we also know an additional piece of information that this is a Person. This is of course super helpful, and kinda central to the idea of a type system! But maybe sometimes, we want to be able to just generically (but still type safe!) say I have some code that doesn't care about the semantics of what something is, but if it has three attributes of String, Int, Boolean, then I can handle it - and that's what Shapeless provides - It provides a structure called a HList (a Heterogeneous List) that allows you to define a type as a list of different types, but in a compile time checked way.

Rather than a standard List in Scala (or Java), whereby we define the single type that the list will contain, with a HList we can define the list with specific types per element, for example:
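(Using the Person shape from above:)

import shapeless.{::, HNil}

val personHList: String :: Int :: Boolean :: HNil = "Rob" :: 34 :: true :: HNil
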
The above example allows us to define our HList with specific types, which provides our compile time safety - the :: notation is some syntactic sugar provided by Shapeless to mirror normal Scala list behaviour, and HNil is the type for an empty HList.

Case class to HList and back again

Ok, cool - we have a way to explicitly define very specific heterogeneous lists of types - how can this help us? We might not want to be explicitly defining these lists everywhere. Once again Shapeless steps in and provides a type class called Generic.

The Generic type class allows conversion of a case class to a HList, and back again, for example:
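(Summoning the derived Generic instance for our Person case class:)

import shapeless.Generic

case class Person(name: String, age: Int, living: Boolean)

val personGen = Generic[Person]
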
Through compile time macros, Shapeless provides us these Generic type classes for any case class. We can then use these Generic classes to convert to and from case classes and HLists:
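For example:

val hlist = personGen.to(Person("Rob", 34, living = true))   // "Rob" :: 34 :: true :: HNil
val person = personGen.from(hlist)                           // back to a Person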



Back to the problem at hand

So, we now have the means to convert any given case class to a HList (that can be inspected, traversed etc) - which is great, as this means we can allow our client to pass in any case class and we just need to be able to handle the generation of/from a HList.

To start with, we will define our type class for generation:
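Something like this (names here are illustrative - the real code is in the linked repo):

trait Generator[A] {
  def generate: A
}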

Now, in the companion object we will define a helper method, that can be called simply without needing anything passed into scope, and the necessary Generator instance will be provided implicitly (and if there aren't any appropriate implicits, then we get our compile time error, which is our desired behaviour)
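(A sketch of the helper:)

object Generator {
  def generate[A](implicit gen: Generator[A]): A = gen.generate
}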

As you can see, we can just call the helper method with an appropriate type argument and as long as there is an implicit Generator in scope for that type, then it will invoke that generate method and *hopefully* generate a new instance of that type.

Next, we need to define some specific generators; for now, I will just add the implicits to support some basic types:
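For example (random generation here is purely illustrative):

import scala.util.Random

implicit val intGenerator: Generator[Int] = new Generator[Int] {
  def generate: Int = Random.nextInt()
}

implicit val doubleGenerator: Generator[Double] = new Generator[Double] {
  def generate: Double = Random.nextDouble()
}

implicit val booleanGenerator: Generator[Boolean] = new Generator[Boolean] {
  def generate: Boolean = Random.nextBoolean()
}
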
All simple so far - if we call Generator.generate[Int] then it gets the implicit Generator[Int] and generates a random int (not that useful, perhaps, but it works!)

Now comes the part where our generate method can handle a case class being passed in, and subsequently the Shapeless HList representation of that case class (we will recursively traverse the HList structure using appropriate implicit generators). First up, let's look at how we can implicitly handle a case class being passed in - at first this may start to look daunting, but it's really not that bad.. honestly:
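(A sketch, matching the description below:)

import shapeless.{Generic, HList}

implicit def genericGenerator[T, L <: HList](implicit generic: Generic.Aux[T, L],
                                             lGen: Generator[L]): Generator[T] =
  new Generator[T] {
    def generate: T = generic.from(lGen.generate)
  }
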
So, what is going on there? The implicit is bound by two types: [T, L <: HList] - which are then used in the implicit arguments passed into the method, implicit generic: Generic.Aux[T, L] and lGen: Generator[L]. This matches any type T for which there exists in scope a Shapeless Generic instance that can convert it into some HList L, and for which an implicit Generator exists for that L. Now, we know from our earlier look at Shapeless that it will provide a Generic instance for any case class (and case classes will be converted into a HList), so this implicit will be used in the scenario whereby we pass in a case class instance, as long as we have an implicit Generator in scope that can handle a HList - so let's add that next!

Let's start with the base case: as we saw earlier, HNil is the type for an empty HList, so let's start with that one:
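(A one-liner, really:)

import shapeless.HNil

implicit val hnilGenerator: Generator[HNil] = new Generator[HNil] {
  def generate: HNil = HNil
}
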
Remember, the goal, from a given case class (and therefore a given HList), is to generate a new one - so for an empty HList, we just want to return the same.

Next we need to handle a generator for a non-empty HList:
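(Again a sketch, matching the description below:)

import shapeless.{::, HList}

implicit def hconsGenerator[H, T <: HList](implicit hGen: Generator[H],
                                           tGen: Generator[T]): Generator[H :: T] =
  new Generator[H :: T] {
    def generate: H :: T = hGen.generate :: tGen.generate
  }
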
So this case is also relatively simple: because our HList is essentially like a linked list, we handle it recursively. The type parameters [H, T <: HList] represent the type of the head of the HList (probably a simple type, in which case it will try to resolve the implicit using our original Int/Boolean/Double generators) and the tail of the HList, which will be another HList (possibly HNil if we are at the end of the list, otherwise this will resolve recursively to this generator again). We grab the implicit Generator for the head and the tail, then simply call generate on both, and that's really all there is to it! We have defined the implicit for the top level case class (which converts it to the generic HList and then calls generate on that), we have the implicit to recursively traverse the HList structure, and finally we have the implicits to handle the simple types that might be in our case class.

Now we can simply define a case class, pass it to our helper method and it will generate a brand new instance for us:
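(The Candidate class here is just a made-up example:)

case class Candidate(score: Double, active: Boolean, generation: Int)

val newCandidate: Candidate = Generator.generate[Candidate]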

I was using this approach to generate candidates for my evolutionary algorithm, however the exact same approach could be used easily to generate test data (or much more). I used the same approach documented here to later transform instances of case classes as well.


Here is the completed code put together:
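Below is a consolidated sketch of the pieces above, assembled so it compiles as a single file (with all the implicits moved into the Generator companion object so they are found automatically - the original code is in the linked GitHub repo):

import shapeless.{::, Generic, HList, HNil}
import scala.util.Random

trait Generator[A] {
  def generate: A
}

object Generator {
  def generate[A](implicit gen: Generator[A]): A = gen.generate

  // simple type generators
  implicit val intGenerator: Generator[Int] = new Generator[Int] {
    def generate: Int = Random.nextInt()
  }
  implicit val doubleGenerator: Generator[Double] = new Generator[Double] {
    def generate: Double = Random.nextDouble()
  }
  implicit val booleanGenerator: Generator[Boolean] = new Generator[Boolean] {
    def generate: Boolean = Random.nextBoolean()
  }

  // base case for the empty HList
  implicit val hnilGenerator: Generator[HNil] = new Generator[HNil] {
    def generate: HNil = HNil
  }

  // recursive case: generate the head, then the rest of the list
  implicit def hconsGenerator[H, T <: HList](implicit hGen: Generator[H],
                                             tGen: Generator[T]): Generator[H :: T] =
    new Generator[H :: T] {
      def generate: H :: T = hGen.generate :: tGen.generate
    }

  // any case class with a Generic representation whose HList we can generate
  implicit def genericGenerator[T, L <: HList](implicit generic: Generic.Aux[T, L],
                                               lGen: Generator[L]): Generator[T] =
    new Generator[T] {
      def generate: T = generic.from(lGen.generate)
    }
}

object Example extends App {
  case class Candidate(score: Double, active: Boolean, generation: Int)
  println(Generator.generate[Candidate])
}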


Footnote

I have mentioned so far that Shapeless provides the means to transform case classes and sealed traits to generic structures, but have only really talked about case classes. The sealed trait is a slightly different shape to a case class (it is analogous to the algebraic data type co-product) and accordingly, Shapeless has a different structure to handle it, rather than a HList it has a type called Coproduct. The above code can be extended to also handle sealed traits, to do so you just need to add the implicit to handle the empty coproduct (CNil) and recursively handle the Coproduct.

Managing your personal finance (a Java webapp)

"Cash Rules Everything Around Me" - Method Man

A few years ago I was fed up with how useless all the banks were at helping me understand my transactions/spending patterns with their online banking. Basically, they all just seemed to let you log on and then see a list of transactions (often just referencing codes) of payments in and out by date. Nothing else. It seemed like there were lots of simple things missing that could make it a lot more useful, to name a few:


  • Smarter tagging/categorisation - where some direct debit was just listed by the code of the recipient rather than meaningful info, or being able to group all the spending in coffee shops
  • Alerting to changes/unexpected behaviour - for example when a fixed-price contract term comes to an end and the price changes - detecting a change in a regular fixed amount/recipient should be easy

So this is what I started building - a lot of the functionality has since been built into banks such as Monzo, the aim of this webapp was simply to allow you to import transaction histories (most online banks I use provide the ability to export transactions in csv) so you could view/filter/tag any existing bank data. It only supports Santander at the moment, and the alerting stuff never actually got built in, but I thought I would chuck it on Github rather than have it just sit around gathering dust in a private repo.

I wrote about the project a while ago, when it was first underway, but as it has not been worked on for a while, I thought I would move it to GitHub - hence the post here today. The app and all the code can be downloaded here: https://github.com/robhinds/cream


Building & Running

It's a simple Java/Maven app (it was written several years ago, so it hasn't been migrated to Spring Boot, Gradle, Groovy etc) - and it builds a WAR file. It also uses LESS for style stuff, which is also built by maven (although watch out for silent failures if you make changes!). If you are using an eclipse based IDE you can follow the guide here to get incremental build support for LESS building (e.g. just change the LESS source, hit refresh in the browser and it will reload).

You can run the WAR file in any tomcat/IDE server, and the backend currently expects a MySql DB (but it can easily be switched for a different DB driver if required).


Roadmap


  • Migrate to Spring Boot, Groovy, Gradle
  • Add further bank transaction importers
  • Add alerting framework



Screenshots

(I loaded in a generated transaction history - which is why some of the charts look a little weird)




Using OpenNLP for Named-Entity-Recognition in Scala

A common challenge in Natural Language Processing (NLP) is Named Entity Recognition (NER) - this is the process of extracting specific pieces of data from a body of text, commonly people, places and organisations (for example trying to extract the name of all people mentioned in a wikipedia article). NER is a problem that has been tackled many times over the evolution of NLP, from dictionary based, to rule based, to statistical models and more recently using Neural Nets to solve the problem.

Whilst there have been recent attempts to crack the problem without it, the crux of the issue is really that for the approach to learn it needs a large corpus of marked up training data (there are some marked up corpora available, but the problem is still quite domain specific, so training on the WSJ data might not perform particularly well against your domain specific data) and finding a set of 100,000 marked up sentences is no easy feat. There are some approaches that can be used to tackle this by generating training data - but it can be hard to generate truly representative data, so this approach always risks over-fitting to the generated data.

Having previously looked at Stanford's NLP library for some sentiment analysis, this time I am looking at using the OpenNLP library. Stanford's library is often referred to as the benchmark for several NLP problems; however, these benchmarks are always against the data it is trained for - so out of the box we likely won't get amazing results against a custom dataset. Further to this, the Stanford library is licensed under the GPL, which makes it harder to use in any kind of commercial/startup setting. The OpenNLP library has been around for several years, and one of its strengths is its API - it's pretty well documented, easy to get up and running with, and very extendable.


Training a custom NER

Once again, for this exercise we are going back to the BBC recipe archive for the source data - we are going to try and train an OpenNLP model that can identify ingredients.

To train the model we need some example sentences - they recommend at least 15,000 marked up sentences to train a model - so for this I annotated a bunch of the recipe steps and ended up with somewhere in the region of about 45,000 sentences.
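The training data uses OpenNLP's span markup - one sentence per line, with each entity wrapped in START/END tags for the entity type (the sentences below are made-up examples):

Melt the <START:ingredient> butter <END> in a pan and add the <START:ingredient> onion <END> .
Stir in the <START:ingredient> sugar <END> until it has dissolved .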

As you can see in the above example, the marked up sentences are quite straightforward. We just wrap the ingredient in the tags as above (although note that if the word itself isn't padded by a space on either side inside the tags, it will fail!).

Once we have our training data, we can just easily setup some code to feed it in and train our model:
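Something like the following does the job (a sketch against the OpenNLP 1.6+ API - file names and training parameters are illustrative):

import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.charset.StandardCharsets

import opennlp.tools.namefind.{NameFinderME, NameSampleDataStream, TokenNameFinderFactory, TokenNameFinderModel}
import opennlp.tools.util.{MarkableFileInputStreamFactory, PlainTextByLineStream, TrainingParameters}

// stream the marked up sentences in, line by line
val lineStream = new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("ingredient-training.txt")), StandardCharsets.UTF_8)
val sampleStream = new NameSampleDataStream(lineStream)

// train a maximum entropy model for our "ingredient" entity type
val model: TokenNameFinderModel = NameFinderME.train("en", "ingredient", sampleStream, TrainingParameters.defaultParams(), new TokenNameFinderFactory())

// write the model to disk for later use
val out = new BufferedOutputStream(new FileOutputStream("ingredient-model.bin"))
model.serialize(out)
out.close()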

This is a very simple example of how you can do it, not always paying attention to engineering best practices, but you get the idea of what's going on. We get an input stream of our training data set, then we instantiate the Maximum Entropy name finder class and ask it to train a model, which we can then write to disk for future use.

When we want to use the model, we can simply load it back into the OpenNLP Name Finder class and use that to parse the input text we want to check:
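Again a sketch (the tokenizer here is the simplest built-in one - in practice you should tokenize the same way as your training data):

import java.io.FileInputStream

import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
import opennlp.tools.tokenize.SimpleTokenizer

val model = new TokenNameFinderModel(new FileInputStream("ingredient-model.bin"))
val nameFinder = new NameFinderME(model)

val tokens = SimpleTokenizer.INSTANCE.tokenize("Melt the butter and stir in the rosemary and sugar")
val spans = nameFinder.find(tokens)

spans.foreach(span => println(tokens.slice(span.getStart, span.getEnd).mkString(" ")))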

So, once I had created some training data in the required format and trained a model, I wanted to see how well it had actually worked. Obviously I didn't want to run it against one of the original recipes, as they were used to train the model, so I selected this recipe for rosemary-caramel millionaire shortbread to see how it performed; here are the ingredients it found:

  • butter
  • sugar
  • rosemary
  • caramel
  • shortbread


All in all, pretty good - it missed some ingredients, but given the training data was created in about 20 minutes just manipulating the original recipe set with some groovy, that's to be expected really, but it did well in not returning false positives.


In conclusion, if you have a decent training set, or have the means to generate some data with a decent range, you can get some pretty good results using the library. As usual, the code for the project is on GitHub (although it is little more than the code shown in this post).

Unsupervised Learning in Scala using Word2Vec

A pretty cool thing that has come out of recent Machine Learning advancements is the idea of "Word Embedding", specifically the advancements in the field made by Tomas Mikolov and his team at Google with the Word2Vec approach. Word Embedding is a language modelling approach that involves mapping words to vectors of numbers - If you imagine we are modelling every word in a given body of text to an N-dimension vector (it might be easier to visualise this as 2-dimensions - so each word is a pair of co-ordinates that can be plot on a graph), then that could be useful in plotting words and starting to understand relationships between words given their proximity. What's more, if we could map words to sets of numbers, then we could start thinking about interesting arithmetic that we could perform on the words.

Sounds cool, right? Now of course, the tricky bit is how can you convert a word to a vector of numbers in such a way that it encapsulates the details behind this relationship? And how can we do it without painstaking manual work and trying to somehow indicate semantic relationships and meaning in the words?


Unsupervised Learning

Word2Vec relies on neural networks and trains on a large, un-labelled piece of text in a technique known as "unsupervised" learning.

Contrary to the last neural network I discussed which was a "supervised" exercise (e.g. for every input record we had the expected output/answer), Word2Vec uses a completely "unsupervised" approach - in other words, the neural network simply takes a massive block of text with no markup or labels (broken into sentences or lines usually) and then uses that to train itself.

This kind of unsupervised learning can seem a little unbelievable at first - getting your head around the idea that a network could train itself without even knowing the "answers" seemed a little strange to me the first time I heard the concept, especially as a fundamental requirement for a NN to converge on an optimum solution is a "cost function" (e.g. something we can use after each feed-forward step to tell us how right we are, and whether our NN is heading in the right direction).

But really, if we think back to the literal biological comparison with the brain, as people we learn through this unsupervised approach all the time - it's basically trial-and-error.


It's child's play

Imagine a toddler attempting to learn to use a smart phone or tablet: they likely don't get shown explicitly to press an icon, or to swipe to unlock, but they might try combinations of power buttons, volume controls and swiping and seeing what happens (and if it does what they are ultimately trying to do), and they get feedback from the device - not direct feedback about what the correct gesture is, or how wrong they were, just the feedback that it doesn't do what they want - and if you have ever lived with a toddler who has got to grips with touchscreens, you may have noticed that when they then experience a TV or laptop, they instinctively attempt to touch or swipe the things on the screen that they want (in NN terms this would be known as "over fitting" - they have trained on too specific a set of data, so are poor at generalising - luckily, the introduction of a non-touch screen such as a TV expands their training set and they continue to improve their NN, getting better at generalising!)

So, this is basically how Word2Vec works. Which is pretty amazing if you think about it (well, I think it's neat).


Word2Vec approaches

So how does this apply to Word2Vec? Well just like a smartphone gives implicit, in-direct feedback to a toddler, so the input data can provide feedback to itself. There are broadly two techniques when training the network:

Continuous Bag of Words (CBOW)

So, our NN has a large body of text broken up into sentences/lines - and just like in our last NN example, we take the first row from the training set, but we don't just take the whole sentence to push into the NN (after all, the sentence will be variable length, which would confuse our input neurons), instead we take a set number of words - referred to as the "window size", let's say 5, and feed those into the network. In this approach, the goal is for the NN to try and correctly guess the middle word in that window - that is, given a phrase of 5 words, the NN attempts to guess the word at position 3.

[It was ___ of those] days, not much to do

So it's unsupervised learning, as we haven't had to go through any data and label things, or do any additional pre-processing - we can simply feed in any large body of text and it can just try to guess the words given their context.

Skip-gram

The Skip-gram approach is similar, but the inverse - that is, given the word at position n, it attempts to guess the words at position n-2, n-1, n+1, n+2.

[__ ___ one __ _____] days, not much to do

The network is trying to work out which word(s) are missing, and just looks to the data itself to see if it can guess it correctly.


Word2Vec with DeepLearning4J

So one popular deep-learning & word2vec implementation on the JVM is DeepLearning4J. It is pretty simple to get up and running with and to get a feel for what is going on, and it is pretty well documented (along with some good high-level overviews of some core topics). You can get up and running, playing with the library and some example datasets, pretty quickly by following their guide. Their NN setup is equally simple and worth playing with; their MNIST hello-world tutorial lets you get up and running with that dataset pretty quickly.

Food2Vec

A little while ago, I wrote a web crawler for the BBC food recipe archive, so I happened to have several thousand recipes sitting around and thought it might be fun to feed those recipes into Word2Vec to see if it could give any interesting results or if it was any good at recommending food pairings based on the semantic features the Word2Vec NN extracts from the data.

The first thing I tried was just using the ingredient list as a sentence - hoping that it would be better for extracting the relationship between ingredients, with each complete list of ingredients being input as a sentence.  My hope was that if I queried the trained model for X is to Beef, as Rosemary is to Lamb, I would start to get some interesting results - or at least be able to enter an ingredient and get similar ingredients to help identify possible substitutions.

As you can see, it has managed to extract some meaning from the data - for both pork and lamb, the nearest words do seem to be related to the target word, but not so much that could really be useful. Although this in itself is pretty exciting - it has taken an un-labelled body of text and has been able to learn some pretty accurate relationships between words.

Actually, on reflection, a list of ingredients isn't actually that great an input, as it isn't a natural structure and there is no natural ordering of the words - a lot of meaning is captured in the phrases rather than just lists of words.

So next up, I used the instructions for the recipes - each step in the recipe became a sentence for input, and minimal cleanup was needed, however, with some basic tweaking (it's fairly possible that if I played more with the Word2Vec configuration I could have got some improved results) the results weren't really that much better, and for the same lamb & pork search this was the output:

Again, it's still impressive to see that some meaning has been found from these words, but is it better than the raw ingredient list? I think not - the pork one seems wrong, as it seems to have very much aligned pork with poultry (although maybe that is some meaningful insight that conventional wisdom just hasn't taught us yet!?)

Arithmetic

Whilst this is pretty cool, there is further fun that can be had - in the form of simple arithmetic. A simple, often quoted example, is the case of countries and their capital cities - well trained Word2Vec models have countries and their capital cities equal distances apart:

(graph taken from DeepLearning4J Word2Vec intro)

So could we extract similar relationships between food stuffs?  The short answer, with the models trained so far, was kind of..

Word2Vec supports the idea of positive and negative matches when looking for nearest words - that allows you to find these kind of relationships. So what we are looking for is something like "X is to Lamb, as thigh is to chicken" (e.g. hopefully this should find a part of the lamb), and hopefully use this to extract further information about ingredient relationships that could be useful in thinking about food.
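With DL4J this is expressed as positive and negative word lists against the trained model - something like this (a sketch, given a trained Word2Vec model called vec, as set up later in this post):

import scala.collection.JavaConverters._

// "X is to lamb as thigh is to chicken" - lamb & thigh are positive, chicken is negative
val matches = vec.wordsNearest(List("lamb", "thigh").asJava, List("chicken").asJava, 5)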

So, I ran that arithmetic against my two models.
The instructions based model returned the following output:

Which is a pretty good effort - I think if I had to name a lamb equivalent of chicken thigh, a lamb shank is probably what I would have gone for (top of the leg, both pieces of slow twitch muscle and both the more game-y, flavourful pieces of the animal - I will stop as we are getting into food-nerd territory).

I also ran the same query on the ingredients based set (which remember, ran better on the basic nearest words test):

Which, interestingly, doesn't seem as good. It has the shin, which isn't too bad insofar as it's the leg of the animal, but not quite as good a match as the previous one.


Let us play

Once you have the input data, Word2Vec is super easy to get up and running. As always, the code is on GitHub if you want to see the build stuff (I did have to fudge some dependencies and exclude some stuff to get it running on Ubuntu - you may get errors relating to javacpp or jnind4j not being available - but the build file has the required workarounds in place to get that running), but the interesting bit is as follows:
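(A sketch of the Word2Vec setup - the file name, stop word list and parameter values are illustrative, and the builder methods are the ones used in the standard DL4J examples:)

import java.util.Arrays

import org.deeplearning4j.models.word2vec.Word2Vec
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory

val iterator = new BasicLineIterator("recipe-instructions.txt")

val tokenizerFactory = new DefaultTokenizerFactory()
tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor())

val vec = new Word2Vec.Builder()
  .stopWords(Arrays.asList("tsp", "tbsp", "g", "ml", "oz"))
  .minWordFrequency(5)
  .iterations(1)
  .layerSize(300)
  .seed(42)
  .windowSize(5)
  .iterate(iterator)
  .tokenizerFactory(tokenizerFactory)
  .build()

vec.fit()
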
If we run through what we are setting up here:

  1. Stop words - these are words we know we want to ignore -  I originally ruled these out as I didn't want measurements of ingredients to take too much meaning. 
  2. Line iterator and tokenizer - these are just core DL4J classes that will take care of processing the text line by line, word by word. This makes things much easier for us, so we don't have to worry about that stuff
  3. Min word frequency - this is the threshold for words to be interesting to us - if a word appears less than this number of times in the text then we don't include the mapping (as we aren't confident we have a strong enough signal for it)
  4. Iterations - how many training cycles are we going to loop for
  5. Layer size - this is the size of the vector that we will produce for each word - in this case we are saying we want to map each word to a 300 dimension vector, you can consider each vector a "feature" of the word that is being learnt, this is a part of the network that will really need to be tuned to each specific problem
  6. Seed - this is just used to "seed" the random numbers used in the network setup, setting this helps us get more repeatable results
  7. Window size - this is the number of words to use as input to our NN each time - relates to the CBOW/Skip-gram approaches described above.

And that's all you need to really get your first Word2Vec model up and running! So find some interesting data, load it in and start seeing what interesting stuff you can find.

So go have fun - try and find some interesting data sets of text stuff you can feed in and what you can work out about the relationships - and feel free to comment here with anything interesting you find.