Showing posts with label code. Show all posts

An overview: How to implement Google's Structured Data Schemas

As a result of working in web products (both personal and professional) I have become quite au fait with SEO from a technical point of view, and purely technical things you can do to put you in good standing - all of which really, are just web best practices to be a good web citizen - things that we should be doing to make the web better (proper use of HTTP headers, mobile friendly designs, good performance for people on slower connections etc). 

One thing that is a bit above and beyond that is the use of Google's Structured Data. I will talk about what it is and what it does below, but if you are dynamically loading webpages (e.g. your website isn't just HTML on a web server, but either served from a server-side application, or is an API driven JS application), then you are most likely well placed to easily and immediately start implementing it.


1. What is Structured Data?

Google has defined a set of schemas regarding Structured Data on websites. This is a schema that allows better definition of key data points from a given website. It's a sensible move by Google and is a natural progression for their search engine.

Think about it, there are millions of websites out there made up with sprawling HTML code and content - whilst HTML is a standard (more or less!) across the web, there are probably millions of different ways people use it. It would be nice if everyone used the H1 etc heading tags consistently, or if everyone used the <em> tag the same way (emphasis vs italics), but the reality is they don't - some sites will be using HTML as intended but many many more undoubtedly just rely on <span> or <div> tags combined with CSS classes to re-define every element they might need. 

This is all fine, Google is smart enough to pull out the content for indexing - yes, if you use span elements with custom styling for headings on your website rather than the H1+ tags then Google will penalize you, but it won't stop Google reading and indexing the site. Whats more, its getting smarter all the time -  I'd probably back Google to be able to pull out relevant clips or question/answers directly in a fairly reliable way. However, they are Google and much like the Microsoft/IE of the 90s, they have the dominant market share so they can define their own standards for the web if they want to. That's exactly what Structured Data is. 

It's Google saying: 

Hey, if you provide some data in your code that looks like this then we will read that, so we don't have to guess or work out stuff. Or we can just keep trying to work it out from your html content.. it's your call


As mentioned earlier, if you have control over the source code and the data on your website, then this is really powerful. You can define specific (keyword heavy) snippets of content and explicitly tell Google about them - what's more the Schema for Structured Data lets you define content such as FAQ or how-to directions - so you can go beyond keyword heavy snippets and actually create FAQ for questions that you have identified from Google search trends (or whatever SEO tools you might use).  

Hopefully you get the picture by now - this is a pretty powerful tool.


2. Schema tags: FAQ and HowTo

Two specific parts of the Structured Data Schema stood out to me as quite universally useful for websites, and also produce decent tangible results: FAQ schema and HowTo schema.

  • FAQ schema allows high level Q&A points to be provided in the metadata for Google - generally useful as most sites will have some element of their pages that could be presented as FAQ

  • HowTo schema allows step-by-step how to guides - less widely applicable, but if you have any website that provides how-to guides or anything with instructions this is applicable.

What exactly to these tags do and why do we care? Well, as well as trying to win favour with the all-seeing google search bot, if it gets picked up it also means we get more search real estate and increased accessibility to our content which should also increase the chance of click through conversion.

If you have ever seen search results like this:


These points are being scraped from the site by Google from their schema tags - if you can automate the creation and inclusion of these for your site then you can be in a pretty good position to improve your SEO (still relatively low number of sites implementing these schemas).


3. Automating your schema tags

As I have mentioned a couple of times, and as you have hopefully realised - if you have a dynamic website, you are likely already taking structured data (from your database for example, so reliably structured!) and building HTML pages (either server-side, or data as JSON to a javascript app for page creation client side) - but either way, we are starting off with structured data, and the Google Structured Data wants.. you got it, Structured Data! So if you have the content, it really is a generic, simple transform between structured data formats.

Below is some example code - It is based on jekyll, as thats what my most recent personal project has been, but it's also pretty close to pseudocode, so hopefully you can easily translate it to whatever tech you use:

As you can see, its a simple JSON based data structure and you just fill in the relevant parts of the object with your data. 

You can see the full code in my Jekyll food website over on github - likewise, you can see the end result in action on the website (hosted by github pages, of course) too - the project was a food-science site - covering science and recipes, so a perfect match for FAQ (science pages) and HowTo (recipe pages) - for example if you go to a chilli recipe page which naturally has structured data for step-by-step instructions, and view the page source, you will see the JSON schema at the top of the page, using the HowTo schema elements laying out resources required and then the steps required. Likewise on the Science of humidity in cooking in the page source you will see the JSON schema with FAQ:









4. Conclusion

If you have control of the data being loaded on to the page (a custom site - something like WordPress or another off the shelf platform might make this harder, but at the same time there are undoubtedly lots of plugins that already handle this for you on your given platform which make it even easier!), and are even vaguely interested in increasing organic traffic and search ranking, then I'd recommend implementing this. It's very easy to add (it's also primarily behind the scenes tech - it doesn't impact the visuals or design of the pages its added to) and can only be beneficial.

As with anything trying to improve SEO, its always hard to measure and get concrete details on, but its relatively low cost to add, minimal ongoing maintenance so I think its worth a shot.


If you have had experiences with Google's Structured data, good or bad, then I'd love to hear about it!

Generic Programming with Scala & Shapeless part 2

Last year I spent some time playing with, and writing about, Scala & Shapeless - walking through the simple example of generating random test data for a case class.

Recently, I have played some more with Shapeless, this time with the goal of generating React (javascript) components for case classes. It was a very similar exercise, but this time I made use of the LabelledGeneric object, so I could access the field names - so I thought I'd re-visit here and talk a bit about some of the internals of what is going on.


Getting started

As before, I had to define implicits for the simple types I wanted to be able to handle, and the starting point is of course accepting a case class as input.

So there are a few interesting things going on here:

First of all, the method is parameterised with two types: caseClassToGenerator[A, Repr <: HList] A is simply going to be our case class type, and Repr is going to be a Shapeless HList.

Next up, we are expecting several implict method arguments (we will ignore the third implicit for now, that is an implicit that I am using purely for the react side of things - this can be skipped if the method handles everything itself):

implicit generic: LabelledGeneric.Aux[A, Repr], gen: ComponentGenerator[Repr],

Now, as this method's purpose is to handle the input of a case class, and as we are using shapeless, we want to make sure that from that starting input we can transform it into a HList so we can then start dealing with the fields one by one (in other words, this is the first step in converting a case class to a generic list that we can then handle element by element).  In this setting, the second implicit argument is asking the compiler to check that we have also defined an appropriate ComponentGenerator (this is my custom type for generating React components) that can handle the generic HList representation (its no good being able to convert the case class to its generic representation, if we then have no means to actually process a generic HList).

Straight forward so far?

The first implicit argument is a bit more interesting. Functionally, all LabelledGeneric.Aux[ARepr] is doing is asking the compiler to make sure we have an implicit LabelledGeneric instance that can handle converting between our parameter A (the case class input type) and Repr (the HList representation). This implicit means that if we try to pass some type A to this method, the compiler will check that we have a shapeless LabelledGeneric that can handle it - if not, we will get a compile error.

But things get more interesting if we look more at what the .Aux is doing!


Path dependent types & the Aux pattern

The best way to work out what is going on is to just jump into the Shapeless code and have a dig. I will proceed to use Generic as an example, as its a simpler case, but its the same for LabelledGeneric:

That's a lot simpler than I expected to find, to be honest, but as you can see from the above example, there are two types involved, there is the trait parameter T and the inner type Repr, and the Generic trait is just concerned with converting between these two types.

The inner type, Repr, is what is called a path dependent type, in scala. That is, the type is dependent on the actual instance of the enclosing trait or class. This is a powerful mechanism in Scala (but one that can also catch you out, if you are in the habit of defining classes etc within other classes or traits). This is an important detail for our Generic here, as it could be given any parameter T, so the corresponding HList could be anything, but this makes sure it must match the given case class T - that is, the Repr is dependent on what T is.

To try and get our head around it, let's take a look at an example:

Cool, so as we expected, we can see that in our Generic example, we can also see that the type Repr has been defined matching the HList representation of our case class. It makes sense that we want the transformed output HList to have its own, specific type (based on whatever input it was transforming), but it would be a real pain to have to actually define that as a type parameter in the class along with our case class type, so it uses this path-dependent type approach.

So, we still haven't got any closer to what this Aux method is doing, so let's dig into that..


The Aux Pattern

We can see from our code that the Aux method is taking two parameters, firstly A - which is the parameter that our Generic will take, but the Aux method also takes the parameter Repr - which we know (or at least pretend we can guess) corresponds to the path dependent type that is defined nested inside the Generic trait.

The best way to work out what is going on from is to take a look at the shapeless code!

As we can see, the Aux type (this is defined within the Generic object) is just an alias for a Generic[T], where the inner path-dependent type is defined as Repr - they have a pretty decent explanation of what is going on in the comments, so I will reproduce that here:

(that is abbreviated for the more relevant bits - they have even more detail in the comments that can be read).

That pretty nicely sums up the Aux pattern - the pattern allows us to essentially promote the result of a type-level computation to the higher level parameter. It can be used for a variety of things where we want to reason about the path dependent types, besides this, but this is a common use for the pattern.



So thats all I wanted to get into for now - you can see the code here, and hopefully with this overview, and the earlier shapeless overview, you can get an understanding of what the LabelledGeneric stuff is doing and how Shapeless is helping me generate React components.

Generic programming in Scala with Shapeless

In my last post about evolutionary computing, I mentioned I started building the project partly just so I could have a play with Shapeless. Shapeless is a generic programming library for Scala, which is growing in popularity - but can be fairly complicated to get started with.

I had used Shapeless from a distance up until this point - using libraries like Circe or Spray-Json-Shapeless that use Shapeless under the hood to do stuff like JSON de/serialisation without boilerplate overhead of having to define all the different case classes/sealed traits that we want to serialise.

You can read the previous post for more details (and the code) if you want to understand exactly what the goal was, but to simplify, the first thing I needed to achieve was for a given case class, generate a brand new instance.

Now I wanted the client of the library to be able to pass in any case class they so wished, so I needed to be able to generically inspect all attributes of any given case class* and then decide how to generate a new instance for it.

Thinking about this problem in traditional JVM terms, you may be thinking you could use reflection - which is certainly one option, although some people still have a strong dislike for reflection in the JVM, and also has the downside that its not type safe - that is, if someone passes in some attribute that you don't want to support (a DateTime, UUID, etc) then it would have to be handled at runtime (which also means more exception handling code, as well as loosing the compile time safety). Likewise, you could just ask clients of the code to pass in a Map of the attributes, which dodges the need for reflection but still has no type safety either.



Enter Shapeless

Shapeless allows us to represent case classes and sealed traits in a generic structure, that can be easily inspected, transformed and put into new classes.

A case class, product and HList walk into a bar

A case class can be thought of as a product (in the functional programming, algebraic data type sense), that is, case class Person(name: String, age: Int, living: Boolean) is the product of name AND age AND living - that is, every time you have an instance of that case class you will always have an instance of name AND age AND living. Great. So, as well as being a product, the case class in this example also carries semantic meaning in the type itself - that is, for this instance as well as having these three attributes we also know an additional piece of information that this is a Person. This is of course super helpful, and kinda central to the idea of a type system! But maybe sometimes, we want to be able to just generically (but still type safe!) say I have some code that doesn't care about the semantics of what something is, but if it has three attributes of String, Int, Boolean, then I can handle it - and that's what Shapeless provides - It provides a structure called a HList (a Heterogeneous List) that allows you to define a type as a list of different types, but in a compile time checked way.

Rather that a standard List in Scala (or Java) where by we define the type that the list will contain, with a HList, we can define the list as specific types, for example:
The above example allows us to define out HList with specific types, which provides our compile time safety - the :: notation is some syntactical sugar provided by Shapeless to mirror normal Scala list behaviour, and HNil is the type for an empty HList.

Case class to HList and back again

Ok, cool - we have a way to explicitly define very specific heterogeneous lists of types - how can this help us? We might not want to be explicitly defining these lists everywhere. Once again Shapeless steps in and provides a type class called Generic.

The Generic type class allows conversion of a case class to a HList, and back again, for example:
Through compile time macros, Shapeless provides us these Generic type classes for any case class. We can then use these Generic classes to convert to and from case classes and HLists:



Back to the problem at hand

So, we now have the means to convert any given case class to a HList (that can be inspected, traversed etc) - which is great, as this means we can allow our client to pass in any case class and we just need to be able to handle the generation of/from a HList.

To start with, we will define our type class for generation:

Now, in the companion object we will define a helper method, that can be called simply without needing anything passed into scope, and the necessary Generator instance will be provided implicitly (and if there aren't any appropriate implicits, then we get our compile time error, which is our desired behaviour)

As you can see, we can just call the helper method with an appropriate type argument and as long as there is an implicit Generator in scope for that type, then it will invoke that generate method and *hopefully* generate a new instance of that type.

Next, we need to define some specific generators, for now, I will just add the implicits to support some basic types:
All simple so far - if we call Generator.generate[Int] then it gets the implicit Generator[Int] and generates a random int (not that useful, perhaps, but it works!)

Now comes the part where our generate method can handle a case class being passed in, and subsequently the Shapeless HList representations of a case class (we will recursively traverse a HList structure using appropriate implict generators). First up, lets look at how we can implicitly handle a case class being passed in - at first this may start to look daunting, but its really not that bad.. honestly:
So, what is going on there? The implicit is bound by two types: [T, L <: HList] - which is then used in the implicit argument passed into the method implicit generic: Generic.Aux[T, L] and argument lGen: Generator[L] - this matches for all types T that there exists in scope a Shapeless Generic instance that can convert it into some HList L, and for which L an implicit Generator exists.  Now we know from our earlier look at Shapeless that it will provide a Generic instance for any case class (and case classes will be converted into a HList), so this implicit will be used in the scenario where by we pass in a case class instance, as long as we have an implicit Generator in scope that can handle a HList: So lets add that next!

Let's start with the base-case, as we saw earlier, HNil is the type for an empty HList so lets start with that one:
Remember, the goal is from a given case class (and therefore a given HList) is to generate a new one, so for an empty HList, we just want to return the same.

Next we need to handle a generator for a non-empty HList:
So this case is also relatively simple, because our HList is essentially like a linked list, we will handle it recursively - and the type parameters [H, T <: HList] represent the type of the head of the HList (probably a simple type -  in which case it will try to resolve the implicit using our original Int/Boolean/Double generators) and the tail of the HList which will be another HList (possibly a HNil if we are at the end of the list, otherwise this will resolve recursively to this generator again). We grab the implicit Generator for the head and tail then simply call generate on both and that's really all there is to it! We have defined the implicit for the top level case class (which converts it to the generic HList and then calls generate on that), we have the implicit to recursively traverse the HList structure and we have finally the implicits to handle the simple types that might be in our case class.

Now we can simply define a case class, pass it to our helper method and it will generate a brand new instance for us:

I was using this approach to generate candidates for my evolutionary algorithm, however the exact same approach could be used easily to generate test data (or much more). I used the same approach documented here to later transform instances of case classes as well.


Here is the completed code put together:


Footnote

I have mentioned so far that Shapeless provides the means to transform case classes and sealed traits to generic structures, but have only really talked about case classes. The sealed trait is a slightly different shape to a case class (it is analogous to the algebraic data type co-product) and accordingly, Shapeless has a different structure to handle it, rather than a HList it has a type called Coproduct. The above code can be extended to also handle sealed traits, to do so you just need to add the implicit to handle the empty coproduct (CNil) and recursively handle the Coproduct.

Intro to Genetic Algorithms in Scala

This is intended to be quite a simple high level approach to using some of these ideas and techniques for searching solution spaces. As part of my dissertation back when I was in university, I used this approach to find optimal configurations for Neural Networks. More recently, I have applied this approach to finding good configurations for training an OpenNLP model, with pretty good results - which is really what made me think to write a little about the topic.

So what is it?

Genetic algorithms are a subset of evolutionary computing that borrow techniques from classic evolution theories to help find solutions to problems within a large search space.

So, what does that mean? It sounds a little like a bunch of academic rhetoric that make nothing any clearer, right? But actually, the concept is quite simple, and works like the evolutionary theory of "Survival of the fittest" - that is, you generate a large set of random possible solutions to the problem and see how they all perform at solving the problem, and from there, you "evolve" the solutions allowing the best solutions to progress and evolve, cutting out the worst performing solutions.

How it works

In practical terms, at a high level, it involves the following steps:
  1. Create an encoded representation for the problem solution - This might sound complicated, but this just means capture all the possible inputs or configuration for a solution to the problem.

  2. Randomly generate a set of possible solutions - that is, a set of configurations/inputs (this is called a "population")

  3. Asses the performance of these solutions - to do this, you need a concept of "fitness" or how well a solution performs - and rank the population

  4. Evolve the population of solutions - this can involve a range of techniques of varying complexity, often keeping the top n candidates, dropping the bottom m candidates and then evolving/mutating the rest of the new population

  5. Repeat the rank & evolve cycle (depending on how it performs)


To understand it lets first consider a simple, hypothetical numeric problem, let's say you have some function f(x, y, z) that returns an integer, and we want to find a combination of x, y and z that gives the a value closest to 10,000 (e.g. you want a cuboid with volume of 10,000 and you want to find dimensions for the width, depth and height that achieve this - like I said, this is just hypothetical). The first step would be to encode the solution, which is pretty simple - we have three inputs, so our candidate looks like this:

case class Candidate(x: Int, y: Int, z: Int)

We will also need a fitness function, that will asses how a candidate performs (really, this will just be evaluation of the inputs against our function f)

def fitness(candidate: Candidate): Int = Maths.abs(10000 - (x * y * z))


Now we have a representation of the solution, in generic terms, and we have a function that can measure how well they perform, we can randomly generate a population - the size of the population can be chosen as suits, the bigger the population the longer an evolution cycle will take, but of course gives you a wider gene pool for potential solutions which improves your chances of a better solution.

Once we have our initial population, we can evaluate them all against our fitness functions and rank them - from there, we can evolve!

Survival of the fit, only the strong survive

So how do we evolve our candidates? There are a few techniques, we will look at the simplest here:

  • Survival of the fittest - the simple and sensible approach that the top performing candidate in your population always survives as is (in the scenario where by we have stumbled upon the optimal solution in the first random generation, we want to make sure this is preserved)

  • Mutation - much like you might imagine genes get mutated in normal evolution, we can take a number of candidates and slightly adjust the values in our candidate - for numeric "genes" its easy - we can just adjust the number, a simple popular technique for doing this is Gaussian mutation, which ensures that in most cases the the value will be close to the original value, but there are chances that it could be a larger deviation (it's a bell curve distribution, whereby the peak of the bell curve is the original value).

  • Cross breeding - again, like normal reproduction, randomly select two parent candidates (probably from the top x% of the candidate pool) and take some attributes from each parent to make up a child

Between mutation and cross breeding you can create a new pool of candidates and continue the search for a solution.

Some code

Partly because I wanted to learn a bit more about the Shapeless library (more of that in another post) and partly for fun, I created a simple GitHub project with some code that has very simple evolutionary search machinery that allows you to try and use these techniques.

The shapeless stuff was to support generic encoding of our solution - so the library can easily be used to define a solution representation and a fitness function.

Following on from the example above to find three dimensions that have a volume of 10,000 I have the following configuration:
In the above, we first define the case class ThreeNumbers - this can be any case class you like, but must have Gene[A] as its attributes - these are custom types that I created that have defined generation and mutation methods, and have additional configuration (such as max and min values for numbers etc).  We then pass all the configuration in, along with a fitness function of the form (a: B => Double) and run it.

It varies on each run, as the initial pool of candidates is generated at random, but in most cases this example solves the problem (finds an error rate of zero) within 10 generations:



As mentioned, I have used this approach to train OpenNLP NER models - the different training data generation parameters and the training configuration was all encoded as a candidate class and then used this method to evolve to find the best training configuration for named entity recognition, but it can be applied to all sorts of problems!

Feel free to try it out and let me know if it works well for any problem domains - you can also checkout the code and make it more sophisticated in terms of evolution/mutation techniques and introducing a learning rate.