Unloading GitHub data with Xplenty

We have already shown a possible way to create ad-hoc reports on GitHub data, but if you regularly work with information about your repositories, it is worth building a package that collects this information in a regular relational database. At Bitrise we use Xplenty for ETL tasks, so let me guide you through building a package that processes this data.

Since many values in a GitHub API response are themselves URLs, we need a few steps to process all the data: get, flatten, and parse.

GET

Bitrise's GitHub repositories are organized into multiple organizations. We first collect all the repositories belonging to these organizations and then apply transformations to the results of the related endpoints.

A REST API source is needed to get the body from the /repos endpoint; the key settings are described below.

GitHub encourages us to refer explicitly to the API version by adding the application/vnd.github.v3+json value to the Accept header. The other header we need is Authorization: create an OAuth application on GitHub to avoid rate limiting, and add its token here as token YOUR_APP_TOKEN. It is even more elegant to store the token as a package variable; in that case the value should be token $YOUR_VARIABLE.
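As a sketch, the source settings amount to a request like the one below. YOUR_ORG stands for one of your organizations (GitHub's endpoint for listing an organization's repositories is /orgs/{org}/repos), and $YOUR_VARIABLE is the package variable holding your token:

```
GET https://api.github.com/orgs/YOUR_ORG/repos

Accept: application/vnd.github.v3+json
Authorization: token $YOUR_VARIABLE
```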

The $[*] JSON Expression tells Xplenty to process the unnamed array from the response; this array, although lacking a name, contains all the repositories.
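The original response from the API looks roughly like this abbreviated sketch (the values are illustrative, taken from GitHub's documentation example; the field names are real):

```json
[
  {
    "id": 1296269,
    "name": "Hello-World",
    "full_name": "octocat/Hello-World",
    "open_issues_count": 42,
    "pulls_url": "https://api.github.com/repos/octocat/Hello-World/pulls{/number}"
  }
]
```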

We named all the organizations whose repositories we are interested in and collected the results into one table with sequential Union steps; unfortunately, a Union step can only take two input tables at a time.

The response header contains links to the next pages, so we can tick Use pagination with the default settings.

Even more GET

We are often interested in more than the repositories themselves. Although this endpoint tells us the number of open issues, it does not tell us how many have been closed. To cope with such a problem, or to collect information on pull requests, we initiate new requests against the endpoints named in the repository data.

To process the JSON data, we should add the Select components described below.

The Curl function with a GET parameter is used to request data from an endpoint that is stored as a field value; for example, the pull requests of a specific repository:
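A minimal sketch, assuming a Curl(url, method, headers) signature (check Xplenty's function reference for the exact form) and a pulls_url field taken from the repository record; note that GitHub's pulls_url values end in a {/number} template that has to be stripped in an earlier Select:

```
Curl(pulls_url, 'GET', 'Authorization: token $YOUR_VARIABLE')
```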

This function stores the whole JSON response in a field of our table. In the next step, we have to extract the body from this JSON. If we named the field storing the JSON response pulls, the function extracting the array of pull requests is the one below.
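A sketch of that expression, assuming Xplenty's JsonExtract(json, path) function is available:

```
JsonExtract(pulls#'body', '$[*]')
```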

Note that, just as we did with the repositories, we need to extract everything in the unnamed array from the body of the pulls field. The type of pulls is a map, a sequence of key-value pairs, and we can reference a map key with #.

Flatten

Flatten() generates a row for each element of a bag, which means the final function in this step combines two calls.

JsonStringToBag() is applied first, since at this point pulls is a JSON string.
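Assuming the array extracted in the previous step was stored back into the pulls field, a sketch of the whole expression:

```
Flatten(JsonStringToBag(pulls))
```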

The result of this step is a new field in which each record contains one pull request. Let's call the new field pull.

Parse

There is only one step remaining: parsing the JSON key-value pairs into new fields and values. Each new field needs (see the sketch after this list):

  1. pull converted to a JSON string
  2. the JSON string converted to a map
  3. a reference to the field key
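For a pull request's title, for example, the expression could look like the sketch below; ToJson here is an assumed name for the conversion in step 1 (check Xplenty's function reference for the exact one), while JsonStringToMap and the # map-key reference were used above:

```
JsonStringToMap(ToJson(pull))#'title'
```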

The only thing you still need is a destination table in the relational database you favour. Do not forget to add Post-action SQL commands that create indexes on your new table.
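As an example, a hypothetical post-action command indexing pull requests by repository could be:

```sql
-- Hypothetical table and column names; adjust them to your destination schema.
CREATE INDEX idx_pulls_repo ON github_pulls (repo_full_name);
```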
