Unloading GitHub data with Xplenty

We have already shown a possible way to create ad-hoc reports on GitHub data, but if you regularly work with information about your repositories, it is worth building a package that collects this information in a regular relational database. At Bitrise we use Xplenty for ETL tasks, so let me guide you through building a package that processes this data.

Since many values in a GitHub API response are themselves URLs, we need a few steps to process all the data: get, flatten, and parse.

GET

Bitrise's GitHub repositories are organized into multiple organizations. We first collect all the repositories belonging to these organizations and then apply transformations to the results of the related endpoints.

A REST API source is needed to get the body from the /repos endpoint; the key settings are described below.

GitHub encourages us to refer explicitly to the API version by adding the application/vnd.github.v3+json value to the Accept header. The other header we need is Authorization: create an OAuth application on GitHub to avoid rate limiting, and add its token here as token YOUR_APP_TOKEN. It is even more elegant to store the token as a package variable; in that case the value should be token $YOUR_VARIABLE.
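As a sketch, the source settings amount to a request like the one below. YOUR_ORG stands for one of your organizations (GitHub's endpoint for listing an organization's repositories is /orgs/{org}/repos), and $YOUR_VARIABLE is the package variable holding your token:

```
GET https://api.github.com/orgs/YOUR_ORG/repos

Accept: application/vnd.github.v3+json
Authorization: token $YOUR_VARIABLE
```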

The $[*] JSON Expression tells Xplenty to process the unnamed array from the response; this array, although lacking a name, contains all the repositories.
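The original response from the API looks roughly like this abbreviated sketch (the values are illustrative, taken from GitHub's documentation example; the field names are real):

```json
[
  {
    "id": 1296269,
    "name": "Hello-World",
    "full_name": "octocat/Hello-World",
    "open_issues_count": 42,
    "pulls_url": "https://api.github.com/repos/octocat/Hello-World/pulls{/number}"
  }
]
```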

We named all the organizations whose repositories we are interested in and collected the results into one table with sequential Union steps; unfortunately, a Union step can only take two input tables at a time.

The response header contains links to the next pages, so we can tick Use pagination with the default settings.

Even more GET

We are often interested in more than the repositories themselves. Although this endpoint tells us the number of open issues, it does not tell us how many have been closed. To cope with such a problem, or to collect information on pull requests, we initiate new requests against the endpoints named in the repository data.

To process the JSON data, we should add the Select components described below.

The Curl function with a GET parameter is used to request data from an endpoint that is stored as a field value; for example, the pull requests of a specific repository:
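A minimal sketch, assuming a Curl(url, method, headers) signature (check Xplenty's function reference for the exact form) and a pulls_url field taken from the repository record; note that GitHub's pulls_url values end in a {/number} template that has to be stripped in an earlier Select:

```
Curl(pulls_url, 'GET', 'Authorization: token $YOUR_VARIABLE')
```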

This function stores the whole JSON response in a field of our table. In the next step, we have to extract the body from this JSON. If we named the field storing the JSON response pulls, the function extracting the array of pull requests is the one below.
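A sketch of that expression, assuming Xplenty's JsonExtract(json, path) function is available:

```
JsonExtract(pulls#'body', '$[*]')
```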

Note that, just as we did with the repositories, we need to extract everything in the unnamed array from the body of the pulls field. The type of pulls is a map, a sequence of key-value pairs, and we can reference a map key with #.

Flatten

Flatten() generates a row for each element of a bag, which means the final function in this step combines two calls.

JsonStringToBag() is applied first, since at this point pulls is a JSON string.
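Assuming the array extracted in the previous step was stored back into the pulls field, a sketch of the whole expression:

```
Flatten(JsonStringToBag(pulls))
```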

The result of this step is a new field in which each record contains one pull request. Let's call the new field pull.

Parse

There is only one step remaining: parsing the JSON key-value pairs into new fields and values. Each new field needs (see the sketch after this list):

  1. pull converted to a JSON string
  2. the JSON string converted to a map
  3. a reference to the field key
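For a pull request's title, for example, the expression could look like the sketch below; ToJson here is an assumed name for the conversion in step 1 (check Xplenty's function reference for the exact one), while JsonStringToMap and the # map-key reference were used above:

```
JsonStringToMap(ToJson(pull))#'title'
```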

The only thing you still need is a destination table in the relational database you favour. Do not forget to add Post-action SQL commands that create indexes on your new table.
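As an example, a hypothetical post-action command indexing pull requests by repository could be:

```sql
-- Hypothetical table and column names; adjust them to your destination schema.
CREATE INDEX idx_pulls_repo ON github_pulls (repo_full_name);
```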
