In this article, we'll demonstrate how to process complex data types stored on S3 in PowerShell. In our case, these are Bitrise configuration files, but it could be any stored information.
Every app on Bitrise is described by a specific YAML configuration: bitrise.yml. YAML's purpose is to be easily readable. For a developer, it is not that important to save the configuration file each time a build runs, but for us it holds valuable information for quality assurance and for monitoring the performance of steps or stacks.
Structure of bitrise.yml
The schema of the YAML is something like below:
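The original sample is not shown here, so below is an illustrative sketch of the layout (not a complete schema); the workflow and step names used are made up:

```yaml
# Simplified sketch of a bitrise.yml
format_version: "4"
workflows:
  build-alpha:            # the workflow name sits in key position
    before_run:
    - prepare
    after_run:
    - notify
    steps:
    - script@1.1.5:       # the step name is also a key
        inputs:
        - content: echo "building"
  prepare:
    steps:
    - git-clone@4.0.14: {}
```

Note how the interesting names (workflows, steps) appear as keys rather than values; this is exactly what makes programmatic processing awkward.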
As you can see, Bitrise YAMLs are designed for easy editing, not for easy processing. This sets a couple of barriers in front of data crunching:
- Workflow and step names are in key position, so each workflow name must be captured before the key can be referenced. A true data representation would be a key-value pair for each attribute, naming the attribute and defining its value:
`workflow: build-alpha`
- On the other hand, some attributes are named with a generic key, like steps or after_run.
- One workflow is triggered, but we should also take the workflows named in before_run and after_run into consideration if we want to understand which steps were actually run.
As a result of these barriers, we end up with a nested document where the nested objects are sometimes lists, sometimes hash tables, and in other cases custom objects.
AWS Tools for PowerShell
When a build is running, the complete related YAML file is saved as a snapshot of the application settings at the time of building. Fortunately, AWS has a comprehensive package to work with its cloud services from PowerShell.
If you are a Mac or Linux user, you can only use the cross-platform PowerShell Core, but even if you are new to PS on Windows, I encourage you to use this probably more future-proof version of the shell. The corresponding package from AWS is AWSPowerShell.NetCore; to install it, use:
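A minimal install from the PowerShell Gallery, scoped to the current user:

```powershell
# Install the AWS module for PowerShell Core
Install-Module -Name AWSPowerShell.NetCore -Scope CurrentUser
```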
To access AWS services we have to set our credentials.
Optionally, you can set the new profile as default by naming it default.
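A sketch of storing a named profile with the Set-AWSCredential cmdlet; the access keys and the profile name below are placeholders:

```powershell
# Store a named credential profile for later use
Set-AWSCredential -AccessKey 'AKIA...' -SecretKey 'wJal...' -StoreAs 'bitrise-reports'

# Naming the profile 'default' would make it the default instead:
# Set-AWSCredential -AccessKey 'AKIA...' -SecretKey 'wJal...' -StoreAs 'default'
```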
I was working on some historical reports, so I needed to download and parse a huge number of files. It seemed to be a good idea to process the monthly data in parallel.
Fortunately, it is not really complicated in PowerShell to start several tasks at once. Since we would like to perform the same task with different inputs, all we need is a single script block and a parameter for each month/job.
Looping over the array of months, we just use the piped object, $_, as the parameter of the script block, passing it as an argument to the job.
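A minimal sketch of this pattern with Start-Job; the month values are placeholders:

```powershell
$months = '2019-01', '2019-02', '2019-03'

$scriptBlock = {
    param($month)
    # ... per-month work goes here ...
    "Processed $month"
}

# Start one background job per month, passing the piped month ($_) as the argument
$months | ForEach-Object {
    Start-Job -ScriptBlock $scriptBlock -ArgumentList $_
}

# Wait for all jobs to finish and collect their output
Get-Job | Wait-Job | Receive-Job
```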
To download an object from S3, we need four pieces of information:
- the bucket,
- the key of the object (a kind of path to the specific YAML),
- the file to write the S3 object to, and
- the user profile, in case it is not the default we set.
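These map directly onto parameters of the Read-S3Object cmdlet in the AWS module; a sketch with placeholder bucket, key, file, and profile names:

```powershell
# Download one bitrise.yml snapshot from S3
Read-S3Object -BucketName 'build-snapshots' `
              -Key 'builds/2019-01/app-1234/bitrise.yml' `
              -File 'C:\data\2019-01\app-1234_bitrise.yml' `
              -ProfileName 'bitrise-reports'
```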
From our website, we have lists of repositories and builds for each month, which are used to construct the S3 object keys.
Combining what we have so far: the packages and credentials have to be loaded in each session, so they go into the script block.
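A sketch of the combined download script; the bucket, profile, paths, and the way the key list is obtained are all assumptions for illustration:

```powershell
# One job per month: each job session needs its own module import and credentials
$scriptBlock = {
    param($month)
    Import-Module AWSPowerShell.NetCore

    # Hypothetical per-month list of S3 object keys, e.g. built from the
    # repository/build lists exported from the website
    $keys = Get-Content "C:\data\keys-$month.txt"

    foreach ($key in $keys) {
        # Encode the metadata (the object key) into the local file name
        $file = "C:\data\$month\" + ($key -replace '/', '_')
        Read-S3Object -BucketName 'build-snapshots' -Key $key `
                      -File $file -ProfileName 'bitrise-reports'
    }
}

'2019-01', '2019-02', '2019-03' | ForEach-Object {
    Start-Job -ScriptBlock $scriptBlock -ArgumentList $_
}
Get-Job | Wait-Job | Receive-Job
```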
By running the code above, we collect all the builds we are interested in into per-month folders, with the metadata encoded in the file names. Now let's parse these files.
We can use the powershell-yaml package to deal with YAMLs. It works just as you would with JSON: read the content and pass it to the ConvertFrom-Yaml cmdlet:
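A minimal sketch; the file name is a placeholder:

```powershell
# One-time install of the YAML module
Install-Module -Name powershell-yaml -Scope CurrentUser

# Read the whole file as a single string (-Raw) and parse it
$yaml   = Get-Content -Path 'app-1234_bitrise.yml' -Raw
$config = ConvertFrom-Yaml $yaml
```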
As mentioned earlier, the nested objects read from the YAML have various types. workflows is not an array (as each workflow's name is used in key position) but a hash table. The GetEnumerator() method can unwrap it into an array of objects, so we can loop over the elements.
Attributes of a workflow are in the value of the workflow object.
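A sketch of enumerating the workflows, assuming $config holds the parsed YAML:

```powershell
foreach ($workflow in $config.workflows.GetEnumerator()) {
    $name       = $workflow.Key    # the workflow name was the hash table key
    $attributes = $workflow.Value  # steps, after_run, before_run, etc.
    "$name has $($attributes.steps.Count) steps"
}
```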
The same exercise we did with the workflows should be repeated on the steps.
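In the YAML, each step is a one-entry mapping (step id in key position, configuration as value), so the same enumerator trick applies; a sketch, assuming $workflow comes from the loop above:

```powershell
foreach ($step in $workflow.Value.steps) {
    # Each list element is a one-entry hash table: step id -> step config
    $entry      = $step.GetEnumerator() | Select-Object -First 1
    $stepId     = $entry.Key     # e.g. 'script@1.1.5'
    $stepConfig = $entry.Value   # inputs, title, etc.
}
```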
Collecting all this information into a collection of hash tables:
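A sketch of flattening everything into a collection of hash tables and writing it out as JSON; the attribute selection here is an assumption for illustration:

```powershell
$records = foreach ($workflow in $config.workflows.GetEnumerator()) {
    foreach ($step in $workflow.Value.steps) {
        $entry = $step.GetEnumerator() | Select-Object -First 1
        @{
            workflow  = $workflow.Key
            step      = $entry.Key
            after_run = $workflow.Value.after_run
        }
    }
}

# -Depth keeps nested values from being truncated to strings
$records | ConvertTo-Json -Depth 5 | Set-Content 'builds.json'
```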
Saved as JSON, this collection can be an input to any document database or to Neo4j.
You can find examples in the Bitrise CLI Tutorial; this file is based on the Complex Workflow lesson.