Parsing Complex Documents From S3 In PowerShell

In this article, we'll demonstrate how to process complex data types stored on S3 in PowerShell. In our case, these are Bitrise configuration files, but the approach works for any data stored there.

Every app on Bitrise is described by a specific YAML configuration: bitrise.yml. YAML is designed to be easily readable. For a developer, it is not that important to save the configuration file each time a build runs, but for us it holds valuable information for quality assurance and for monitoring the performance of steps and stacks.

Structure of bitrise.yml

The schema of the YAML looks something like this:


format_version: 1.3.1
default_step_lib_source: https://github.com/bitrise-io/bitrise-steplib.git

workflows:

  generic-build:
    steps:
    # steps which depend on `BUILD_TYPE` environment variable
    - git-clone:
        title: Clone repository `BUILD_TYPE`
        ...
    - script:
        title: Running some bash script
        ...
        

  build-alpha:
    envs:
    - BUILD_TYPE: alpha
    after_run:
    - generic-build

  build-beta:
    envs:
    - BUILD_TYPE: beta
    after_run:
    - generic-build

As you can see, Bitrise YAMLs are made for easy editing, not for easy processing. This puts a couple of barriers in the way of data crunching:

  • Workflow and step names are in the key position, so each workflow name has to be captured before its key can be referenced. A true data representation would be a key-value pair for each attribute, naming the attribute and defining its value:
    workflow: build-alpha
  • On the other hand, some attributes are named with a generic key, like steps or after_run.
  • Only one workflow is triggered, but we have to take the workflows listed in before_run and after_run into consideration if we want to understand which steps were actually run.

As a result of these barriers, we have a nested document where the nested objects are sometimes lists, sometimes hash tables, and in other cases custom objects.

AWS Tools for PowerShell

When a build is running, the complete related YAML file is saved as a snapshot of the application settings at the time of the build. Fortunately, AWS has a comprehensive package for working with its cloud services from PowerShell.

If you are a Mac or Linux user, you can only use the cross-platform PowerShell Core, but even if you are new to PowerShell on Windows, I encourage you to use this probably more future-proof version of the shell. The corresponding package from AWS is AWSPowershell.NetCore; to install it, use:


Install-Package AWSPowershell.NetCore -Force

To access AWS services we have to set our credentials.


Set-AWSCredentials -AccessKey <YourAccessKey> -SecretKey <YourSecret> -StoreAs <YourProfile>

Optionally, you can make the new profile the default by storing it under the name default.
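For instance, a minimal sketch of storing the same credentials as the default profile:

Set-AWSCredentials -AccessKey <YourAccessKey> -SecretKey <YourSecret> -StoreAs default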

Solution

I was working on some historical reports, so I needed to download and parse a huge number of files. It seemed like a good idea to process the monthly data in parallel.

Parallel processing

Fortunately, it is not really complicated in PowerShell to start several tasks at once. Since we would like to perform the same task with different inputs, all we need is a single script block and a parameter for each month/job.


$months = '10', '11', '12'
$months | ForEach-Object {
    # the script block receives the current month as its only parameter
    $ScriptBlock = {
        param($_)
    }

    # start a background job for this month
    Start-Job $ScriptBlock -ArgumentList $_
}

Looping over the array of months, we use the piped object, $_, as the parameter of the script block and pass it as an argument to the job.
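If reusing $_ as a parameter name feels confusing (it is normally PowerShell's automatic pipeline variable), a regular named parameter works just as well; a minimal sketch with a hypothetical $month parameter:

$months = '10', '11', '12'
$months | ForEach-Object {
    $ScriptBlock = {
        param($month)
        # $month receives the value passed via -ArgumentList
    }

    Start-Job $ScriptBlock -ArgumentList $_
}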

Downloading

To access files on S3 we can use - no surprise - the Read-S3Object cmdlet. If you plan to access specific files and not complete folders, it requires:

  • the bucket name,
  • the key of the object (a kind of path to the specific YAML),
  • the local file to write the S3 object to, and
  • the user profile, in case it is not the default we set.

Read-S3Object -BucketName <YourBucket> -Key <YourKey> -File <YourFile> -ProfileName <YourProfile>

From our website, we have lists of the repositories and builds for each month, which are used to construct the S3 object keys.


# monthly builds
$inPath = [string]::Format('/Volumes/Bitrise/yaml/builds/{0}.csv', $_)

# path to download the files
$outPath = [string]::Format('/Users/tamas/yamls from aws/yamls/{0}/', $_)

# import flat files to PowerShell object
$reposBuildsWorkflows = Import-Csv -Path $inPath -Header Reposlug, Buildslug, Created, Workflow

# loop over builds
foreach ($build in $reposBuildsWorkflows) {
    try {
        $key = [string]::Format('/build-logs-v2/{0}/{1}/bitrise.yml', $build.Reposlug, $build.Buildslug)
        $file = [string]::Format('{0} {1} {2} {3} {4}.yml', $outPath, $build.Reposlug, $build.Buildslug, $build.Workflow, $build.Created)
        Read-S3Object -BucketName bitrise-build-log-archives-production -Key $key -File $file -ProfileName DataTeamProfile
    }
    catch {
        Write-Host "Something is wrong with" $build.Reposlug $build.Buildslug -ForegroundColor "red"
        $ErrorMessage = $_.Exception.Message
        Write-Host $ErrorMessage
    }
}

Downloading in parallel

Now let's combine what we have so far. We have to add our packages and credentials to each session, so these go into the script block.


$months = '10', '11', '12'
$months | ForEach-Object {
    $ScriptBlock = {
        param($_)

        Install-Package AWSPowershell.NetCore -Force
        Set-AWSCredentials -AccessKey <YourAccessKey> -SecretKey <YourSecretKey> -StoreAs DataTeamProfile

        # monthly builds
        $inPath = [string]::Format('/Users/tamas/yamls from aws//builds/{0}.csv', $_)

        # path to download the files
        $outPath = [string]::Format('/Users/tamas/yamls from aws/yamls/{0}/', $_)

        # import flat files to PowerShell object
        $reposBuildsWorkflows = Import-Csv -Path $inPath -Header Reposlug, Buildslug, Created, Workflow

        # loop over builds
        foreach ($build in $reposBuildsWorkflows) {
            try {
                $key = [string]::Format('/build-logs-v2/{0}/{1}/bitrise.yml', $build.Reposlug, $build.Buildslug)
                $file = [string]::Format('{0} {1} {2} {3} {4}.yml', $outPath, $build.Reposlug, $build.Buildslug, $build.Workflow, $build.Created)
                Read-S3Object -BucketName bitrise-build-log-archives-production -Key $key -File $file -ProfileName DataTeamProfile
            }
            catch {
                Write-Host "Something is wrong with" $build.Reposlug $build.Buildslug -ForegroundColor "red"
                $ErrorMessage = $_.Exception.Message
                Write-Host $ErrorMessage
            }
        }
        
        $downloadedCount = (Get-ChildItem $outPath -Recurse | Measure-Object -Property Length -Sum).Count
        $downloadedSize = (Get-ChildItem $outPath -Recurse | Measure-Object -Property Length -Sum).Sum / 1MB
        Write-Host "downloaded" $downloadedCount "files" $downloadedSize "MB" -ForegroundColor "blue"
        
    }

    # showing the loop variable here (outside the job) works as expected
    Write-Host "processing $_..." -ForegroundColor "blue"

    # pass the loop variable across the job-context barrier
    Start-Job $ScriptBlock -ArgumentList $_
}

# Wait for all to complete
While (Get-Job -State "Running") { Start-Sleep 2 }

# Display output from all jobs
Get-Job | Receive-Job

# Cleaning up jobs
Remove-Job *

Parsing

By running the code above, we collect all the builds we are interested in into folders, one for each month. We also added the metadata to the file names, so let's start with that.


$path = "/Users/tamas/yamls from aws/yamls/12/ 3cc373f75f56dba1 3d57df3f8edf0553 ios 2018-12-01 00:24:34.958901.yml"

# retrieving metadata from the file name
$pathElements = $path -split "/"
$fileName = $pathElements[6]
# the file name starts with a space, so the first split element is empty
$fileNameElements = $fileName -split " "

$repoSlug = $fileNameElements[1]
$buildSlug = $fileNameElements[2]
$triggeredWorkflow = $fileNameElements[3]

# the timestamp consists of a date and a time part; strip the .yml extension from the time
$lastElements = $fileNameElements[5] -split ".yml"
$time = $lastElements[0]
$buildRunAt = $fileNameElements[4] + ' ' + $time

We can use the powershell-yaml module to deal with YAMLs.
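If it is not installed yet, it can be added from the PowerShell Gallery (a minimal sketch; adjust the scope to your environment):

Install-Module -Name powershell-yaml -Scope CurrentUser -Force

Once installed, it is like you would do with JSON: read the content and pass it to the ConvertFrom-Yaml cmdlet: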


$textInput = Get-Content -Raw -Path $path
$yamlObject = ConvertFrom-Yaml $textInput

As mentioned earlier, the nested objects read from the YAML have various types. workflows is not an array (as each workflow's name is used as a key) but a hash table. The GetEnumerator() method can unwrap it into an array of objects, so we can loop over the elements.


$workflows = $yamlObject.workflows.GetEnumerator() | Sort-Object Name

The attributes of a workflow live in the Value of the workflow object.


$steps = $workflow.Value.steps
$workflowsAfter = $workflow.Value.after_run

The same exercise we did with the workflows has to be repeated on the steps.


foreach ($step in $workflow.Value.steps) {
	# a step itself is a hash table, I only needed the name
	$h.steps += $step.GetEnumerator().Name
}

Collecting all this information into a collection of hash tables:


$workflowData = foreach ($workflow in $workflows) {
    # hash table to store workflow data
    $h = @{
        repoSlug = $repoSlug
        buildSlug = $buildSlug
        buildRunAt = $buildRunAt
        triggeredWorkflow = $triggeredWorkflow
        workflow = $workflow.Name
    }
    # adding steps into $h if they exist
    if ($workflow.Value.steps) {
        $h.Add("steps", @())
        foreach ($step in $workflow.Value.steps) {
            $h.steps += $step.GetEnumerator().Name
        }
    }
    # did any workflow run before?
    if ($workflow.Value.before_run) {
        $h.Add("beforeRun", @())
        foreach ($before_run in $workflow.Value.before_run) {
            $h.beforeRun += $before_run
        }
    }
    # did any workflow run after?
    if ($workflow.Value.after_run) {
        $h.Add("afterRun", @())
        foreach ($after_run in $workflow.Value.after_run) {
            $h.afterRun += $after_run
        }
    }
    
    $h
}

This collection, saved as JSON, can be an input to any document database or to Neo4j.
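The export itself is a one-liner; a minimal sketch, assuming a hypothetical output path:

# serialize the collection of hash tables and write it to disk
$workflowData | ConvertTo-Json -Depth 5 | Set-Content -Path '/Users/tamas/yamls from aws/parsed/workflows.json'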

Example

You can find examples in the Bitrise CLI Tutorial. This file is based on the Complex Workflow lesson.

testRepo testBuild analyze 2018-08-01 17-24-34.958901.yml
