Retrieving Records Spread Across Partitions
In some cases, the data you are replicating is spread across multiple partitions. You can specify a set of parameters to be iterated over and used while requesting all of your data. On each iteration, using the current element being iterated upon, the connector will perform a cycle of requesting data from your source.
PartitionRouters gives you the ability to specify either a static or dynamic set of elements that will be iterated over one at a time. This in turn is used to route requests to a partition of your data according to the elements iterated over.
The most common use case for the PartitionRouter component is the retrieval of data from an API endpoint that requires extra request inputs to indicate which partition of data to fetch.
Schema:
partition_router:
  default: []
  anyOf:
    - "$ref": "#/definitions/CustomPartitionRouter"
    - "$ref": "#/definitions/ListPartitionRouter"
    - "$ref": "#/definitions/SubstreamPartitionRouter"
    - type: array
      items:
        anyOf:
          - "$ref": "#/definitions/CustomPartitionRouter"
          - "$ref": "#/definitions/ListPartitionRouter"
          - "$ref": "#/definitions/SubstreamPartitionRouter"
Notice that you can specify one or more PartitionRouters on a Retriever. When multiple are defined, the result will be Cartesian product of all partitions and a request cycle will be performed for each permutation.
ListPartitionRouter
ListPartitionRouter iterates over values from a given list. It is defined by
- The partition values, which are the valid values for the cursor field
- The cursor field on a record
- request_option: optional request option to set on outgoing request parameters
Schema:
ListPartitionRouter:
  description: Partition router that is used to retrieve records that have been partitioned according to a list of values
  type: object
  required:
    - type
    - cursor_field
    - slice_values
  properties:
    type:
      type: string
      enum: [ListPartitionRouter]
    cursor_field:
      type: string
    partition_values:
      anyOf:
        - type: string
        - type: array
          items:
            type: string
    request_option:
      "$ref": "#/definitions/RequestOption"
    $parameters:
      type: object
      additionalProperties: true
As an example, this partition router will iterate over the 2 repositories ("airbyte" and "airbyte-secret") and will set a request_parameter on outgoing HTTP requests.
partition_router:
  type: ListPartitionRouter
  values:
    - "airbyte"
    - "airbyte-secret"
  cursor_field: "repository"
  request_option:
    type: RequestOption
    field_name: "repository"
    inject_into: "request_parameter"
SubstreamPartitionRouter
Substreams are streams that depend on the records of another stream
We might for instance want to read all the commits for a given repository (parent stream).
Substreams are implemented by defining their partition router as a SubstreamPartitionRouter.
SubstreamPartitionRouter is used to route requests to fetch data that has been partitioned according to a parent stream's records . We might for instance want to read all the commits for a given repository (parent resource).
- what the parent stream is
- what is the key of the records in the parent stream
- what is the attribute on the parent record that is being used to partition the substream data
- how to specify that attribute on an outgoing HTTP request to retrieve that set of records
Schema:
SubstreamPartitionRouter:
  description: Partition router that is used to retrieve records that have been partitioned according to records from the specified parent streams
  type: object
  required:
    - type
    - parent_stream_configs
  properties:
    type:
      type: string
      enum: [SubstreamPartitionRouter]
    parent_stream_configs:
      type: array
      items:
        "$ref": "#/definitions/ParentStreamConfig"
    $parameters:
      type: object
      additionalProperties: true
Example:
partition_router:
  type: SubstreamPartitionRouter
  parent_streams_configs:
    - stream: "#/repositories_stream"
      parent_key: "id"
      partition_field: "repository"
      request_option:
        type: RequestOption
        field_name: "repository"
        inject_into: "request_parameter"
REST APIs often nest sub-resources in the URL path.
If the URL to fetch commits was "/repositories/:id/commits", then the Requester's path would need to refer to the stream slice's value and no request_option would be set:
Example:
retriever:
  <...>
  requester:
    <...>
    path: "/respositories/{{ stream_slice.repository }}/commits"
  partition_router:
    type: SubstreamPartitionRouter
    parent_streams_configs:
      - stream: "#/repositories_stream"
        parent_key: "id"
        partition_field: "repository"
Nested streams
Nested streams, subresources, or streams that depend on other streams can be implemented using a SubstreamPartitionRouter
More readings
[^1] This is a slight oversimplification. See update cursor section for more details on how the cursor is updated.