Entity Extraction API

The Entity Extraction API is an asynchronous REST API that extracts entities from invoice and contract documents. Its two core REST interfaces handle document upload and result polling.

1. Overview

 

 

Base URL for all requests:
https://uaz3xro0r4.execute-api.eu-central-1.amazonaws.com/PROD/

2. POST /document

Uploads your document for entity extraction. Document data can be uploaded either as a file or as OCR data (text plus hOCR), e.g., from a previous OCR run. The response includes a job id, which is used to poll for results via GET /entities/<JOB_ID>.

2.1 Request Header

content-type: HTTP content type

supported values: "multipart/form-data; boundary=<SOME_BOUNDARY_STRING>"

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

2.2 Request Parameter

The request body includes the following list of form parameters:

document : file containing your document; documents must be limited to a file size of 4 MB; entity extraction is limited to the first 10 pages of the document

supported file types: pdf (single and multi page), tiff (single page and multi page, supported compressions: none, adobe_deflate, ccitt group 3 or 4, lzw) and jpg

required: yes, excludes usage of parameters „text“ and „hocr“

Parameter "document" and parameters "text"/"hocr" are mutually exclusive!

text : OCR text of your document, e.g., resulting from previous OCR

required: yes, requires usage of parameter „hocr“ and excludes usage of parameter „document“

hocr : hOCR data of your document, e.g., resulting from previous OCR

required: yes, requires usage of parameter "text" and excludes usage of parameter "document"

Usage of parameters „text“ and „hocr“ will skip the OCR step of the Entity Extraction API and, hence, significantly improve the request performance!

language : language used for character recognition (OCR)

supported values: [ ”en” | ”de” | ”en+de”]

required: no

default: “en+de”

documentClass : domain of your document; determines the entity types extracted by the Entity Extraction API

supported values: [ ”invoice” | ”contract” ]

required: no

default: determined automatically

useEmbeddedText : use embedded document text to skip OCR step and, hence, improve request performance; only applicable for pdf files when using parameter „document“

supported values: [ „true“ | „false“ ]

required: no

default: „false“

getHocr : return the document’s content in hOCR format (in addition to plain text)

supported values: [ ”true” | ”false” ]

required: no

default: “false”

callbackUrl : callback URL to which the Entity Extraction API sends an HTTP POST request once document processing has finished; the callback request body includes the job id (field "jobId"), which can be used to call GET /entities/<JOB_ID> for the extraction results

required: no

 

Example Callback

POST <CALLBACK_URL> 

headers {"content-type": "application/json"} 
body {"jobId": "229cdae0162805414755d5ee7eed216bc975738c"}
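The parameter rules in section 2.2 can be checked on the client before a request is sent. The following is a minimal sketch and not part of the API: the function name `validate_upload_params` and its error messages are hypothetical, and it assumes that boolean flags are passed as the strings "true" and "false".

```python
from typing import List, Mapping

SUPPORTED_LANGUAGES = {"en", "de", "en+de"}
SUPPORTED_CLASSES = {"invoice", "contract"}
BOOLEAN_STRINGS = {"true", "false"}


def validate_upload_params(params: Mapping[str, object]) -> List[str]:
    """Return a list of rule violations; an empty list means no rule is broken."""
    errors = []
    has_document = "document" in params
    has_text = "text" in params
    has_hocr = "hocr" in params

    # "document" and "text"/"hocr" exclude each other.
    if has_document and (has_text or has_hocr):
        errors.append("'document' cannot be combined with 'text'/'hocr'")
    # "text" and "hocr" require each other.
    if has_text != has_hocr:
        errors.append("'text' and 'hocr' must be supplied together")
    if not (has_document or has_text or has_hocr):
        errors.append("either 'document' or 'text' plus 'hocr' is required")

    # Optional parameters fall back to the documented defaults.
    language = params.get("language", "en+de")
    if language not in SUPPORTED_LANGUAGES:
        errors.append(f"unsupported language: {language!r}")
    document_class = params.get("documentClass")
    if document_class is not None and document_class not in SUPPORTED_CLASSES:
        errors.append(f"unsupported documentClass: {document_class!r}")
    for flag in ("useEmbeddedText", "getHocr"):
        if params.get(flag, "false") not in BOOLEAN_STRINGS:
            errors.append(f"'{flag}' must be 'true' or 'false'")
    # useEmbeddedText is only applicable together with "document".
    if params.get("useEmbeddedText") == "true" and not has_document:
        errors.append("'useEmbeddedText' only applies when 'document' is used")
    return errors
```

An empty result only means that the documented rules are satisfied; the server may still reject a request for other reasons, such as file size or file type.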

2.3 Response HTTP Status

200: Document uploaded successfully.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

415: Unsupported file format.

429: Usage limit exceeded.

2.4 Response Header

content-type: HTTP content type

supported values: "application/json"

2.5 Response Body

jobId: Job id used for polling the resulting entities from REST interface /entities

type: String

uploadFile: Description of the uploaded file

type: Map<String, Object>

object properties:

  • name: „size“

    type: Integer

  • name: „mime“

    type: String

  • name: „name“

    type: String

2.6 Example

Request

POST /document

headers {"x-api-key": <YOUR_API_KEY>, "customer-id": <YOUR_CUSTOMER_ID>}
body {"document": <YOUR_DOCUMENT_FILE>}

 

Response

{
  "jobId": "229cdae0162805414755d5ee7eed216bc975738c",
  "uploadFile": {
    "size": 30393,
    "mime": "application/pdf",
    "name": "Demo.pdf"
  }
}
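A request like the one above could be assembled with the Python standard library as sketched below. The helper names `encode_multipart` and `build_upload_request` are illustrative assumptions; only the URL, the header names, and the form parameter names come from this document. The request object is built but not sent.

```python
import urllib.request
import uuid
from typing import Dict, Tuple

BASE_URL = "https://uaz3xro0r4.execute-api.eu-central-1.amazonaws.com/PROD/"


def encode_multipart(fields: Dict[str, str], filename: str, content: bytes,
                     mime: str) -> Tuple[bytes, str]:
    """Encode text fields plus one file part named 'document'.

    Returns the request body and the matching content-type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
         f"Content-Type: {mime}\r\n\r\n").encode("utf-8")
    )
    parts.append(content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(api_key: str, customer_id: str, filename: str,
                         content: bytes, mime: str = "application/pdf",
                         **fields: str) -> urllib.request.Request:
    """Build (but do not send) a POST /document request."""
    body, content_type = encode_multipart(fields, filename, content, mime)
    return urllib.request.Request(
        BASE_URL + "document",
        data=body,
        method="POST",
        headers={
            "content-type": content_type,
            "x-api-key": api_key,
            "customer-id": customer_id,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send the upload; the "jobId" field of the JSON response is then used for polling.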

 

3. GET /entities/<JOB_ID>

Used to poll for the results of processing the uploaded document, using the job id returned in the response of POST /document.

 

The response includes the resulting entities which are organised in ungrouped and grouped entities depending on the entity type:

  • Ungrouped entities (field ‘entities’) consist of an entity name and a list of 0..n entity values which are sorted according to decreasing probability, i.e., the first value is the most likely result.

     

    Each entity value includes the following attributes:

    • originalValue: OCR value that was read from the document

    • value: normalized value for specific entity types, e.g., for currency, ‚€‘ is replaced by ‚EUR‘

    • confidence: float value between 0..1 which denotes the probability that an entity value is valid, i.e., the Entity Extraction API proposes potentially multiple values for each entity type which might include valid and invalid values

    • verified: boolean flag which denotes that an entity value is valid with respect to a dedicated set of high-level validation rules that are applied by the Entity Extraction API to each entity type, e.g., invoice amounts must be parsable to floating point values

  • Grouped entities (field ‘groups’) consist of a group name and an unsorted list of 0..n group entities.

    Currently supported group types:

    • taxRates: includes entities ‚invoice_taxRateGroup_taxRate‘, ‚invoice_taxRateGroup_taxAmount‘ and ‚invoice_taxRateGroup_netAmount‘

    • items: includes entities ‚item_group_quantity‘, ‚item_group_singleNetAmount‘ and ‚item_group_totalNetAmount‘

    Each group entity includes the following attributes:

    • members: a tuple of entity values (see above) that are related to each other; each group type is assigned to a static set of entity types; in a particular group entity, each entity type can be included exactly once or can be missing due to suboptimal extraction results

    • verified: boolean flag which denotes that a group entity is consistent with respect to a dedicated set of high-level validation rules that are applied by the Entity Extraction API to each group type, e.g., ‚taxRate‘ * ’netAmount‘ = ‚taxAmount‘
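How a client might pick a value from an ungrouped entity and re-check a tax-rate group can be sketched as follows. Both helpers are hypothetical. `best_value` applies one possible policy: prefer the first value flagged as verified, otherwise take the first (most probable) value. `tax_group_is_consistent` assumes that normalized values use a dot as decimal separator and that the tax rate is given in percent; the tolerance is an assumption as well.

```python
from typing import Dict, List, Optional


def best_value(values: List[dict]) -> Optional[dict]:
    """Return the first verified value, else the first value, else None."""
    for candidate in values:
        if candidate.get("verified") is True:
            return candidate
    # Values are sorted by decreasing probability, so index 0 is the most likely.
    return values[0] if values else None


def tax_group_is_consistent(members: Dict[str, dict],
                            tolerance: float = 0.01) -> Optional[bool]:
    """Check taxRate * netAmount = taxAmount for one 'taxRates' group entity.

    Returns None if a member is missing or not numeric, since groups
    may be incomplete.
    """
    try:
        rate = float(members["invoice_taxRateGroup_taxRate"]["value"])
        net = float(members["invoice_taxRateGroup_netAmount"]["value"])
        tax = float(members["invoice_taxRateGroup_taxAmount"]["value"])
    except (KeyError, TypeError, ValueError):
        return None
    # Assumption: the rate is a percentage, e.g. "19" means 19 %.
    return abs(rate / 100.0 * net - tax) <= tolerance
```

This client-side check only mirrors the rule quoted above; the "verified" flag returned by the API remains the authoritative result.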

3.1 Request Header

content-type: HTTP content type

supported values: “application/json“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

3.2 Path Variable

JOB_ID : the job id is received from the response of the call to REST interface /document

3.3 Response HTTP Status

200: Entity extraction successful.

202: Job processing not finished yet.

204: No entities found.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

422: File cannot be processed.

429: Too many requests, decrease polling frequency.
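The status codes above suggest a polling loop that waits while the job is still running and slows down when the API asks for it. The sketch below is one possible approach, not a prescribed client: `fetch` is a hypothetical callable that performs GET /entities/<JOB_ID> and returns the HTTP status together with the parsed JSON body, and the delays and the doubling on 429 are assumptions.

```python
import time
from typing import Callable, Optional, Tuple


def poll_entities(
    fetch: Callable[[], Tuple[int, Optional[dict]]],
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    max_attempts: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the job is finished; return the body, or None for 204."""
    delay = initial_delay
    for _ in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body  # entity extraction successful
        if status == 204:
            return None  # finished, but no entities found
        if status == 429:
            # Too many requests: decrease the polling frequency.
            delay = min(delay * 2, max_delay)
        elif status != 202:
            # 400, 401, 403, 422 and others will not resolve by waiting.
            raise RuntimeError(f"polling failed with HTTP status {status}")
        sleep(delay)
    raise TimeoutError("job did not finish within the allowed number of attempts")
```

Injecting `sleep` keeps the loop testable without real waiting. If a callbackUrl was supplied on upload, polling can start only after the callback arrives.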

3.4 Response Header

content-type: HTTP content type

supported values: "application/json"

3.5 Response Body

documentClass: domain of the input document

supported values: [ „INVOICE_DE“ | „INVOICE_EN“ | „CONTRACT_DE“ | „CONTRACT_EN“ ]

entities: entities extracted from the input document (see entities)

groups: entity groups containing entity tuples from the input document (see groups)

supported groups: [ “items” | ”taxRates” ]

filename: name of the input document file

text: OCR text read from the input document

hocr: document’s content in hOCR format; enabled via input parameter „getHocr“

errorMsg: error description; null on success

3.6 Example

Request

GET /entities/229cdae0162805414755d5ee7eed216bc975738c
headers {"x-api-key": <YOUR_API_KEY>, "customer-id": <YOUR_CUSTOMER_ID>}

 

Response

{
    "documentClass": "INVOICE_DE",
    "errorMsg": null,
    "entities": {
        "vendor_city": [
            {
                "value": "INGOLSTADT",
                "originalValue": "INGOLSTADT",
                "confidence": 0.98210305,
                "verified": null
            }
        ],
        "vendor_zip": [
            {
                "value": "85046",
                "originalValue": "85046",
                "confidence": 0.8447278,
                "verified": true
            }
        ],
        "vendor_vatNumber": [],
        "vendor_iban": [],
        "recipient_street": [
            {
                "value": "Rudolf-Harbig-Weg 26",
                "originalValue": "Rudolf-Harbig-Weg 26",
                "confidence": 0.9971908,
                "verified": null
            }
        ],
        "invoice_invoiceNumber": [
            {
                "value": "458001350",
                "originalValue": "458001350",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_orderNumber": [],
        "recipient_accountNumber": [],
        "invoice_taxRateGroup_taxAmount": [
            {
                "value": "14.06",
                "originalValue": "14,06",
                "confidence": null,
                "verified": true
            }
        ],
        "vendor_taxIdNumber": [],
        "recipient_city": [
            {
                "value": "Münster",
                "originalValue": "Münster",
                "confidence": 0.9588283,
                "verified": null
            }
        ],
        "recipient_zip": [
            {
                "value": "48149",
                "originalValue": "48149",
                "confidence": 0.9821135,
                "verified": true
            }
        ],
        "invoice_taxRateGroup_netAmount": [],
        "invoice_deliveryNumber": [],
        "recipient_company": [],
        "vendor_bic": [],
        "vendor_name": [
            {
                "value": "MEDIA MARKT E-BUSINESS GMBH",
                "originalValue": "MEDIA MARKT E-BUSINESS GMBH",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_dueDate": [],
        "invoice_invoiceCurrency": [
            {
                "value": "EUR",
                "originalValue": "€",
                "confidence": null,
                "verified": true
            }
        ],
        "vendor_bankName": [],
        "invoice_deliveryDate": [],
        "invoice_invoiceDate": [
            {
                "value": "10.08.2017",
                "originalValue": "10.08.2017",
                "confidence": 0.9015204,
                "verified": true
            },
            {
                "value": "31.08.2017",
                "originalValue": "31.08.2017",
                "confidence": 0.8834753,
                "verified": true
            }
        ],
        "invoice_taxRateGroup_taxRate": [
            {
                "value": "190",
                "originalValue": "190%",
                "confidence": null,
                "verified": null
            }
        ],
        "vendor_street": [
            {
                "value": "WANKELSTRASSE 5",
                "originalValue": "WANKELSTRASSE 5",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_invoiceGrossAmount": [
            {
                "value": "88.05",
                "originalValue": "88,05",
                "confidence": 0.46333426,
                "verified": true
            },
            {
                "value": "73.99",
                "originalValue": "73,99",
                "confidence": 0.45646423,
                "verified": false
            },
            {
                "value": "4.99",
                "originalValue": "4,99",
                "confidence": null,
                "verified": false
            }
        ]
    },
    "groups": {
        "taxRates": [
            {
                "members": {
                    "invoice_taxRateGroup_taxAmount": {
                        "value": "14.06",
                        "originalValue": "14,06",
                        "confidence": null,
                        "verified": true
                    },
                    "invoice_taxRateGroup_taxRate": {
                        "value": "190",
                        "originalValue": "190%",
                        "confidence": null,
                        "verified": null
                    }
                },
                "verified": null
            }
        ],
        "items": []
    },
    "text": "MEDIA MARKT E-BUSINESS GMBH “(\n\nWANKELSTRASSE 5 -\n\n85046 INGOLSTADT\n\nTel.: 0841/6344545\n\nE-Mail: ONLINESHOP@MEDIAMARKT.DE\n\nRechnungsadresse Rechnung Nr. 458001350\n\nDaniel Winter Rechnungsdatum 10.08.2017\n\nRudolf-Harbig-Weg 26\n\n48149 Münster Kunden-Nr. 3050789\n\nFällig Am 31.08.2017\n\nRechnung Betrag €88,05\n\nMenge Beschreibung Einzelpreis Gesamtpreis\n\n1 PIXMA MX475 A4 MFP INJEKT (P) 69,00 69,00\n\n1 Versandkosten 4,99 4,99\nSumme Netto 73,99\nMwSt. 190% 14,06\n\n",
    "filename": "Demo.pdf",
    "hocr": null
}
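A response like the one above can be split into values that may be accepted automatically and values that need manual review. The following sketch shows one possible policy and is not prescribed by the API: the function name, the threshold of 0.8, and the treatment of a null confidence as "unknown" are all assumptions.

```python
from typing import Dict, Tuple


def split_by_confidence(response: dict,
                        min_confidence: float = 0.8) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (accepted, review) maps from entity name to the top value."""
    accepted: Dict[str, str] = {}
    review: Dict[str, str] = {}
    for name, candidates in (response.get("entities") or {}).items():
        if not candidates:
            continue  # nothing was extracted for this entity type
        top = candidates[0]  # values are sorted by decreasing probability
        verified = top.get("verified")
        confidence = top.get("confidence")
        if verified is True:
            trusted = True
        elif verified is False:
            trusted = False
        else:
            # verified is null: fall back to the confidence, if there is one
            trusted = confidence is not None and confidence >= min_confidence
        (accepted if trusted else review)[name] = top["value"]
    return accepted, review
```

Applied to the example response, "vendor_city" would be accepted on confidence, "invoice_invoiceGrossAmount" would be accepted because its top value is verified, and "invoice_invoiceNumber" would be queued for review because both attributes are null.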

 

4. POST /jobs/query

Used to query the state of multiple jobs in a single request.

4.1 Request Header

content-type: HTTP content type

supported values: “application/json“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

4.2 Request Parameter

The request body is a single JSON object with the following fields:

jobIds : list of job ids received from calling POST /document

type: list

required: yes

4.3 Response HTTP Status

200: Query finished successfully.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

429: Too many requests, decrease calling frequency.

4.4 Response Header

content-type: HTTP content type

supported values: "application/json"

4.5 Response Body

jobs: map containing job ids from request as keys and job states as values

type: Map

supported job states: [ „PROCESSING“ | „FINISHED“ | „UNKNOWN“ ]

4.6 Example

Request

POST /jobs/query
headers {"x-api-key": <YOUR_API_KEY>, "customer-id": <YOUR_CUSTOMER_ID>}
body {
  "jobIds": [
    "229cdae0162805414755d5ee7eed216bc975738c",
    "ab4dcb09e196acd6d859f571b97d94d8dc7fae57",
    "1848c3ed29f03ade5cf29c31d7b6dc0665c4d836"
  ]
}

 

Response

{
  "jobs": {
    "229cdae0162805414755d5ee7eed216bc975738c": "FINISHED",
    "ab4dcb09e196acd6d859f571b97d94d8dc7fae57": "PROCESSING",
    "1848c3ed29f03ade5cf29c31d7b6dc0665c4d836": "UNKNOWN"
  }
}
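The request body and the response shown above map directly onto small helpers. The following sketch uses hypothetical function names; it builds the JSON body for POST /jobs/query and groups the returned job ids by state.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def build_jobs_query(job_ids: Iterable[str]) -> bytes:
    """Serialize the request body for POST /jobs/query."""
    return json.dumps({"jobIds": list(job_ids)}).encode("utf-8")


def group_by_state(response: dict) -> Dict[str, List[str]]:
    """Turn the 'jobs' map (job id -> state) into state -> list of job ids."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for job_id, state in response.get("jobs", {}).items():
        grouped[state].append(job_id)
    return dict(grouped)
```

Jobs reported as "FINISHED" can then be fetched via GET /entities/<JOB_ID>, while jobs in state "PROCESSING" are queried again later.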

 

5. POST /training

Used to upload training samples for training of the extraction models.

5.1 Request Header

content-type: HTTP content type

supported values: [ “application/json“ | “multipart/form-data“ ]

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

5.2 Request Parameter

The request body includes a single body parameter that includes the following fields in JSON format:

documentClass : document class id of training document

required : yes

supported values : [ „INVOICE“ | „CONTRACT“ ]

language : language id of the training document

required : yes

supported values : [ „en“ | „de“ ]

text : plain text from training document

required: yes

document : document file

supported file types: [ pdf | tiff | jpg ]

required: no

Usage of parameter "document" requires content-type "multipart/form-data". If parameter "document" is omitted, the content-type must be "application/json".

entities : entities from training document

required : yes

Must only include the supported entities listed in 6. Supported Entity Types. Entities may be omitted or may have empty values (i.e., an empty array). Entity values may include all attributes described in 3. GET /entities/<JOB_ID>:

  • originalValue: value used for training
  • value: value used for training if attribute „originalValue“ is empty or omitted; ignored otherwise
  • verified: must be „true“ or omitted, otherwise this entity value is ignored for training
  • confidence: attribute is ignored for training

groups : group entities from training document

required : no

supported groups : [ „items“ | „taxRates“ ]

Must only include the supported entities listed in 6. Supported Entity Types. Entities may be omitted or may have empty values (i.e., an empty array). Entity values are organised as in field "entities". Alternatively, all entities can be placed in field "entities" alone without affecting the training result, so there is no need to use field "groups".
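The parameter rules for a training request can be sketched as a small body builder. The function names below are hypothetical, and the sketch assumes that the caller passes values that have already been reviewed; it writes them as "originalValue" and marks them as verified so that they are not ignored for training.

```python
from typing import Dict, List

TRAINING_CLASSES = {"INVOICE", "CONTRACT"}
TRAINING_LANGUAGES = {"en", "de"}


def build_training_body(document_class: str, language: str, text: str,
                        reviewed_values: Dict[str, List[str]]) -> dict:
    """Build the JSON body for POST /training from reviewed entity values."""
    if document_class not in TRAINING_CLASSES:
        raise ValueError(f"unsupported documentClass: {document_class!r}")
    if language not in TRAINING_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    entities = {
        # An empty list stays an empty array, which the API permits.
        name: [{"originalValue": value, "verified": True} for value in values]
        for name, values in reviewed_values.items()
    }
    return {
        "documentClass": document_class,
        "language": language,
        "text": text,
        "entities": entities,
    }


def training_content_type(has_document: bool) -> str:
    """Select the content-type: multipart only when a document file is sent."""
    return "multipart/form-data" if has_document else "application/json"
```

Field "groups" is left out on purpose, since all entities can be placed in field "entities".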

5.3 Response HTTP Status

200: Training data submitted successfully.

204: Training data empty.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

5.4 Response Header

content-type: HTTP content type

supported values: "application/json"

5.5 Response Body

errorMsg: error description; null on success

5.6 Example

Request

POST /training
headers {"x-api-key": <YOUR_API_KEY>, "customer-id": <YOUR_CUSTOMER_ID>}
body {
  "documentClass": "INVOICE",
  "language": "de",
  "text": "MEDIA MARKT E-BUSINESS GMBH “(\n\nWANKELSTRASSE 5 -\n\n85046 INGOLSTADT\n\nTel.: 0841/6344545\n\nE-Mail: ONLINESHOP@MEDIAMARKT.DE\n\nRechnungsadresse Rechnung Nr. 458001350\n\nDaniel Winter Rechnungsdatum 10.08.2017\n\nRudolf-Harbig-Weg 26\n\n48149 Münster Kunden-Nr. 3050789\n\nFällig Am 31.08.2017\n\nRechnung Betrag €88,05\n\nMenge Beschreibung Einzelpreis Gesamtpreis\n\n1 PIXMA MX475 A4 MFP INJEKT (P) 69,00 69,00\n\n1 Versandkosten 4,99 4,99\nSumme Netto 73,99\nMwSt. 190% 14,06\n\n",
  "entities": {
    "vendor_city": [
      {
        "value": "INGOLSTADT"
      }
    ],
    "vendor_zip": [
      {
        "value": "85046"
      }
    ],
    "recipient_street": [
      {
        "value": "Rudolf-Harbig-Weg 26"
      }
    ],
    "invoice_invoiceNumber": [
      {
        "value": "458001350"
      }
    ],
    "invoice_taxRateGroup_taxAmount": [
      {
        "value": "14,06"
      }
    ],
    "recipient_city": [
      {
        "value": "Münster"
      }
    ],
    "recipient_zip": [
      {
        "value": "48149"
      }
    ],
    "vendor_name": [
      {
        "value": "MEDIA MARKT E-BUSINESS GMBH"
      }
    ],
    "invoice_invoiceCurrency": [
      {
        "value": "€"
      }
    ],
    "invoice_invoiceDate": [
      {
        "value": "10.08.2017"
      }
    ],
    "invoice_taxRateGroup_taxRate": [
      {
        "value": "19,0"
      }
    ],
    "vendor_street": [
      {
        "value": "WANKELSTRASSE 5"
      }
    ],
    "invoice_invoiceGrossAmount": [
      {
        "value": "88,05"
      }
    ]
  }
}

 

6. Supported Entity Types

6.1 Invoice Entities

For invoice documents, the Entity Extraction API provides a default set of entities that are extracted. The Buildsimple team may add additional entities to the default entity set for invoice documents in future releases.

6.1.1 Invoice

The following entities are located at field ‚entities‘ of the response from GET /entities/<JOB_ID>

Entity name | Description
invoice_invoiceDate | invoice date
invoice_invoiceNumber | invoice number
invoice_orderNumber | order number
invoice_deliveryDate | delivery date
invoice_invoiceCurrency | invoice currency
invoice_invoiceGrossAmount | invoice gross amount
invoice_dueDate | due date
invoice_deliveryNumber | delivery number

6.1.2 Vendor

The following entities are located at field ‚entities‘ of the response from GET /entities/<JOB_ID>

Entity name | Description
vendor_name | vendor name
vendor_street | vendor street name and house number
vendor_zip | vendor zip code
vendor_city | vendor city
vendor_bankName | name of the vendor’s bank
vendor_iban | vendor IBAN
vendor_bic | vendor BIC
vendor_taxIdNumber | vendor tax id
vendor_vatNumber | vendor VAT number

6.1.3 Recipient

The following entities are located at field ‚entities‘ of the response from GET /entities/<JOB_ID>

Entity name | Description
recipient_company | name of the recipient’s company
recipient_street | recipient street and house number
recipient_zip | recipient zip code
recipient_city | recipient city

6.1.4 Invoice Items

The following entities are located at field ‚members‘ within list ‚groups[‚items‘]‘ of the response from GET /entities/<JOB_ID>. Groups may be incomplete, i.e., contain only 1-5 entities.

Entity name | Description
item_group_quantity | quantity of an invoice item (grouped by invoice item)
item_group_singleNetAmount | single net amount (grouped by invoice item)
item_group_totalNetAmount | total net amount (grouped by invoice item)
item_group_description | description of invoice item (grouped by invoice item)
item_group_materialNumber | material number (grouped by invoice item)
item_group_taxRate | tax rate applied to invoice item (grouped by invoice item)

6.1.5 Tax Rates

The following entities are located at field ‚members‘ within list ‚groups[‚taxRates‘]‘ of the response from GET /entities/<JOB_ID>. Groups may be incomplete, i.e., contain only 1-2 entities.

Entity name | Description
invoice_taxRateGroup_taxRate | tax rate
invoice_taxRateGroup_netAmount | total net amount (grouped by tax rate)
invoice_taxRateGroup_taxAmount | total tax amount (grouped by tax rate)

6.2 Contract Entities

For contract documents, the Entity Extraction API provides a default set of entities that are extracted. The Buildsimple team may add additional entities to the default entity set for contract documents in future releases.

6.2.1 Contractor

The following entities are located at field ‚entities‘ of the response from GET /entities/<JOB_ID>

Entity name | Description
contractor_name | contractor name
contractor_street | contractor street and house number
contractor_zip | contractor zip code
contractor_city | contractor city
contractor_contact | contact person on contractor side

6.2.2 Contractee

The following entities are located at field ‚entities‘ of the response from GET /entities/<JOB_ID>

Entity name | Description
contractee_name | contractee name
contractee_street | contractee street name and house number
contractee_zip | contractee zip code
contractee_city | contractee city
contractee_contact | contact person on contractee side

6.2.3 Contract

The following entities are located at field ‚entities‘ of the response from GET /entities/<JOB_ID>

Entity name | Description
contract_number | contract number
contract_date | contract date
contract_begin_date | contract begin date
contract_end_date | contract end date
contract_period | duration of contract
contract_object | object of contract
contract_volume | volume of contract
contract_currency | currency of contract volume

7. Swagger API

The following listing provides a Swagger description of the Entity Extraction API in YAML:

 

swagger: '2.0'
info:
  description: Extract entities from your documents.
  version: '1.0.0'
  title: ISR Entity Extraction API
host: 'jhguolkp91.execute-api.eu-central-1.amazonaws.com'
basePath: /QA
schemes:
  - https
paths:
  /document:
    post:
      summary: Upload document for entity extraction
      operationId: uploadUsingPOST
      consumes:
        - multipart/form-data
      produces:
        - application/json
      parameters:
        - name: customer-id
          in: header
          description: customer-id
          required: true
          type: string
        - name: x-api-key
          in: header
          description: x-api-key
          required: true
          type: string
        - name: document
          in: formData
          description: document
          required: true
          type: file
        - name: language
          in: formData
          description: language
          required: false
          type: string
          default: en+de
        - name: documentClass
          in: formData
          description: documentClass
          required: false
          type: string
        - name: getHocr
          in: formData
          description: getHocr
          required: false
          type: boolean
          default: false
      responses:
        '200':
          description: OK
          schema:
            $ref: '#/definitions/UploadResponse'
        '400':
          description: Bad request. Missing or invalid input parameter.
        '401':
          description: Authorization failed. Operation not allowed.
        '403':
          description: Authorization failed due to invalid credentials.
        '415':
          description: Unsupported file format.
        '429':
          description: Usage limit exceeded.
        '500':
          description: Internal server error during processing.
        '503':
          description: Required service unavailable.
  /entities/{job-id}:
    get:
      summary: Get entity results
      operationId: getResultUsingGET
      consumes:
        - application/json
      produces:
        - application/json
      parameters:
        - name: customer-id
          in: header
          description: customer-id
          required: true
          type: string
        - name: x-api-key
          in: header
          description: x-api-key
          required: true
          type: string
        - name: job-id
          in: path
          description: job-id
          required: true
          type: string
      responses:
        '200':
          description: OK
          schema:
            $ref: '#/definitions/ExtractResponse'
        '202':
          description: Job processing not finished yet
        '204':
          description: No entities found
        '400':
          description: Bad request. Missing or invalid input parameter.
        '401':
          description: Authorization failed. Operation not allowed.
        '403':
          description: Authorization failed due to invalid credentials.
        '415':
          description: Unsupported file format.
        '429':
          description: Usage limit exceeded.
        '500':
          description: Internal server error during processing.
        '503':
          description: Required service unavailable.
definitions:
  UploadResponse:
    type: object
    properties:
      uploadFile:
        type: object
        additionalProperties:
          type: object
          properties:
            size:
              type: integer
            mime:
              type: string
            name:
              type: string
      jobId:
        type: string
  ExtractResponse:
    type: object
    properties:
      filename:
        type: string
      documentClass:
        type: string
      entities:
        type: object
        additionalProperties:
          type: array
          items:
            $ref: '#/definitions/ExtractEntity'
      groups:
        type: object
        additionalProperties:
          type: array
          items:
            $ref: '#/definitions/GroupEntity'
      hocr:
        type: string
      text:
        type: string
      errorMsg:
        type: string
  GroupEntity:
    type: object
    properties:
      members:
        type: object
        additionalProperties:
          $ref: '#/definitions/ExtractEntity'
      verified:
        type: boolean
  ExtractEntity:
    type: object
    properties:
      confidence:
        type: number
        format: float
      originalValue:
        type: string
      value:
        type: string
      verified:
        type: boolean