Entity Extraction API

The Entity Extraction API extracts entities from invoice and contract documents asynchronously. Its core workflow uses two REST interfaces: one for document upload and one for result polling.

1.Overview

Base URL for all requests:

https://uaz3xro0r4.execute-api.eu-central-1.amazonaws.com/PROD/

2.POST /document

Used to upload your document for entity extraction. Document data can be uploaded either as a file or as OCR data, e.g., from a previous OCR run. The response includes a job id, which is used to poll for results via REST interface /entities/.

2.1.Request Header

content-type: HTTP content type

supported values: “multipart/form-data; boundary=” followed by the boundary string that delimits the parts of the request body

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

2.2.Request Parameter

The request body includes the following list of form parameters:

document : file containing your document; documents must be limited to a file size of 4 MB; entity extraction is limited to the first 10 pages of the document

supported file types: pdf (single and multi page), tiff (single page and multi page, supported compressions: none, adobe_deflate, ccitt group 3 or 4, lzw) and jpg

required: yes, excludes usage of parameters “text” and “hocr”

Parameters “document” and “text”/“hocr” are mutually exclusive!

text : OCR text of your document, e.g., resulting from previous OCR

required: yes, requires usage of parameter “hocr” and excludes usage of parameter “document”

hocr : hOCR data of your document, e.g., resulting from previous OCR

required: yes, requires usage of parameter “text” and excludes usage of parameter “document”

Usage of parameters “text” and “hocr” will skip the OCR step of the Entity Extraction API and, hence, significantly improve the request performance!

language : language used for character recognition (OCR)

supported values: [ ”en” | ”de” | ”en+de”]

required: no

default: “en+de”

documentClass : domain of your document; determines the entity types extracted by the Entity Extraction API

supported values: [ ”invoice” | ”contract” ]

required: no

default: determined automatically

useEmbeddedText : use embedded document text to skip OCR step and, hence, improve request performance; only applicable for pdf files when using parameter “document”

supported values: [ “true” | “false” ]

required: no

default: “false”

getHocr : return the document’s content in hOCR format (in addition to plain text)

supported values: [ ”true” | ”false” ]

required: no

default: “false”

uploadId : required for processing document files > 4 MB: upload id returned from endpoint GET /uploadurl; each upload id must only be used once

required: no

callbackUrl : callback URL to which the Entity Extraction API sends an HTTP POST request after document processing has finished; the callback request includes a job id which can be used to call GET /entities/ for the extraction results

required: no

Example Callback

POST  

headers {"content-type": "application/json"}
body {"jobId": "229cdae0162805414755d5ee7eed216bc975738c"}
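
A callback receiver could pull the job id out of the request body roughly as sketched below. The function name `job_id_from_callback` is illustrative and not part of the API; only the `jobId` field is taken from the example callback above.

```python
import json


def job_id_from_callback(body: bytes) -> str:
    """Return the jobId from a callback request body, or raise ValueError."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("callback body is not valid JSON") from exc
    # The example callback carries a JSON object with a single "jobId" string.
    job_id = payload.get("jobId") if isinstance(payload, dict) else None
    if not isinstance(job_id, str) or not job_id:
        raise ValueError("callback body has no jobId")
    return job_id
```

The returned id can then be used to call GET /entities/ for the extraction results.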

2.3.Response HTTP Status

200: Document uploaded successfully.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

415: Unsupported file format.

429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.
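
The documentation does not prescribe how a client should react to status 429. One common approach is to retry with exponential backoff, sketched here under that assumption. `send_with_backoff`, its parameters, and the delay values are hypothetical; `send` stands for any function that performs the request and returns the HTTP status and body.

```python
import time
from typing import Callable, Tuple


def send_with_backoff(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call send() until it returns a status other than 429 or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Assumed policy: double the waiting time after every rejected attempt.
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.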

2.4.Response Header

content-type: HTTP content type

supported values: “application/json”

2.5.Response Body

jobId: Job id used for polling the resulting entities from REST interface /entities

type: String

uploadFile: Description of the uploaded file

type: Map

 

object properties:

  • name: "size"
    type: Integer
  • name: "mime"
    type: String
  • name: "name"
    type: String

2.6.Example

Request

POST /document

headers {"x-api-key": , "customer-id": }
body {"document": }

Response

{
  "jobId": "229cdae0162805414755d5ee7eed216bc975738c",
  "uploadFile": {
    "size": 30393,
    "mime": "application/pdf",
    "name": "Demo.pdf"
  }
}
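
The following sketch shows one way to assemble the headers and the multipart/form-data body for POST /document with the Python standard library, including the exclusion rules between “document” and “text”/“hocr”. The function name `build_document_request` and its signature are assumptions made for illustration; the header and parameter names are the ones listed in sections 2.1 and 2.2. The sketch only builds the request and does not send it.

```python
import uuid
from typing import Dict, Optional, Tuple


def build_document_request(
    customer_id: str,
    api_key: str,
    fields: Dict[str, str],
    document: Optional[Tuple[str, bytes, str]] = None,
) -> Tuple[Dict[str, str], bytes]:
    """Return (headers, body) for POST /document.

    fields holds plain form parameters such as "language" or "text";
    document is an optional (filename, content, mime_type) tuple.
    """
    has_ocr = "text" in fields or "hocr" in fields
    if document is not None and has_ocr:
        raise ValueError('"document" excludes "text" and "hocr"')
    if has_ocr and not ("text" in fields and "hocr" in fields):
        raise ValueError('"text" and "hocr" must be sent together')

    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    if document is not None:
        filename, content, mime_type = document
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
            f'Content-Type: {mime_type}\r\n\r\n'.encode("utf-8")
            + content + b"\r\n"
        )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    headers = {
        "content-type": f"multipart/form-data; boundary={boundary}",
        "customer-id": customer_id,
        "x-api-key": api_key,
    }
    return headers, b"".join(parts)
```

The returned pair can be handed to any HTTP client, for example `urllib.request.Request(url, data=body, headers=headers, method="POST")`.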

3.GET /entities/

Used to poll for results from processing of the uploaded document using the job id from REST interface POST /document response.

The response includes the resulting entities which are organised in ungrouped and grouped entities depending on the entity type:

  • Ungrouped entities (field ‘entities’) consist of an entity name and a list of 0..n entity values which are sorted according to decreasing probability, i.e., the first value is the most likely result. 

    Each entity value includes the following attributes:

    • originalValue: OCR value that was read from the document
    • value: normalized value for specific entity types, e.g., for currency, ‘€’ is replaced by ‘EUR’
    • confidence: float value between 0..1 which denotes the probability that an entity value is valid, i.e., the Entity Extraction API proposes potentially multiple values for each entity type which might include valid and invalid values
    • verified: boolean flag which denotes that an entity value is valid with respect to a dedicated set of high-level validation rules that are applied by the Entity Extraction API to each entity type, e.g., invoice amounts must be parsable to floating point values
  • Grouped entities (field ‘groups’) consist of a group name and an unsorted list of 0..n group entities. Currently supported group types:
    • taxRates: includes entities ‘invoice_taxRateGroup_taxRate’, ‘invoice_taxRateGroup_taxAmount’, and ‘invoice_taxRateGroup_netAmount’
    • items: includes entities ‘item_group_quantity’, ‘item_group_singleNetAmount’, and ‘item_group_totalNetAmount’
  • Each group entity includes the following attributes:
    • members: a tuple of entity values (see above) that are related to each other; each group type is assigned to a static set of entity types; in a particular group entity, each entity type can be included exactly once or can be missing due to suboptimal extraction results
    • verified: boolean flag which denotes that a group entity is consistent with respect to a dedicated set of high-level validation rules that are applied by the Entity Extraction API to each group type, e.g., ‘taxRate’ * ’netAmount’ = ‘taxAmount’
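
The group rule ‘taxRate’ * ‘netAmount’ = ‘taxAmount’ can be re-checked on the client side, for example when the ‘verified’ flag of a group is null. The sketch below is not the API's own validation logic: the function name, the tolerance, and the reading of the normalized tax rate as a percentage are all assumptions.

```python
from typing import Any, Dict, Optional


def tax_group_is_consistent(group: Dict[str, Any], tolerance: float = 0.01) -> Optional[bool]:
    """Check taxRate * netAmount against taxAmount for one 'taxRates' group entity.

    Returns None when a member is missing or its value is not numeric.
    Assumes the normalized tax rate is a percentage (e.g. "19.0" for 19 %).
    """
    members = group.get("members", {})
    try:
        rate = float(members["invoice_taxRateGroup_taxRate"]["value"])
        net = float(members["invoice_taxRateGroup_netAmount"]["value"])
        tax = float(members["invoice_taxRateGroup_taxAmount"]["value"])
    except (KeyError, TypeError, ValueError):
        return None
    # Allow a small rounding difference between computed and printed tax amount.
    return abs(net * rate / 100.0 - tax) <= tolerance
```

A group with a missing member, which the description above explicitly allows, yields `None` rather than `False`, so incomplete groups are not mistaken for inconsistent ones.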

3.1.Request Header

content-type: HTTP content type

supported values: “application/json“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

3.2.Path Variable

JOB_ID : the job id appended to the request path; it is received from the response of the call to REST interface POST /document

3.3.Response HTTP Status

200: Entity extraction successful.

202: Job processing not finished yet.

204: No entities found.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

422: File cannot be processed.

429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.
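
A polling loop built on these status codes might look like the sketch below. The function name, the polling interval, and the maximum number of polls are assumptions, not values given by the API; `fetch` stands for any function that calls GET /entities/ for one job and returns the HTTP status together with the parsed body.

```python
import time
from typing import Callable, Optional, Tuple


def poll_entities(
    fetch: Callable[[], Tuple[int, Optional[dict]]],
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the job leaves status 202.

    Returns the response body for 200 and None for 204 (no entities found).
    Raises RuntimeError for any other status or when max_polls is exhausted.
    """
    for _ in range(max_polls):
        status, body = fetch()
        if status == 200:
            return body
        if status == 204:
            return None
        if status != 202:
            raise RuntimeError(f"GET /entities failed with HTTP {status}")
        sleep(interval)  # 202: job processing not finished yet
    raise RuntimeError("job still processing after max_polls attempts")
```

When a callbackUrl was passed to POST /document, polling can be replaced by waiting for the callback request.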

3.4.Response Header

content-type: HTTP content type

supported values: “application/json”

3.5.Response Body

documentClass: domain of the input document

supported values: [ “INVOICE_DE” | “INVOICE_EN” | “CONTRACT_DE” | “CONTRACT_EN” ]

entities: entities extracted from the input document (see entities)

groups: entity groups containing entity tuples from the input document (see groups)

supported groups: [ “items” | ”taxRates” ]

filename: name of the input document file

text: OCR text read from the input document

hocr: document’s content in hOCR format; enabled via input parameter “getHocr”

errorMsg: error description; null on success

3.6.Example

Request

GET /entities/229cdae0162805414755d5ee7eed216bc975738c
headers {"x-api-key": , "customer-id": }

Response

{
    "documentClass": "INVOICE_DE",
    "errorMsg": null,
    "entities": {
        "vendor_city": [
            {
                "value": "INGOLSTADT",
                "originalValue": "INGOLSTADT",
                "confidence": 0.98210305,
                "verified": null
            }
        ],
        "vendor_zip": [
            {
                "value": "85046",
                "originalValue": "85046",
                "confidence": 0.8447278,
                "verified": true
            }
        ],
        "vendor_vatNumber": [],
        "vendor_iban": [],
        "recipient_street": [
            {
                "value": "Rudolf-Harbig-Weg 26",
                "originalValue": "Rudolf-Harbig-Weg 26",
                "confidence": 0.9971908,
                "verified": null
            }
        ],
        "invoice_invoiceNumber": [
            {
                "value": "458001350",
                "originalValue": "458001350",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_orderNumber": [],
        "recipient_accountNumber": [],
        "invoice_taxRateGroup_taxAmount": [
            {
                "value": "14.06",
                "originalValue": "14,06",
                "confidence": null,
                "verified": true
            }
        ],
        "vendor_taxIdNumber": [],
        "recipient_city": [
            {
                "value": "Münster",
                "originalValue": "Münster",
                "confidence": 0.9588283,
                "verified": null
            }
        ],
        "recipient_zip": [
            {
                "value": "48149",
                "originalValue": "48149",
                "confidence": 0.9821135,
                "verified": true
            }
        ],
        "invoice_taxRateGroup_netAmount": [],
        "invoice_deliveryNumber": [],
        "recipient_company": [],
        "vendor_bic": [],
        "vendor_name": [
            {
                "value": "MEDIA MARKT E-BUSINESS GMBH",
                "originalValue": "MEDIA MARKT E-BUSINESS GMBH",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_dueDate": [],
        "invoice_invoiceCurrency": [
            {
                "value": "EUR",
                "originalValue": "€",
                "confidence": null,
                "verified": true
            }
        ],
        "vendor_bankName": [],
        "invoice_deliveryDate": [],
        "invoice_invoiceDate": [
            {
                "value": "10.08.2017",
                "originalValue": "10.08.2017",
                "confidence": 0.9015204,
                "verified": true
            },
            {
                "value": "31.08.2017",
                "originalValue": "31.08.2017",
                "confidence": 0.8834753,
                "verified": true
            }
        ],
        "invoice_taxRateGroup_taxRate": [
            {
                "value": "190",
                "originalValue": "190%",
                "confidence": null,
                "verified": null
            }
        ],
        "vendor_street": [
            {
                "value": "WANKELSTRASSE 5",
                "originalValue": "WANKELSTRASSE 5",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_invoiceGrossAmount": [
            {
                "value": "88.05",
                "originalValue": "88,05",
                "confidence": 0.46333426,
                "verified": true
            },
            {
                "value": "73.99",
                "originalValue": "73,99",
                "confidence": 0.45646423,
                "verified": false
            },
            {
                "value": "4.99",
                "originalValue": "4,99",
                "confidence": null,
                "verified": false
            }
        ]
    },
    "groups": {
        "taxRates": [
            {
                "members": {
                    "invoice_taxRateGroup_taxAmount": {
                        "value": "14.06",
                        "originalValue": "14,06",
                        "confidence": null,
                        "verified": true
                    },
                    "invoice_taxRateGroup_taxRate": {
                        "value": "190",
                        "originalValue": "190%",
                        "confidence": null,
                        "verified": null
                    }
                },
                "verified": null
            }
        ],
        "items": []
    },
    "text": "MEDIA MARKT E-BUSINESS GMBH “(\n\nWANKELSTRASSE 5 -\n\n85046 INGOLSTADT\n\nTel.: 0841/6344545\n\nE-Mail: ONLINESHOP@MEDIAMARKT.DE\n\nRechnungsadresse Rechnung Nr. 458001350\n\nDaniel Winter Rechnungsdatum 10.08.2017\n\nRudolf-Harbig-Weg 26\n\n48149 Münster Kunden-Nr. 3050789\n\nFällig Am 31.08.2017\n\nRechnung Betrag €88,05\n\nMenge Beschreibung Einzelpreis Gesamtpreis\n\n1 PIXMA MX475 A4 MFP INJEKT (P) 69,00 69,00\n\n1 Versandkosten 4,99 4,99\nSumme Netto 73,99\nMwSt. 190% 14,06\n\n",
    "filename": "Demo.pdf",
    "hocr": null
}
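
Because the values of each ungrouped entity are sorted by decreasing probability, a client can reduce a response like the one above to a single value per entity. The sketch below is one possible selection policy, not part of the API: it takes the first value that did not fail validation and treats a null ‘verified’ flag as acceptable. The function name `best_values` is hypothetical.

```python
from typing import Any, Dict, Optional


def best_values(response: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Map each entity name to its most likely normalized value, or None."""
    result: Dict[str, Optional[str]] = {}
    for name, candidates in response.get("entities", {}).items():
        # Skip candidates that failed validation (verified is False);
        # verified may also be None, which this sketch treats as acceptable.
        accepted = [c for c in candidates if c.get("verified") is not False]
        result[name] = accepted[0]["value"] if accepted else None
    return result
```

For the example response this keeps “88.05” for ‘invoice_invoiceGrossAmount’ and yields `None` for entities with an empty list such as ‘vendor_iban’.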

4.POST /jobs/query

Used to query the state of multiple jobs.

4.1.Request Header

content-type: HTTP content type

supported values: “application/json“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

4.2.Request Parameter

The request body is a JSON object with the following fields:

jobIds : list of job ids received from calling POST /document

type: list

required: yes

4.3.Response HTTP Status

200: Query finished successfully.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.

4.4.Response Header

content-type: HTTP content type

supported values: “application/json”

4.5.Response Body

jobs: map containing job ids from request as keys and job states as values

type: Map

supported job states: [ “PROCESSING” | “FINISHED” | “UNKNOWN” ]

4.6.Example

Request

POST /jobs/query
headers {"x-api-key": , "customer-id": }
body {
  "jobIds": [
    "229cdae0162805414755d5ee7eed216bc975738c",
    "ab4dcb09e196acd6d859f571b97d94d8dc7fae57",
    "1848c3ed29f03ade5cf29c31d7b6dc0665c4d836"
  ]
}

 

Response

{
  "jobs": {
    "229cdae0162805414755d5ee7eed216bc975738c": "FINISHED",
    "ab4dcb09e196acd6d859f571b97d94d8dc7fae57": "PROCESSING",
    "1848c3ed29f03ade5cf29c31d7b6dc0665c4d836": "UNKNOWN"
  }
}
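
As an illustration of the request and response shapes above, the sketch below builds the JSON body for POST /jobs/query and groups the returned job ids by state. Both function names are hypothetical; the field names ‘jobIds’ and ‘jobs’ and the three states are taken from sections 4.2 and 4.5.

```python
import json
from typing import Dict, Iterable, List


def jobs_query_body(job_ids: Iterable[str]) -> bytes:
    """Serialize the request body for POST /jobs/query."""
    return json.dumps({"jobIds": list(job_ids)}).encode("utf-8")


def group_by_state(response: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Turn the 'jobs' map (job id -> state) into state -> list of job ids."""
    grouped: Dict[str, List[str]] = {"PROCESSING": [], "FINISHED": [], "UNKNOWN": []}
    for job_id, state in response.get("jobs", {}).items():
        grouped.setdefault(state, []).append(job_id)
    return grouped
```

The ids listed under “FINISHED” are the ones worth passing to GET /entities/.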

 

5.GET /uploadurl

Used to request a URL for the upload of large document files (> 4 MB). After uploading your document you must use endpoint POST /document to start document processing.

5.1.Request Header

content-type: HTTP content type 

supported values: “application/json“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

5.2.Response HTTP Status

200: Successfully generated upload URL.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

5.3.Response Header

content-type: HTTP content type

supported values: “application/json”

5.4.Response Body

uploadUrl: URL for uploading your document file using HTTP REST. Each URL must only be used once.

uploadId: upload id that must be passed as parameter ‘uploadId’ in POST /document endpoint for processing your uploaded document file

5.5.Example

Request

GET /uploadurl
headers {"x-api-key": , "customer-id": }

Response

{
    "uploadUrl": ,
    "uploadId": "ac9e77e0-13cd-4d81-a4f1-9ba88b52899d"
}
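
The choice between a direct upload and the upload URL depends only on the file size. The sketch below spells out the resulting sequence of calls; it is a planning helper, not a client. The function name is hypothetical, and reading “4 MB” as 4 * 1024 * 1024 bytes is an assumption, since the documentation does not say which definition of megabyte applies.

```python
from typing import List

# Assumption: "4 MB" is interpreted as 4 MiB.
MAX_DIRECT_UPLOAD_BYTES = 4 * 1024 * 1024


def upload_steps(file_size: int) -> List[str]:
    """Return the sequence of API calls needed for a file of the given size."""
    if file_size <= MAX_DIRECT_UPLOAD_BYTES:
        return ["POST /document with parameter document"]
    return [
        "GET /uploadurl",
        "upload the file to uploadUrl",
        "POST /document with parameter uploadId",
    ]
```

Since each upload URL and each upload id must only be used once, the three-step sequence has to be repeated from the start for every large file.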

6.POST /training

Used to upload training samples for training of the extraction models.

6.1.Request Header

content-type: HTTP content type

supported values: [ “application/json“ | “multipart/form-data“ ]

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

6.2.Request Parameter

The request body includes the following fields, in JSON format unless a document file is attached (see parameter “document”):

documentClass : document class id of training document

required : yes

supported values : [ “INVOICE” | “CONTRACT” ]

language : language id of the training document

required : yes

supported values : [ “en” | “de” ]

text : plain text from training document

required: yes

document : document file

supported file types: [ pdf | tiff | jpg ]

required: no

Usage of parameter “document” requires content-type “multipart/form-data”. If parameter “document” is omitted, the content-type must be “application/json”.

entities : entities from training document

required : yes

Must only include supported entities for your document class. Entities may be omitted or may include empty values (i.e., empty array). Each entity must contain a single attribute ‘value’ that contains the training value as String parameter.

6.3.Response HTTP Status

200: Successfully submitted training data.

204: Training data empty.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.

6.4.Response Header

content-type: HTTP content type

supported values: “application/json”

6.5.Response Body

errorMsg: error description; null on success

6.6.Example

Request

POST /training
headers {"x-api-key": , "customer-id": }
body {
  "documentClass": "INVOICE",
  "language": "de",
  "text": "MEDIA MARKT E-BUSINESS GMBH “(\n\nWANKELSTRASSE 5 -\n\n85046 INGOLSTADT\n\nTel.: 0841/6344545\n\nE-Mail: ONLINESHOP@MEDIAMARKT.DE\n\nRechnungsadresse Rechnung Nr. 458001350\n\nDaniel Winter Rechnungsdatum 10.08.2017\n\nRudolf-Harbig-Weg 26\n\n48149 Münster Kunden-Nr. 3050789\n\nFällig Am 31.08.2017\n\nRechnung Betrag €88,05\n\nMenge Beschreibung Einzelpreis Gesamtpreis\n\n1 PIXMA MX475 A4 MFP INJEKT (P) 69,00 69,00\n\n1 Versandkosten 4,99 4,99\nSumme Netto 73,99\nMwSt. 190% 14,06\n\n",
  "entities": {
    "vendor_city": [
      {
        "value": "INGOLSTADT"
      }
    ],
    "vendor_zip": [
      {
        "value": "85046"
      }
    ],
    "recipient_street": [
      {
        "value": "Rudolf-Harbig-Weg 26"
      }
    ],
    "invoice_invoiceNumber": [
      {
        "value": "458001350"
      }
    ],
    "invoice_taxRateGroup_taxAmount": [
      {
        "value": "14,06"
      }
    ],
    "recipient_city": [
      {
        "value": "Münster"
      }
    ],
    "recipient_zip": [
      {
        "value": "48149"
      }
    ],
    "vendor_name": [
      {
        "value": "MEDIA MARKT E-BUSINESS GMBH"
      }
    ],
    "invoice_invoiceCurrency": [
      {
        "value": "€"
      }
    ],
    "invoice_invoiceDate": [
      {
        "value": "10.08.2017"
      }
    ],
    "invoice_taxRateGroup_taxRate": [
      {
        "value": "19,0"
      }
    ],
    "vendor_street": [
      {
        "value": "WANKELSTRASSE 5"
      }
    ],
    "invoice_invoiceGrossAmount": [
      {
        "value": "88,05"
      }
    ]
  }
}
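
A JSON training request like the one above could be assembled as in the following sketch, which covers only the case without a document file (content-type “application/json”). The function name and its signature are assumptions; the field names and the supported values for ‘documentClass’ and ‘language’ come from section 6.2.

```python
import json
from typing import Dict, List


def build_training_payload(
    document_class: str,
    language: str,
    text: str,
    values: Dict[str, List[str]],
) -> bytes:
    """Serialize a POST /training body; values maps entity names to training values."""
    if document_class not in ("INVOICE", "CONTRACT"):
        raise ValueError("documentClass must be INVOICE or CONTRACT")
    if language not in ("en", "de"):
        raise ValueError("language must be en or de")
    # Each entity value is wrapped in an object with the single attribute 'value';
    # an empty list is allowed and is sent as an empty array.
    entities = {name: [{"value": v} for v in vals] for name, vals in values.items()}
    payload = {
        "documentClass": document_class,
        "language": language,
        "text": text,
        "entities": entities,
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")
```

The sketch does not check that the entity names are supported for the chosen document class; the lists in section 7 would be the basis for such a check.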

7.Supported Entity Types

7.1.Invoice Entities

For invoice documents, the Entity Extraction API provides a default set of entities that are extracted. The Buildsimple team may add additional entities to the default entity set for invoice documents in future releases.

7.1.1.Invoice

The following entities are located in field ‘entities’ of the response from GET /entities/.

Entity name                   Description
invoice_invoiceDate           invoice date
invoice_invoiceNumber         invoice number
invoice_orderNumber           order number
invoice_deliveryDate          delivery date
invoice_invoiceCurrency       invoice currency
invoice_invoiceGrossAmount    invoice gross amount
invoice_dueDate               due date
invoice_deliveryNumber        delivery number

7.1.2.Vendor

The following entities are located in field ‘entities’ of the response from GET /entities/.

Entity name           Description
vendor_name           vendor name
vendor_street         vendor street name and house number
vendor_zip            vendor zip code
vendor_city           vendor city
vendor_bankName       name of the vendor’s bank
vendor_iban           vendor IBAN
vendor_bic            vendor BIC
vendor_taxIdNumber    vendor tax id
vendor_vatNumber      vendor VAT number

7.1.3.Recipient

The following entities are located in field ‘entities’ of the response from GET /entities/.

Entity name          Description
recipient_company    name of the recipient’s company
recipient_street     recipient street and house number
recipient_zip        recipient zip code
recipient_city       recipient city

7.1.4.Invoice Items

The following entities are located in field ‘members’ within list groups[‘items’] of the response from GET /entities/. Groups may be incomplete, i.e., contain only 1–5 entities.

Entity name                   Description
item_group_quantity           quantity of an invoice item (grouped by invoice item)
item_group_singleNetAmount    single net amount (grouped by invoice item)
item_group_totalNetAmount     total net amount (grouped by invoice item)
item_group_description        description of invoice item (grouped by invoice item)
item_group_materialNumber     material number (grouped by invoice item)
item_group_taxRate            tax rate applied to invoice item (grouped by invoice item)

7.1.5.Tax Rates

The following entities are located in field ‘members’ within list groups[‘taxRates’] of the response from GET /entities/. Groups may be incomplete, i.e., contain only 1–2 entities.

Entity name                       Description
invoice_taxRateGroup_taxRate      tax rate
invoice_taxRateGroup_netAmount    total net amount (grouped by tax rate)
invoice_taxRateGroup_taxAmount    total tax amount (grouped by tax rate)

7.2.Contract Entities

For contract documents, the Entity Extraction API provides a default set of entities that are extracted. The Buildsimple team may add additional entities to the default entity set for contract documents in future releases.

7.2.1.Contractor

The following entities are located in field ‘entities’ of the response from GET /entities/.

Entity name           Description
contractor_name       contractor name
contractor_street     contractor street and house number
contractor_zip        contractor zip code
contractor_city       contractor city
contractor_contact    contact person on contractor side

7.2.2.Contractee

The following entities are located in field ‘entities’ of the response from GET /entities/.

Entity name           Description
contractor_name       contractor name
contractor_street     contractor street and house number
contractor_zip        contractor zip code
contractor_city       contractor city
contractor_contact    contact person on contractor side

7.2.3.Contract

The following entities are located in field ‘entities’ of the response from GET /entities/.

Entity name            Description
contract_number        contract number
contract_date          contract date
contract_begin_date    contract begin date
contract_end_date      contract end date
contract_period        duration of contract
contract_object        object of contract
contract_volume        volume of contract
contract_currency      currency of contract volume