Entity Extraction API


1.Overview

Base URL for all requests:

https://uaz3xro0r4.execute-api.eu-central-1.amazonaws.com/PROD/

2.POST /document

Used to upload your document for entity extraction. Document data can be uploaded either as a file or as OCR data, e.g., from a previous OCR run. The response includes a job id, which is used to poll for results via the REST interface GET /entities/ (see section 3).

2.1.Request Header

content-type: HTTP content type

supported values: “multipart/form-data; boundary=

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

2.2.Request Parameter

The request body includes the following list of form parameters:

document : file containing your document; documents must be limited to a file size of 4 MB; entity extraction is limited to the first 10 pages of the document

supported file types: pdf (single and multi page), tiff (single page and multi page, supported compressions: none, adobe_deflate, ccitt group 3 or 4, lzw) and jpg

required: yes, excludes usage of parameters „text“ and „hocr“

 

The parameter „document“ and the parameters „text“/„hocr“ are mutually exclusive!

text : OCR text of your document, e.g., resulting from previous OCR

required: yes, requires usage of parameter „hocr“ and excludes usage of parameter „document“

hocr : hOCR data of your document, e.g., resulting from previous OCR

required: yes, requires usage of parameter „text“ and excludes usage of parameter „document“

 

Usage of parameters „text“ and „hocr“ will skip the OCR step of the Entity Extraction API and, hence, significantly improve the request performance!
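The exclusivity rules for „document“, „text“ and „hocr“ can be checked on the client before a request is sent. The following is a minimal sketch only: the function name check_input_params is hypothetical and not part of the API, and the sketch does not cover the „uploadId“ case for large files.

```python
def check_input_params(params):
    """Validate the document/text/hocr combination described above.

    `params` is any collection of form parameter names to be sent.
    Returns "file" or "ocr"; raises ValueError for a disallowed combination.
    """
    names = set(params)
    has_document = "document" in names
    has_text = "text" in names
    has_hocr = "hocr" in names

    # "document" excludes both OCR parameters.
    if has_document and (has_text or has_hocr):
        raise ValueError("'document' excludes 'text' and 'hocr'")
    # "text" and "hocr" require each other.
    if has_text != has_hocr:
        raise ValueError("'text' and 'hocr' must be used together")
    if not (has_document or has_text):
        raise ValueError("either 'document' or 'text' + 'hocr' is required")
    return "file" if has_document else "ocr"
```

For example, check_input_params({"text", "hocr", "language"}) returns "ocr", while passing only "text" raises a ValueError.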

language : language used for character recognition (OCR)

supported values: [ ”en” | ”de” | ”en+de”]

required: no

default: “en+de”

documentClass : domain of your document; determines the entity types extracted by the Entity Extraction API

supported values: [ ”invoice” | ”contract” ]

required: no

default: determined automatically

useEmbeddedText : use embedded document text to skip OCR step and, hence, improve request performance; only applicable for pdf files when using parameter „document“

supported values: [ „true“ | „false“ ]

required: no

default: „false“

getHocr : return the document’s content in hOCR format (in addition to plain text)

supported values: [ ”true” | ”false” ]

required: no

default: “false”

uploadId : required for processing document files > 4 MB: upload id returned from endpoint GET /uploadurl; each upload id must only be used once

required: no

callbackUrl : callback URL to which the Entity Extraction API sends an HTTP POST request once document processing has finished; the callback request includes a job id which can be used to call GET /entities/ for the extraction results

required: no

Example Callback

POST  

headers {"content-type": "application/json"}
body {"jobId": "229cdae0162805414755d5ee7eed216bc975738c"}
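
On the receiving side, the callback body only needs to be parsed for the job id. A minimal sketch, assuming the callback arrives as a header mapping plus a JSON body string as in the example above; the function name job_id_from_callback is hypothetical and not part of the API:

```python
import json


def job_id_from_callback(headers, body):
    """Return the job id from a callback request, or raise ValueError."""
    # Header names are case-insensitive, so compare in lower case.
    content_type = ""
    for name, value in headers.items():
        if name.lower() == "content-type":
            content_type = value
    if not content_type.lower().startswith("application/json"):
        raise ValueError("unexpected content type: %r" % content_type)

    payload = json.loads(body)  # invalid JSON also raises a ValueError subclass
    job_id = payload.get("jobId") if isinstance(payload, dict) else None
    if not isinstance(job_id, str) or not job_id:
        raise ValueError("callback body has no jobId")
    return job_id
```

The returned job id can then be used to call GET /entities/ for the extraction results.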

2.3.Response HTTP Status

200: Document uploaded successfully.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

415: Unsupported file format.

429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.

2.4.Response Header

content-type: HTTP content type

supported values: “application/json”

2.5.Response Body

jobId: Job id used for polling the resulting entities from REST interface /entities

type: String

uploadFile: Description of the uploaded file

type: Map

 

object properties:

  • name: "size"
    type: Integer
  • name: "mime"
    type: String
  • name: "name"
    type: String

2.6.Example

Request
POST /document

headers {"x-api-key": , "customer-id": }
body {"document": }
Response
{
  "jobId": "229cdae0162805414755d5ee7eed216bc975738c",
  "uploadFile": {
    "size": 30393,
    "mime": "application/pdf",
    "name": "Demo.pdf"
  }
}
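
The request above could be assembled in Python roughly as follows. This is a sketch using only the standard library; it builds the request without sending it. The function name build_document_request and its arguments are hypothetical, and the credential values are placeholders you receive upon registration.

```python
import uuid
import urllib.request

BASE_URL = "https://uaz3xro0r4.execute-api.eu-central-1.amazonaws.com/PROD/"


def build_document_request(file_name, file_bytes, mime, api_key, customer_id, fields=None):
    """Build (but do not send) a multipart/form-data POST /document request."""
    boundary = uuid.uuid4().hex
    parts = []
    # Optional plain form parameters such as language or documentClass.
    for name, value in (fields or {}).items():
        parts.append(
            ('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n%s\r\n'
             % (boundary, name, value)).encode("utf-8")
        )
    # The file itself is sent as form parameter "document".
    parts.append(
        ('--%s\r\nContent-Disposition: form-data; name="document"; '
         'filename="%s"\r\nContent-Type: %s\r\n\r\n'
         % (boundary, file_name, mime)).encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(("--%s--\r\n" % boundary).encode("utf-8"))

    return urllib.request.Request(
        BASE_URL + "document",
        data=b"".join(parts),
        method="POST",
        headers={
            "content-type": "multipart/form-data; boundary=" + boundary,
            "x-api-key": api_key,
            "customer-id": customer_id,
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it; the JSON response then carries the jobId shown above.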

3.GET /entities/

Used to poll for the results of processing the uploaded document, using the job id returned in the response of REST interface POST /document.

 

The response includes the resulting entities which are organised in ungrouped and grouped entities depending on the entity type:

  • Ungrouped entities (field ‘entities’) consist of an entity name and a list of 0..n entity values which are sorted according to decreasing probability, i.e., the first value is the most likely result. 

    Each entity value includes the following attributes:

    • originalValue: OCR value that was read from the document
    • value: normalized value for specific entity types, e.g., for currency, ‚€‘ is replaced by ‚EUR‘
    • confidence: float value between 0..1 which denotes the probability that an entity value is valid, i.e., the Entity Extraction API proposes potentially multiple values for each entity type which might include valid and invalid values
    • verified: boolean flag which denotes that an entity value is valid with respect to a dedicated set of high-level validation rules that are applied by the Entity Extraction API to each entity type, e.g., invoice amounts must be parsable to floating point values
  • Grouped entities (field ‘groups’) consist of a group name and an unsorted list of 0..n group entities. Currently supported group types:
    • taxRates: includes entities ‘invoice_taxRateGroup_taxRate’, ‘invoice_taxRateGroup_taxAmount’ and ‘invoice_taxRateGroup_netAmount’
    • items: includes entities ‘item_group_quantity’, ‘item_group_singleNetAmount’ and ‘item_group_totalNetAmount’

    Each group entity includes the following attributes:

    • members: a tuple of entity values (see above) that are related to each other; each group type is assigned to a static set of entity types; in a particular group entity, each entity type can be included exactly once or can be missing due to suboptimal extraction results
    • verified: boolean flag which denotes that a group entity is consistent with respect to a dedicated set of high-level validation rules that are applied by the Entity Extraction API to each group type, e.g., ‘taxRate’ * ‘netAmount’ = ‘taxAmount’
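
Because entity values are sorted by decreasing probability, a client can usually take the first list element as the result. The sketch below illustrates this for ungrouped entities and flattens group entities into simple rows. The helper names best_value and group_rows are hypothetical; whether to insist on verified values is a design choice left to the caller.

```python
def best_value(values, require_verified=False):
    """Return the most likely entity value, or None if there is none.

    `values` is the list stored under one entity name in field 'entities'.
    With require_verified=True, values whose 'verified' flag is not true
    (i.e., false or null) are skipped.
    """
    for candidate in values:
        if require_verified and candidate.get("verified") is not True:
            continue
        return candidate
    return None


def group_rows(groups, group_name):
    """Flatten the group entities of one group into {entity name: value} rows."""
    rows = []
    for group_entity in groups.get(group_name, []):
        members = group_entity["members"]
        # Entity types missing from a group entity are simply absent here.
        rows.append({name: member["value"] for name, member in members.items()})
    return rows
```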

3.1.Request Header

content-type: HTTP content type

supported values: “application/json“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

3.2.Path Variable

JOB_ID : the job id is received from the response of the call to REST interface /document

3.3.Response HTTP Status

200: Entity extraction successful.

202: Job processing not finished yet.

204: No entities found.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

422: File cannot be processed.

429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.
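
The status codes above suggest a simple polling loop: repeat while the endpoint answers 202, stop on 200 or 204, and treat everything else as an error. A minimal sketch, in which the HTTP call is passed in as a caller-supplied function fetch(job_id) returning a (status, body) pair; the function name poll_entities, the interval, and the attempt limit are assumptions, not values prescribed by the API:

```python
import time


def poll_entities(fetch, job_id, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll GET /entities/ until the job is no longer in progress.

    Returns the response body on 200 and None on 204 (no entities found).
    Raises RuntimeError for any other status except 202, and TimeoutError
    if the job is still processing after max_attempts polls.
    """
    for _ in range(max_attempts):
        status, body = fetch(job_id)
        if status == 200:
            return body
        if status == 204:
            return None
        if status != 202:
            raise RuntimeError("GET /entities failed with HTTP %d" % status)
        sleep(interval)  # 202: job processing not finished yet
    raise TimeoutError("job %s not finished after %d attempts" % (job_id, max_attempts))
```

A fixed interval keeps the sketch short; with many concurrent jobs, the callbackUrl parameter of POST /document or the POST /jobs/query endpoint avoids polling each job separately.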

3.4.Response Header

content-type: HTTP content type

supported values: “application/json”

3.5.Response Body

documentClass: domain of the input document

supported values: [ „INVOICE_DE“ | „INVOICE_EN“ | „CONTRACT_DE“ | „CONTRACT_EN“ ]

entities: entities extracted from the input document (see entities)

groups: entity groups containing entity tuples from the input document (see groups)

supported groups: [ “items” | ”taxRates” ]

filename: name of the input document file

text: OCR text read from the input document

hocr: document’s content in hOCR format; enabled via input parameter „getHocr“. The official hOCR specification is available at: https://github.com/kba/hocr-spec

errorMsg: error description; null on success

3.6.Example

Request
GET /entities/229cdae0162805414755d5ee7eed216bc975738c
headers {"x-api-key": , "customer-id": }
Response
{
    "documentClass": "INVOICE_DE",
    "errorMsg": null,
    "entities": {
        "vendor_city": [
            {
                "value": "INGOLSTADT",
                "originalValue": "INGOLSTADT",
                "confidence": 0.98210305,
                "verified": null
            }
        ],
        "vendor_zip": [
            {
                "value": "85046",
                "originalValue": "85046",
                "confidence": 0.8447278,
                "verified": true
            }
        ],
        "vendor_vatNumber": [],
        "vendor_iban": [],
        "recipient_street": [
            {
                "value": "Rudolf-Harbig-Weg 26",
                "originalValue": "Rudolf-Harbig-Weg 26",
                "confidence": 0.9971908,
                "verified": null
            }
        ],
        "invoice_invoiceNumber": [
            {
                "value": "458001350",
                "originalValue": "458001350",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_orderNumber": [],
        "recipient_accountNumber": [],
        "invoice_taxRateGroup_taxAmount": [
            {
                "value": "14.06",
                "originalValue": "14,06",
                "confidence": null,
                "verified": true
            }
        ],
        "vendor_taxIdNumber": [],
        "recipient_city": [
            {
                "value": "Münster",
                "originalValue": "Münster",
                "confidence": 0.9588283,
                "verified": null
            }
        ],
        "recipient_zip": [
            {
                "value": "48149",
                "originalValue": "48149",
                "confidence": 0.9821135,
                "verified": true
            }
        ],
        "invoice_taxRateGroup_netAmount": [],
        "invoice_deliveryNumber": [],
        "recipient_company": [],
        "vendor_bic": [],
        "vendor_name": [
            {
                "value": "MEDIA MARKT E-BUSINESS GMBH",
                "originalValue": "MEDIA MARKT E-BUSINESS GMBH",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_dueDate": [],
        "invoice_invoiceCurrency": [
            {
                "value": "EUR",
                "originalValue": "€",
                "confidence": null,
                "verified": true
            }
        ],
        "vendor_bankName": [],
        "invoice_deliveryDate": [],
        "invoice_invoiceDate": [
            {
                "value": "10.08.2017",
                "originalValue": "10.08.2017",
                "confidence": 0.9015204,
                "verified": true
            },
            {
                "value": "31.08.2017",
                "originalValue": "31.08.2017",
                "confidence": 0.8834753,
                "verified": true
            }
        ],
        "invoice_taxRateGroup_taxRate": [
            {
                "value": "190",
                "originalValue": "190%",
                "confidence": null,
                "verified": null
            }
        ],
        "vendor_street": [
            {
                "value": "WANKELSTRASSE 5",
                "originalValue": "WANKELSTRASSE 5",
                "confidence": null,
                "verified": null
            }
        ],
        "invoice_invoiceGrossAmount": [
            {
                "value": "88.05",
                "originalValue": "88,05",
                "confidence": 0.46333426,
                "verified": true
            },
            {
                "value": "73.99",
                "originalValue": "73,99",
                "confidence": 0.45646423,
                "verified": false
            },
            {
                "value": "4.99",
                "originalValue": "4,99",
                "confidence": null,
                "verified": false
            }
        ]
    },
    "groups": {
        "taxRates": [
            {
                "members": {
                    "invoice_taxRateGroup_taxAmount": {
                        "value": "14.06",
                        "originalValue": "14,06",
                        "confidence": null,
                        "verified": true
                    },
                    "invoice_taxRateGroup_taxRate": {
                        "value": "190",
                        "originalValue": "190%",
                        "confidence": null,
                        "verified": null
                    }
                },
                "verified": null
            }
        ],
        "items": []
    },
    "text": "MEDIA MARKT E-BUSINESS GMBH “(\n\nWANKELSTRASSE 5 -\n\n85046 INGOLSTADT\n\nTel.: 0841/6344545\n\nE-Mail: ONLINESHOP@MEDIAMARKT.DE\n\nRechnungsadresse Rechnung Nr. 458001350\n\nDaniel Winter Rechnungsdatum 10.08.2017\n\nRudolf-Harbig-Weg 26\n\n48149 Münster Kunden-Nr. 3050789\n\nFällig Am 31.08.2017\n\nRechnung Betrag €88,05\n\nMenge Beschreibung Einzelpreis Gesamtpreis\n\n1 PIXMA MX475 A4 MFP INJEKT (P) 69,00 69,00\n\n1 Versandkosten 4,99 4,99\nSumme Netto 73,99\nMwSt. 190% 14,06\n\n",
    "filename": "Demo.pdf",
    "hocr": "<body xmlns=\"http://www.w3.org/1999/xhtml\">
  <div class=\"ocr_page\" id=\"page_1\" title=\"ppageno 0; bbox 0 0 1230 1740; image &quot;None&quot;; textangle 0\">
    <span class=\"ocrx_word\" id=\"word_1_0\" title=\"bbox 794 55 924 70 x_wconf 95.0 baseline -0.004 0\">MEDIA</span>
    <span class=\"ocrx_word\" id=\"word_1_1\" title=\"bbox 935 61 941 66 x_wconf 69.0 baseline 0.005 0\">MARKT</span>
    <span class=\"ocrx_word\" id=\"word_1_2\" title=\"bbox 954 56 1044 70 x_wconf 96.0 baseline 0.006 0\">E-BUSINESS</span>
    <span class=\"ocrx_word\" id=\"word_1_3\" title=\"bbox 1055 61 1061 66 x_wconf 49.0 baseline 0.0 0\">GMBH</span>
    <span class=\"ocrx_word\" id=\"word_1_4\" title=\"bbox 1074 55 1162 70 x_wconf 96.0 baseline -0.006 0\" x_ner_id=\"97d20996-340d-4f43-9e5f-e73277ff9ac6_1\" x_ner_type=\"vendor_street\">WANKELSTRASSE</span>
    <span class=\"ocrx_word\" id=\"word_1_5\" title=\"bbox 57 60 98 94 x_wconf 92.0 baseline 0.0 0\" x_ner_id=\"97d20996-340d-4f43-9e5f-e73277ff9ac6_2\" x_ner_type=\"vendor_street\">5</span>
    [...]
  </div>
</body>"
}

4.POST /jobs/query

Used to query the state of multiple jobs in a single request.

4.1.Request Header

content-type: HTTP content type

supported values: “application/json“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

4.2.Request Parameter

The request body is a single JSON object with the following field:

jobIds : list of job ids received from calling POST /document

type: list

required: yes

4.3.Response HTTP Status

200: Query finished successfully.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.

4.4.Response Header

content-type: HTTP content type

supported values: “application/json”

4.5.Response Body

jobs: map containing job ids from request as keys and job states as values

type: Map

supported job states: [ „PROCESSING“ | „FINISHED“ | „UNKNOWN“ ]

4.6.Example

Request
POST /jobs/query
headers {"x-api-key": , "customer-id": }
body {
  "jobIds": [
    "229cdae0162805414755d5ee7eed216bc975738c",
    "ab4dcb09e196acd6d859f571b97d94d8dc7fae57",
    "1848c3ed29f03ade5cf29c31d7b6dc0665c4d836"
  ]
}
Response
{
  "jobs": {
    "229cdae0162805414755d5ee7eed216bc975738c": "FINISHED",
    "ab4dcb09e196acd6d859f571b97d94d8dc7fae57": "PROCESSING",
    "1848c3ed29f03ade5cf29c31d7b6dc0665c4d836": "UNKNOWN"
  }
}
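
On the client, the request body and the response map can be handled with a few lines of JSON code. A minimal sketch; the helper names build_jobs_query and jobs_in_state are hypothetical and not part of the API:

```python
import json


def build_jobs_query(job_ids):
    """Serialize the request body for POST /jobs/query."""
    return json.dumps({"jobIds": list(job_ids)})


def jobs_in_state(response_body, state):
    """Return the job ids that the response reports in the given state."""
    jobs = json.loads(response_body)["jobs"]
    return [job_id for job_id, job_state in jobs.items() if job_state == state]
```

Applied to the example response above, jobs_in_state(body, "FINISHED") yields the ids whose results can now be fetched via GET /entities/.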

5.GET /uploadurl

Used to request a URL for the upload of large document files (> 4 MB). After uploading your document you must use endpoint POST /document to start document processing.

5.1.Request Header

content-type: HTTP content type 

supported values: “application/json“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

5.2.Response HTTP Status

200: Successfully generated upload URL

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

5.3.Response Header

content-type: HTTP content type

supported values: “application/json”

5.4.Response Body

uploadUrl: URL for uploading your document file using HTTP REST. Each URL must only be used once.

uploadId: upload id that must be passed as parameter ‚uploadId‘ in POST /document endpoint for processing your uploaded document file

5.5.Example

Request
GET /uploadurl
 
headers {"x-api-key": , "customer-id": }
Response
{
    "uploadUrl": ,
    "uploadId": "ac9e77e0-13cd-4d81-a4f1-9ba88b52899d"
}
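
The overall flow for large files (request an upload URL, upload the file, then start processing with the upload id) can be sketched as below. Several points are assumptions rather than documented behavior: "4 MB" is read as 4 * 1024 * 1024 bytes, the parameter „document“ is assumed to be omitted when „uploadId“ is sent, and the three transport functions are hypothetical callables supplied by the caller, since this document does not specify the HTTP method for the upload itself.

```python
DIRECT_UPLOAD_LIMIT = 4 * 1024 * 1024  # assumption: "4 MB" interpreted as 4 MiB


def submit_document(file_bytes, get_upload_url, upload_file, post_document):
    """Submit a document, using GET /uploadurl for files above the size limit.

    Caller-supplied callables (all hypothetical):
      get_upload_url()        -> dict with "uploadUrl" and "uploadId"
      upload_file(url, data)  -> uploads the raw file to the given URL
      post_document(fields)   -> calls POST /document, returns the response dict
    Returns the job id from the POST /document response.
    """
    if len(file_bytes) <= DIRECT_UPLOAD_LIMIT:
        # Small file: send it directly as form parameter "document".
        return post_document({"document": file_bytes})["jobId"]

    # Large file: each upload URL and upload id must only be used once,
    # so a fresh pair is requested for every document.
    ticket = get_upload_url()
    upload_file(ticket["uploadUrl"], file_bytes)
    return post_document({"uploadId": ticket["uploadId"]})["jobId"]
```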

6.POST /training

Used to upload training samples for training of the extraction models.

6.1.Request Header

content-type: HTTP content type

supported values: “multipart/form-data“

required: yes

customer-id: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API

required: yes

6.2.Request Parameter

The request body includes the following list of form parameters:

documentClass : document class id of training document

required : yes

supported values : [ „INVOICE“ | „CONTRACT“ ]

language : language id of the training document

required : yes

supported values : [ „en“ | „de“ ]

text : plain text from training document

required: yes

document: training document file

supported file types: [ pdf | tiff | jpg ]

required: yes

entities : entities from training document

required : yes

Must only include entities supported for your document class. Entities may be omitted or may have empty values (i.e., an empty array). Each entity value must contain a single attribute ‘value’ that holds the training value as a String.

hocr : training document representation in XML-based hOCR format; the original hOCR output for the document file is returned in field ‚hocr‘ in response from GET /entities. The official hOCR specification is available at: https://github.com/kba/hocr-spec

required: yes

All entities must be marked in the hOCR data by assigning the attribute ‘x_ner_type’ to the corresponding ‘ocrx_word’ items. Each ‘ocrx_word’ item with attribute ‘x_ner_type’ needs an additional attribute ‘x_ner_id’ with a unique id. Entities consisting of multiple words must use ‘x_ner_id’ values with consecutive suffixes.

uploadId : required for uploading document files > 4 MB: upload id returned from endpoint GET /uploadurl; each upload id must only be used once

required: no
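
Before submitting training data, the hOCR markup rules can be partly checked on the client. The sketch below lists all marked ‘ocrx_word’ items and verifies that each carries a unique ‘x_ner_id’. It assumes the hOCR string is well-formed XML, as in the examples in this document; the function names tagged_words and check_ner_ids are hypothetical, and the rule about consecutive suffixes is not checked here.

```python
import xml.etree.ElementTree as ET


def tagged_words(hocr):
    """Return (x_ner_type, x_ner_id, text) for every marked ocrx_word item."""
    root = ET.fromstring(hocr)
    words = []
    for element in root.iter():
        if element.get("class") != "ocrx_word":
            continue
        ner_type = element.get("x_ner_type")
        if ner_type is None:
            continue  # word is not part of any entity
        words.append((ner_type, element.get("x_ner_id"), element.text or ""))
    return words


def check_ner_ids(words):
    """Raise ValueError if a marked word lacks an x_ner_id or reuses one."""
    seen = set()
    for ner_type, ner_id, text in words:
        if not ner_id:
            raise ValueError("word %r (%s) has no x_ner_id" % (text, ner_type))
        if ner_id in seen:
            raise ValueError("x_ner_id %r is used more than once" % ner_id)
        seen.add(ner_id)
```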

6.3.Response HTTP Status

200: Successfully submitted training data.

204: Training data empty.

400: Bad request. Missing or invalid input parameter.

401: Authorization failed. Operation not allowed.

403: Authorization failed due to invalid credentials.

429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.

6.4.Response Header

content-type: HTTP content type

supported values: “application/json”

6.5.Response Body

errorMsg: error description; null on success

6.6.Example

Request
POST /training
headers {"x-api-key": , "customer-id": }
body {
  "documentClass": "INVOICE",
  "language": "de",
  "text": "MEDIA MARKT E-BUSINESS GMBH “(\n\nWANKELSTRASSE 5 -\n\n85046 INGOLSTADT\n\nTel.: 0841/6344545\n\nE-Mail: ONLINESHOP@MEDIAMARKT.DE\n\nRechnungsadresse Rechnung Nr. 458001350\n\nDaniel Winter Rechnungsdatum 10.08.2017\n\nRudolf-Harbig-Weg 26\n\n48149 Münster Kunden-Nr. 3050789\n\nFällig Am 31.08.2017\n\nRechnung Betrag €88,05\n\nMenge Beschreibung Einzelpreis Gesamtpreis\n\n1 PIXMA MX475 A4 MFP INJEKT (P) 69,00 69,00\n\n1 Versandkosten 4,99 4,99\nSumme Netto 73,99\nMwSt. 190% 14,06\n\n",
  "entities": {
    "vendor_city": [
      {
        "value": "INGOLSTADT"
      }
    ],
    "vendor_zip": [
      {
        "value": "85046"
      }
    ],
    "recipient_street": [
      {
        "value": "Rudolf-Harbig-Weg 26"
      }
    ],
    "invoice_invoiceNumber": [
      {
        "value": "458001350"
      }
    ],
    "invoice_taxRateGroup_taxAmount": [
      {
        "value": "14,06"
      }
    ],
    "recipient_city": [
      {
        "value": "Münster"
      }
    ],
    "recipient_zip": [
      {
        "value": "48149"
      }
    ],
    "vendor_name": [
      {
        "value": "MEDIA MARKT E-BUSINESS GMBH"
      }
    ],
    "invoice_invoiceCurrency": [
      {
        "value": "€"
      }
    ],
    "invoice_invoiceDate": [
      {
        "value": "10.08.2017"
      }
    ],
    "invoice_taxRateGroup_taxRate": [
      {
        "value": "19,0"
      }
    ],
    "vendor_street": [
      {
        "value": "WANKELSTRASSE 5"
      }
    ],
    "invoice_invoiceGrossAmount": [
      {
        "value": "88,05"
      }
    ]
  },
 "hocr": "<body xmlns=\"http://www.w3.org/1999/xhtml\">
  <div class=\"ocr_page\" id=\"page_1\" title=\"ppageno 0; bbox 0 0 1230 1740; image &quot;None&quot;; textangle 0\">
    <span class=\"ocrx_word\" id=\"word_1_0\" title=\"bbox 794 55 924 70 x_wconf 95.0 baseline -0.004 0\">MEDIA</span>
    <span class=\"ocrx_word\" id=\"word_1_1\" title=\"bbox 935 61 941 66 x_wconf 69.0 baseline 0.005 0\">MARKT</span>
    <span class=\"ocrx_word\" id=\"word_1_2\" title=\"bbox 954 56 1044 70 x_wconf 96.0 baseline 0.006 0\">E-BUSINESS</span>
    <span class=\"ocrx_word\" id=\"word_1_3\" title=\"bbox 1055 61 1061 66 x_wconf 49.0 baseline 0.0 0\">GMBH</span>
    <span class=\"ocrx_word\" id=\"word_1_4\" title=\"bbox 1074 55 1162 70 x_wconf 96.0 baseline -0.006 0\" x_ner_id=\"97d20996-340d-4f43-9e5f-e73277ff9ac6_1\" x_ner_type=\"vendor_street\">WANKELSTRASSE</span>
    <span class=\"ocrx_word\" id=\"word_1_5\" title=\"bbox 57 60 98 94 x_wconf 92.0 baseline 0.0 0\" x_ner_id=\"97d20996-340d-4f43-9e5f-e73277ff9ac6_2\" x_ner_type=\"vendor_street\">5</span>
    [...]
  </div>
</body>"
}
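
A training sample is often derived from a previous extraction result that a human has reviewed. The sketch below turns the ‘entities’ field of a GET /entities response into the ‘entities’ structure shown above. It is an assumption of this sketch, based only on the example values (e.g., "14,06" and "€"), that training values should use the text as it appears in the document, so ‘originalValue’ of the most likely candidate is taken; manual corrections override it. The function name training_entities is hypothetical.

```python
def training_entities(extraction_entities, corrections=None):
    """Build the 'entities' form parameter for POST /training.

    extraction_entities: field 'entities' of a GET /entities response
    corrections: optional {entity name: corrected value} from manual review
    Each resulting value contains the single attribute 'value'.
    """
    corrections = corrections or {}
    result = {}
    for name, values in extraction_entities.items():
        if name in corrections:
            result[name] = [{"value": corrections[name]}]
        elif values:
            # Values are sorted by decreasing probability; take the first.
            result[name] = [{"value": values[0]["originalValue"]}]
        else:
            result[name] = []  # empty arrays are allowed
    return result
```

In the example above, the OCR text reads "190%" while the training value is "19,0"; such a fix would be passed via corrections.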