Base URL for all requests: https://api.buildsimple.com/extraction/
Uploads a document for entity extraction. Document data can be uploaded either as a file or as OCR data (e.g., from a previous OCR run). The response includes a job id, which is used to poll for results via REST interface GET /entities/
content-type: HTTP content type
supported values: “multipart/form-data; boundary=
required: yes
customer-id: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
The request body includes the following list of form parameters:
document : file containing your document; documents must be limited to a file size of 4 MB; entity extraction is limited to the first 10 pages of the document
supported file types: pdf (single and multi page), tiff (single page and multi page, supported compressions: none, adobe_deflate, ccitt group 3 or 4, lzw) and jpg
required: yes, excludes usage of parameters "text" and "hocr"
The parameters "document" and "text"/"hocr" are mutually exclusive!
text : OCR text of your document, e.g., resulting from previous OCR
required: yes, requires usage of parameter "hocr" and excludes usage of parameter "document"
hocr : hOCR data of your document, e.g., resulting from previous OCR
required: yes, requires usage of parameter "text" and excludes usage of parameter "document"
Using the parameters "text" and "hocr" skips the OCR step of the Entity Extraction API and hence significantly improves request performance!
language : language used for character recognition (OCR)
supported values: [ "en" | "de" | "en+de" ]
required: no
default: "en+de"
documentClass : domain of your document; determines the entity types extracted by the Entity Extraction API
supported values: [ "invoice" | "contract" ]
required: no
default: determined automatically
useEmbeddedText : use embedded document text to skip the OCR step and hence improve request performance; only applicable to pdf files when using parameter "document"
supported values: [ "true" | "false" ]
required: no
default: "false"
getHocr : return the document’s content in hOCR format (in addition to plain text)
supported values: [ "true" | "false" ]
required: no
default: "false"
uploadId : required for processing document files > 4 MB: upload id returned from endpoint GET /uploadurl; each upload id must only be used once
required: no
callbackUrl : callback URL to which the Entity Extraction API sends an HTTP POST request after document processing is finished; the callback request includes a job id which can be used to call GET /entities/ for the extraction results
required: no
POST headers {"content-type": "application/json"} body {"jobId": "229cdae0162805414755d5ee7eed216bc975738c"}
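The constraints on the form parameters above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the function name validate_document_params and the error messages are hypothetical, and it assumes that a request carrying only "uploadId" is acceptable, which the description above does not state explicitly.

```python
ALLOWED_LANGUAGES = {"en", "de", "en+de"}
ALLOWED_CLASSES = {"invoice", "contract"}
BOOLEAN_FLAGS = ("useEmbeddedText", "getHocr")


def validate_document_params(params):
    """Return a list of problems found in a dict of POST /document form parameters."""
    problems = []
    has_document = "document" in params
    has_text = "text" in params
    has_hocr = "hocr" in params

    # "document" and "text"/"hocr" are mutually exclusive.
    if has_document and (has_text or has_hocr):
        problems.append("'document' cannot be combined with 'text' or 'hocr'")
    # "text" and "hocr" require each other.
    if has_text != has_hocr:
        problems.append("'text' and 'hocr' must be supplied together")
    # Assumption: a request with only "uploadId" is treated as valid input.
    if not (has_document or has_text or has_hocr or "uploadId" in params):
        problems.append("supply 'document', or 'text' together with 'hocr'")

    if params.get("language", "en+de") not in ALLOWED_LANGUAGES:
        problems.append("unsupported 'language'")
    if "documentClass" in params and params["documentClass"] not in ALLOWED_CLASSES:
        problems.append("unsupported 'documentClass'")
    for flag in BOOLEAN_FLAGS:
        if params.get(flag, "false") not in ("true", "false"):
            problems.append(f"'{flag}' must be 'true' or 'false'")
    # useEmbeddedText only applies when a file is uploaded via "document".
    if params.get("useEmbeddedText") == "true" and not has_document:
        problems.append("'useEmbeddedText' only applies when 'document' is used")
    return problems
```

An empty list means that none of the documented constraints is violated; the server may still reject the request for other reasons (for example, file size or file type).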
200: Document uploaded successfully.
400: Bad request. Missing or invalid input parameter.
401: Authorization failed. Operation not allowed.
403: Authorization failed due to invalid credentials.
415: Unsupported file format.
429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.
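To stay below the limit of 100 requests/s and avoid status 429, requests can be throttled on the client. The sketch below uses a sliding one-second window; the class RequestThrottle is hypothetical and not provided by the API, and the clock and sleep functions are injectable only to make the behaviour easy to test.

```python
import time
from collections import deque


class RequestThrottle:
    """Client-side limiter: at most `max_requests` calls per `period` seconds."""

    def __init__(self, max_requests=100, period=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of requests inside the current window

    def wait(self):
        """Block until another request may be sent, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the sliding window.
            while self._sent and now - self._sent[0] >= self.period:
                self._sent.popleft()
            if len(self._sent) < self.max_requests:
                self._sent.append(now)
                return
            # Window is full: sleep until the oldest request expires.
            self._sleep(self.period - (now - self._sent[0]))
```

Calling wait() before every request to any endpoint keeps a single process under the limit. Since the limit applies per customer across all endpoints, several processes sharing the same credentials would need a shared limiter.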
content-type: HTTP content type
supported values: "application/json"
jobId: Job id used for polling the resulting entities from REST interface /entities
type: String
uploadFile: Description of the uploaded file
type: Map
object properties:
name: "size"
type: Integer
name: "mime"
type: String
name: "name"
type: String
POST /document headers {"x-api-key": , "customer-id": } body {"document": }
{ "jobId": "229cdae0162805414755d5ee7eed216bc975738c", "uploadFile": { "size": 30393, "mime": "application/pdf", "name": "Demo.pdf" } }
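The upload request shown above can be assembled with the Python standard library. This is a sketch under assumptions: the function build_document_request is hypothetical, the endpoint URL is formed by appending "document" to the base URL, and the file part is sent with a generic content type.

```python
import urllib.request
import uuid

BASE_URL = "https://api.buildsimple.com/extraction/"


def build_document_request(file_name, file_bytes, customer_id, api_key, fields=None):
    """Build (but do not send) a multipart/form-data request for POST /document."""
    boundary = uuid.uuid4().hex
    parts = []
    # Optional text fields such as "language" or "documentClass".
    for name, value in (fields or {}).items():
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(part.encode("utf-8"))
    # The file itself goes into the "document" form parameter.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(head.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        BASE_URL + "document",
        data=b"".join(parts),
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "customer-id": customer_id,
            "x-api-key": api_key,
        },
    )
```

The request can then be sent with urllib.request.urlopen(request); the "jobId" field of the JSON response is the value needed for GET /entities/.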
Polls for the results of processing an uploaded document, using the job id returned in the response of REST interface POST /document.
The response includes the resulting entities, which are organised into ungrouped entities (field "entities") and grouped entities (field "groups"), depending on the entity type.
Each entity value includes the attributes "value", "originalValue", "confidence" and "verified".
content-type: HTTP content type
supported values: "application/json"
required: yes
customer-id: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
JOB_ID : the job id received in the response of the call to REST interface POST /document
200: Entity extraction successful.
202: Job processing not finished yet.
204: No entities found.
400: Bad request. Missing or invalid input parameter.
401: Authorization failed. Operation not allowed.
403: Authorization failed due to invalid credentials.
422: File cannot be processed.
429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.
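The status codes above suggest a simple polling loop: repeat the request while the answer is 202, stop on 200 or 204, and treat everything else as an error. The sketch below is one possible shape; poll_entities and its fetch argument are hypothetical, and the interval and attempt limit are arbitrary values, not recommendations from the API.

```python
import json
import time


def poll_entities(job_id, fetch, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll GET /entities/<job id> until the job is no longer in state 202.

    `fetch(job_id)` is supplied by the caller and must return a tuple
    (status_code, body_text) for one GET /entities/<job id> call.
    """
    for _ in range(max_attempts):
        status, body = fetch(job_id)
        if status == 200:
            return json.loads(body)   # entity extraction successful
        if status == 204:
            return None               # finished, but no entities found
        if status != 202:
            raise RuntimeError(f"GET /entities/{job_id} failed with HTTP {status}")
        sleep(interval)               # 202: job processing not finished yet
    raise TimeoutError(f"job {job_id} not finished after {max_attempts} attempts")
```

When a callbackUrl was passed to POST /document, polling can be replaced by a single call once the callback arrives.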
content-type: HTTP content type
supported values: "application/json"
documentClass: domain of the input document
supported values: [ "INVOICE_DE" | "INVOICE_EN" | "CONTRACT_DE" | "CONTRACT_EN" ]
entities: entities extracted from the input document (see entities)
groups: entity groups containing entity tuples from the input document (see groups)
supported groups: [ "items" | "taxRates" ]
filename: name of the input document file
text: OCR text read from the input document
hocr: document’s content in hOCR format; enabled via input parameter "getHocr". The official hOCR specification is available at: https://github.com/kba/hocr-spec
errorMsg: error description; null on success
GET /entities/229cdae0162805414755d5ee7eed216bc975738c headers {"x-api-key": , "customer-id": }
{ "documentClass": "INVOICE_DE", "errorMsg": null, "entities": { "vendor_city": [ { "value": "INGOLSTADT", "originalValue": "INGOLSTADT", "confidence": 0.98210305, "verified": null } ], "vendor_zip": [ { "value": "85046", "originalValue": "85046", "confidence": 0.8447278, "verified": true } ], "vendor_vatNumber": [], "vendor_iban": [], "recipient_street": [ { "value": "Rudolf-Harbig-Weg 26", "originalValue": "Rudolf-Harbig-Weg 26", "confidence": 0.9971908, "verified": null } ], "invoice_invoiceNumber": [ { "value": "458001350", "originalValue": "458001350", "confidence": null, "verified": null } ], "invoice_orderNumber": [], "recipient_accountNumber": [], "invoice_taxRateGroup_taxAmount": [ { "value": "14.06", "originalValue": "14,06", "confidence": null, "verified": true } ], "vendor_taxIdNumber": [], "recipient_city": [ { "value": "Münster", "originalValue": "Münster", "confidence": 0.9588283, "verified": null } ], "recipient_zip": [ { "value": "48149", "originalValue": "48149", "confidence": 0.9821135, "verified": true } ], "invoice_taxRateGroup_netAmount": [], "invoice_deliveryNumber": [], "recipient_company": [], "vendor_bic": [], "vendor_name": [ { "value": "MEDIA MARKT E-BUSINESS GMBH", "originalValue": "MEDIA MARKT E-BUSINESS GMBH", "confidence": null, "verified": null } ], "invoice_dueDate": [], "invoice_invoiceCurrency": [ { "value": "EUR", "originalValue": "€", "confidence": null, "verified": true } ], "vendor_bankName": [], "invoice_deliveryDate": [], "invoice_invoiceDate": [ { "value": "10.08.2017", "originalValue": "10.08.2017", "confidence": 0.9015204, "verified": true }, { "value": "31.08.2017", "originalValue": "31.08.2017", "confidence": 0.8834753, "verified": true } ], "invoice_taxRateGroup_taxRate": [ { "value": "190", "originalValue": "190%", "confidence": null, "verified": null } ], "vendor_street": [ { "value": "WANKELSTRASSE 5", "originalValue": "WANKELSTRASSE 5", "confidence": null, "verified": null } ], "invoice_invoiceGrossAmount": [ { "value": "88.05", "originalValue": "88,05", "confidence": 0.46333426, "verified": true }, { "value": "73.99", "originalValue": "73,99", "confidence": 0.45646423, "verified": false }, { "value": "4.99", "originalValue": "4,99", "confidence": null, "verified": false } ] }, "groups": { "taxRates": [ { "members": { "invoice_taxRateGroup_taxAmount": { "value": "14.06", "originalValue": "14,06", "confidence": null, "verified": true }, "invoice_taxRateGroup_taxRate": { "value": "190", "originalValue": "190%", "confidence": null, "verified": null } }, "verified": null } ], "items": [] }, "text": "MEDIA MARKT E-BUSINESS GMBH “(\n\nWANKELSTRASSE 5 -\n\n85046 INGOLSTADT\n\nTel.: 0841/6344545\n\nE-Mail: ONLINESHOP@MEDIAMARKT.DE\n\nRechnungsadresse Rechnung Nr. 458001350\n\nDaniel Winter Rechnungsdatum 10.08.2017\n\nRudolf-Harbig-Weg 26\n\n48149 Münster Kunden-Nr. 3050789\n\nFällig Am 31.08.2017\n\nRechnung Betrag €88,05\n\nMenge Beschreibung Einzelpreis Gesamtpreis\n\n1 PIXMA MX475 A4 MFP INJEKT (P) 69,00 69,00\n\n1 Versandkosten 4,99 4,99\nSumme Netto 73,99\nMwSt. 190% 14,06\n\n", "filename": "Demo.pdf", "hocr": "<body xmlns=\"http://www.w3.org/1999/xhtml\"> <div class=\"ocr_page\" id=\"page_1\" title=\"ppageno 0; bbox 0 0 1230 1740; image \"None\"; textangle 0\"> <span class=\"ocrx_word\" id=\"word_1_0\" title=\"bbox 794 55 924 70 x_wconf 95.0 baseline -0.004 0\">MEDIA</span> <span class=\"ocrx_word\" id=\"word_1_1\" title=\"bbox 935 61 941 66 x_wconf 69.0 baseline 0.005 0\">MARKT</span> <span class=\"ocrx_word\" id=\"word_1_2\" title=\"bbox 954 56 1044 70 x_wconf 96.0 baseline 0.006 0\">E-BUSINESS</span> <span class=\"ocrx_word\" id=\"word_1_3\" title=\"bbox 1055 61 1061 66 x_wconf 49.0 baseline 0.0 0\">GMBH</span> <span class=\"ocrx_word\" id=\"word_1_4\" title=\"bbox 1074 55 1162 70 x_wconf 96.0 baseline -0.006 0\" x_ner_id=\"97d20996-340d-4f43-9e5f-e73277ff9ac6_1\" x_ner_type=\"vendor_street\">WANKELSTRASSE</span> <span class=\"ocrx_word\" id=\"word_1_5\" title=\"bbox 57 60 98 94 x_wconf 92.0 baseline 0.0 0\" x_ner_id=\"97d20996-340d-4f43-9e5f-e73277ff9ac6_2\" x_ner_type=\"vendor_street\">5</span> [...] </div> </body>" }
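Because an entity may carry several candidate values (see "invoice_invoiceDate" in the response above), a client usually has to pick one. The sketch below selects the candidate with the highest confidence per entity. The helper best_entity_values is hypothetical, and ranking a null confidence below every numeric score is an assumption; the API description does not define what a null confidence means.

```python
def best_entity_values(response):
    """Map each non-empty entity to the value of its highest-confidence candidate."""

    def score(candidate):
        confidence = candidate.get("confidence")
        # Assumption: a null confidence ranks below any numeric confidence.
        return confidence if confidence is not None else -1.0

    best = {}
    for name, candidates in response.get("entities", {}).items():
        if not candidates:
            continue  # entity type supported, but nothing was extracted
        best[name] = max(candidates, key=score)["value"]
    return best
```

A stricter variant could additionally prefer candidates whose "verified" attribute is true.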
Used to query the state of multiple jobs.
content-type: HTTP content type
supported values: "application/json"
required: yes
customer-id: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
The request body is a single JSON object with the following fields:
jobIds : list of job ids received from calling POST /document
type: list
required: yes
200: Query finished successfully.
400: Bad request. Missing or invalid input parameter.
401: Authorization failed. Operation not allowed.
403: Authorization failed due to invalid credentials.
429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.
content-type: HTTP content type
supported values: "application/json"
jobs: map containing job ids from request as keys and job states as values
type: Map
supported job states: [ "PROCESSING" | "FINISHED" | "UNKNOWN" ]
POST /jobs/query headers {"x-api-key": , "customer-id": } body { "jobIds": [ "229cdae0162805414755d5ee7eed216bc975738c", "ab4dcb09e196acd6d859f571b97d94d8dc7fae57", "1848c3ed29f03ade5cf29c31d7b6dc0665c4d836" ] }
{ "jobs": { "229cdae0162805414755d5ee7eed216bc975738c": "FINISHED", "ab4dcb09e196acd6d859f571b97d94d8dc7fae57": "PROCESSING", "1848c3ed29f03ade5cf29c31d7b6dc0665c4d836": "UNKNOWN" } }
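The request and response bodies above map directly onto two small helpers: one that serialises the job ids and one that groups the returned jobs by state. Both function names are hypothetical and only illustrate the JSON shapes shown in the example.

```python
import json


def build_jobs_query(job_ids):
    """Serialise the request body for POST /jobs/query."""
    return json.dumps({"jobIds": list(job_ids)})


def group_jobs_by_state(response_text):
    """Group the job ids of a POST /jobs/query response by their state."""
    grouped = {"PROCESSING": [], "FINISHED": [], "UNKNOWN": []}
    for job_id, state in json.loads(response_text)["jobs"].items():
        # setdefault keeps the helper usable if a further state appears.
        grouped.setdefault(state, []).append(job_id)
    return grouped
```

The ids listed under "FINISHED" can be passed to GET /entities/ one by one, while those under "PROCESSING" can be queried again later.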
Used to request a URL for the upload of large document files (> 4 MB). After uploading your document you must use endpoint POST /document to start document processing.
content-type: HTTP content type
supported values: "application/json"
required: yes
customer-id: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
200: Successfully generated upload URL.
401: Authorization failed. Operation not allowed.
403: Authorization failed due to invalid credentials.
content-type: HTTP content type
supported values: "application/json"
uploadUrl: URL for uploading your document file using HTTP REST. Each URL must only be used once.
uploadId: upload id that must be passed as parameter "uploadId" in POST /document endpoint for processing your uploaded document file
GET /uploadurl headers {"x-api-key": , "customer-id": }
{ "uploadUrl": , "uploadId": "ac9e77e0-13cd-4d81-a4f1-9ba88b52899d" }
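The two upload paths (direct upload up to 4 MB, upload URL for larger files) can be combined into one flow. The following sketch leaves the HTTP calls to three caller-supplied functions, so it makes no statement about how the file is transferred to the upload URL. Assumptions: the name upload_document is hypothetical, "4 MB" is read as 4 * 1024 * 1024 bytes, and POST /document is called with "uploadId" alone after a large upload.

```python
import json

DIRECT_UPLOAD_LIMIT = 4 * 1024 * 1024  # assumption: "4 MB" interpreted as 4 MiB


def upload_document(data, get_upload_url, send_file, post_document):
    """Start processing of `data` and return the job id.

    Caller-supplied functions (all hypothetical):
      get_upload_url()      -> response text of GET /uploadurl
      send_file(url, data)  -> transfers the file to the upload URL
      post_document(form)   -> response text of POST /document
    """
    if len(data) <= DIRECT_UPLOAD_LIMIT:
        # Small file: send it directly in the "document" form parameter.
        form = {"document": data}
    else:
        # Large file: request a one-time upload URL, transfer the file,
        # then reference it via "uploadId" when starting the processing.
        ticket = json.loads(get_upload_url())
        send_file(ticket["uploadUrl"], data)
        form = {"uploadId": ticket["uploadId"]}
    return json.loads(post_document(form))["jobId"]
```

Since each upload URL and each upload id must only be used once, a failed transfer should start again with a fresh GET /uploadurl call.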
Used to upload training samples for training of the extraction models.
content-type: HTTP content type
supported values: "multipart/form-data"
required: yes
customer-id: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
x-api-key: part of your credentials which you receive upon registration for the Entity Extraction API
required: yes
The request body includes the following list of form parameters:
documentClass : document class id of training document
required: yes
supported values: [ "INVOICE" | "CONTRACT" ]
language : language id of the training document
required: yes
supported values: [ "en" | "de" ]
text : plain text from training document
required: yes
document: training document file
supported file types: [ pdf | tiff | jpg ]
required: yes
entities : entities from training document
required: yes
Must only include supported entities for your document class. Entities may be omitted or may include empty values (i.e., an empty array). Each entity must contain a single attribute "value" that contains the training value as a String.
hocr : training document representation in XML-based hOCR format; the original hOCR output for the document file is returned in field "hocr" in the response from GET /entities. The official hOCR specification is available at: https://github.com/kba/hocr-spec
required: yes
All entities need to be marked in the hOCR by assigning the attribute "x_ner_type" to the corresponding "ocrx_word" items. Each "ocrx_word" item with attribute "x_ner_type" needs an additional attribute "x_ner_id" with a unique id. Entities consisting of multiple words need to use "x_ner_id" values with consecutive suffixes.
uploadId : required for uploading document files > 4 MB: upload id returned from endpoint GET /uploadurl; each upload id must only be used once
required: no
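The marking rule for the hOCR parameter can be illustrated with a helper that produces the attribute pair for every word of one entity. The function ner_attributes is hypothetical. The id format (a UUID followed by "_1", "_2", ...) is inferred from the example data in this documentation and is an assumption, not a documented requirement.

```python
import uuid


def ner_attributes(entity_type, word_count, entity_id=None):
    """Return one 'x_ner_id ... x_ner_type ...' attribute string per word.

    All words of the entity share one base id and receive consecutive
    suffixes starting at 1 (assumed from the example hOCR).
    """
    entity_id = entity_id or str(uuid.uuid4())
    return [
        f'x_ner_id="{entity_id}_{index}" x_ner_type="{entity_type}"'
        for index in range(1, word_count + 1)
    ]
```

For the two-word street "WANKELSTRASSE 5", the helper yields two attribute strings that are added to the two corresponding "ocrx_word" items, in reading order.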
200: Successfully submitted train data.
204: Train data empty.
400: Bad request. Missing or invalid input parameter.
401: Authorization failed. Operation not allowed.
403: Authorization failed due to invalid credentials.
429: Too many requests. The overall number of requests to all REST endpoints of the Entity Extraction API must not exceed 100 requests/s for each customer.
content-type: HTTP content type
supported values: "application/json"
errorMsg: error description; null on success
POST /training headers {"x-api-key": , "customer-id": } body { "documentClass": "INVOICE", "language": "de", "text": "MEDIA MARKT E-BUSINESS GMBH “(\n\nWANKELSTRASSE 5 -\n\n85046 INGOLSTADT\n\nTel.: 0841/6344545\n\nE-Mail: ONLINESHOP@MEDIAMARKT.DE\n\nRechnungsadresse Rechnung Nr. 458001350\n\nDaniel Winter Rechnungsdatum 10.08.2017\n\nRudolf-Harbig-Weg 26\n\n48149 Münster Kunden-Nr. 3050789\n\nFällig Am 31.08.2017\n\nRechnung Betrag €88,05\n\nMenge Beschreibung Einzelpreis Gesamtpreis\n\n1 PIXMA MX475 A4 MFP INJEKT (P) 69,00 69,00\n\n1 Versandkosten 4,99 4,99\nSumme Netto 73,99\nMwSt. 190% 14,06\n\n", "entities": { "vendor_city": [ { "value": "INGOLSTADT" } ], "vendor_zip": [ { "value": "85046" } ], "recipient_street": [ { "value": "Rudolf-Harbig-Weg 26" } ], "invoice_invoiceNumber": [ { "value": "458001350" } ], "invoice_taxRateGroup_taxAmount": [ { "value": "14,06" } ], "recipient_city": [ { "value": "Münster" } ], "recipient_zip": [ { "value": "48149" } ], "vendor_name": [ { "value": "MEDIA MARKT E-BUSINESS GMBH" } ], "invoice_invoiceCurrency": [ { "value": "€" } ], "invoice_invoiceDate": [ { "value": "10.08.2017" } ], "invoice_taxRateGroup_taxRate": [ { "value": "19,0" } ], "vendor_street": [ { "value": "WANKELSTRASSE 5" } ], "invoice_invoiceGrossAmount": [ { "value": "88,05" } ] }, "hocr": "<body xmlns=\"http://www.w3.org/1999/xhtml\"> <div class=\"ocr_page\" id=\"page_1\" title=\"ppageno 0; bbox 0 0 1230 1740; image \"None\"; textangle 0\"> <span class=\"ocrx_word\" id=\"word_1_0\" title=\"bbox 794 55 924 70 x_wconf 95.0 baseline -0.004 0\">MEDIA</span> <span class=\"ocrx_word\" id=\"word_1_1\" title=\"bbox 935 61 941 66 x_wconf 69.0 baseline 0.005 0\">MARKT</span> <span class=\"ocrx_word\" id=\"word_1_2\" title=\"bbox 954 56 1044 70 x_wconf 96.0 baseline 0.006 0\">E-BUSINESS</span> <span class=\"ocrx_word\" id=\"word_1_3\" title=\"bbox 1055 61 1061 66 x_wconf 49.0 baseline 0.0 0\">GMBH</span> <span class=\"ocrx_word\" id=\"word_1_4\" title=\"bbox 1074 55 1162 70 x_wconf 96.0 baseline -0.006 0\" x_ner_id=\"97d20996-340d-4f43-9e5f-e73277ff9ac6_1\" x_ner_type=\"vendor_street\">WANKELSTRASSE</span> <span class=\"ocrx_word\" id=\"word_1_5\" title=\"bbox 57 60 98 94 x_wconf 92.0 baseline 0.0 0\" x_ner_id=\"97d20996-340d-4f43-9e5f-e73277ff9ac6_2\" x_ner_type=\"vendor_street\">5</span> [...] </div> </body>" }
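The "entities" structure in the request above wraps every training value in an object with a single "value" attribute. A small helper can build that structure from plain strings. The name build_training_entities is hypothetical, and sending the result as a JSON string in the "entities" form parameter is an assumption based on the example.

```python
import json


def build_training_entities(values_by_entity):
    """Turn {"entity_name": ["value", ...]} into the "entities" training payload.

    Every training value becomes {"value": "<string>"}; an empty list is
    kept as an empty array, which the description above permits.
    """
    payload = {
        name: [{"value": str(value)} for value in values]
        for name, values in values_by_entity.items()
    }
    # ensure_ascii=False keeps characters such as "ü" and "€" readable.
    return json.dumps(payload, ensure_ascii=False)
```

Note that the training values in the example use the original document notation (e.g., "14,06" and "€"), not the normalised form returned in the "value" attribute by GET /entities.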