4 changes: 2 additions & 2 deletions README.md
@@ -44,7 +44,7 @@ set JSL_OCR_LICENSE=license_key
* Select `SparkOcrSimpleExample.ipynb` notebook.
* Set `secret` and `license` variables to valid values in first cell.
* Run all cells: Runtime -> Run all.
* Restart runtime: Runtime -> Resturt runtime (Need restart first time after installing new packages).
* Restart runtime: Runtime -> Restart runtime (a restart is needed the first time after installing new packages).
* Run all cells again.

### Run notebooks locally using jupyter
@@ -66,5 +66,5 @@ jupyter-notebook
* Open `jupyter/SparkOcrSimpleExample.ipynb` notebook.
* Set `secret` and `license` variables to valid values in first cell.
* Run all cells: Cell -> Run all.
* Restart runtime: Kernel -> Resturt (Need restart first time after installing new packages).
* Restart runtime: Kernel -> Restart (a restart is needed the first time after installing new packages).
* Run all cells again.
262 changes: 1 addition & 261 deletions databricks/python/SparkOcrPdfProcessing.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion databricks/python/SparkOcrSimpleExample.ipynb

Large diffs are not rendered by default.

25 changes: 18 additions & 7 deletions databricks/scala/SparkOcrSimpleExample.scala
@@ -30,15 +30,26 @@ def pipeline() = {
val binaryToImage = new BinaryToImage()
.setInputCol("content")
.setOutputCol("image")

val transformer = new GPUImageTransformer()
.addHuangTransform()
.addScalingTransform(2)
.addDilateTransform(2,2)
.addErodeTransform(2,2)
.setInputCol("image")
.setOutputCol("transformed_image")

// Run OCR
val ocr = new ImageToText()
.setInputCol("image")
.setInputCol("transformed_image")
.setOutputCol("text")
.setConfidenceThreshold(65)

.setModelType("best")
.setLanguage("eng")

new Pipeline().setStages(Array(
binaryToImage,
transformer,
ocr
))
}
@@ -50,25 +61,25 @@ def pipeline() = {
// COMMAND ----------

// MAGIC %sh
// MAGIC OCR_DIR=/dbfs/tmp/ocr
// MAGIC OCR_DIR=/dbfs/tmp/ocr_1
// MAGIC if [ ! -d "$OCR_DIR" ]; then
// MAGIC mkdir $OCR_DIR
// MAGIC cd $OCR_DIR
// MAGIC wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/ocr/datasets/images.zip
// MAGIC unzip images.zip
// MAGIC wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/ocr/datasets/news.2B.0.png.zip
// MAGIC unzip news.2B.0.png.zip
// MAGIC fi
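The `%sh` cell above guards the download with a directory check so the dataset is fetched and unpacked only once per cluster. The same idempotent-download pattern can be sketched in plain shell; the directory below is a placeholder for illustration, not the real `/dbfs/tmp/ocr_1` path, and the network calls are elided:

```shell
# Idempotent download-and-unpack sketch.
# OCR_DIR is a placeholder; the notebook cell uses /dbfs/tmp/ocr_1
# and fetches news.2B.0.png.zip from the S3 URL shown above.
OCR_DIR=/tmp/ocr_demo
if [ ! -d "$OCR_DIR" ]; then
  mkdir -p "$OCR_DIR"   # -p also creates missing parent directories
  cd "$OCR_DIR"
  # wget <dataset-url>  # fetch the archive (elided here)
  # unzip <archive>.zip # unpack into $OCR_DIR
fi
echo "dataset dir ready: $OCR_DIR"
```

Re-running the cell is then safe: once the directory exists, the download and unzip steps are skipped. Note that the notebook cell uses plain `mkdir`, which fails if the parent path is missing; `mkdir -p` avoids that edge case.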

// COMMAND ----------

display(dbutils.fs.ls("dbfs:/tmp/ocr/images/"))
display(dbutils.fs.ls("dbfs:/tmp/ocr_1/0/"))

// COMMAND ----------

// MAGIC %md ## Read images as binary files from DBFS

// COMMAND ----------

val imagesPath = "/tmp/ocr/images/*.tif"
val imagesPath = "/tmp/ocr_1/0/*.png"
val imagesExampleDf = spark.read.format("binaryFile").load(imagesPath).cache()
display(imagesExampleDf)

411 changes: 411 additions & 0 deletions jupyter/SparkOcrImageTableCellRecognition.ipynb

Large diffs are not rendered by default.

375 changes: 375 additions & 0 deletions jupyter/SparkOcrImageTableDetection.ipynb

Large diffs are not rendered by default.

598 changes: 598 additions & 0 deletions jupyter/SparkOcrImageTableRecognition.ipynb

Large diffs are not rendered by default.

Binary file added jupyter/data/tab_images/cTDaR_t10011.jpg
Binary file added jupyter/data/tab_images/cTDaR_t10168.jpg