Py4JError: An error occurred while calling o9368.fit #14375

Closed
1 task done
NSManogna opened this issue Aug 20, 2024 · 5 comments

@NSManogna

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am training NerDLApproach for custom entities. When I increase the size of the training data, I get the error message Py4JError: An error occurred while calling o9368.fit, and the connection is refused.

Current Behavior

I get the error message Py4JError: An error occurred while calling o9368.fit, and the connection is refused.

Expected Behavior

Model training should complete, and the trained model should then be usable for NER on new text.

Steps To Reproduce

CoNll.zip

Spark NLP version and Apache Spark

I launched johnsnowlab on an EC2 instance of type m5.2xlarge.

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

Spark NLP in johnsnowlab

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Please let me know if any further information is needed.

@maziyarpanahi
Member

Could you please provide the actual code you used to start SparkSession, the pipeline, so we can reproduce it?

@NSManogna
Author

The zip file I attached contains an .ipynb file with the code.

@maziyarpanahi
Member

Please include the code here or on Google Colab. We are not allowed to download and open zip files for security reasons.

You just need to follow the template, nothing more and nothing less. The issue template is designed based on years of experience.

@NSManogna
Author

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

POSTag = PerceptronModel.pretrained() \
    .setInputCols("document", "token") \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols("sentence", "pos") \
    .setOutputCol("chunk")

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setLr(0.001) \
    .setPo(0.005) \
    .setBatchSize(8) \
    .setDropout(0.5) \
    .setValidationSplit(0.2)

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities")

c_pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    POSTag
])

import pandas as pd
import ast
from pyspark.sql.functions import explode, col

#df = spark.read.csv("pii_dataset.csv", header=True, inferSchema=True)

df=pd.read_csv("pii_dataset.csv")
#df = df.head(1000)

df1 = spark.createDataFrame(df)

f_model=c_pipeline.fit(df1)
result = f_model.transform(df1)

#result.select( explode(col("chunk.result")).alias("chunk_tag")).show(truncate=False)

df_new = df1.join(result.select("text", "pos.result"), on="text", how="left")
df_new = df_new.withColumnRenamed("result", "pos_tags")

#df_new1 = df_new.join(result.select("text", "chunk.result"), on="text", how="left")
#df_new1 = df_new1.withColumnRenamed("result", "chunks")

import ast

df_new2=df_new.toPandas()
df_new2['tokens'] = df_new2['tokens'].apply(ast.literal_eval)
df_new2['labels'] = df_new2['labels'].apply(ast.literal_eval)

selected_df=spark.createDataFrame(df_new2)
rows_as_dicts = selected_df.rdd.map(lambda row: row.asDict()).collect()

def convert_to_conll(sentences):
    conll_lines = []
    for sentence in sentences:
        tokens, labels, pos_tags = sentence['tokens'], sentence['labels'], sentence['pos_tags']
        for token, label, pos_tag in zip(tokens, labels, pos_tags):
            conll_lines.append(f"{token} {pos_tag} \t_ {label}")
        conll_lines.append("")  # Blank line to separate sentences
    return "\n".join(conll_lines)

conll_data = convert_to_conll(rows_as_dicts)

with open('annotations.conll', 'w') as file:
    file.write(conll_data)

print("Dataset converted to CoNLL format and saved as 'annotations.conll'.")

nerpipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

from sparknlp.training import CoNLL

conll_instance = CoNLL()

training_data = conll_instance.readDataset(spark=spark, path ='annotations.conll')

model = nerpipeline.fit(training_data)
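
The CoNLL conversion step in the code above can be checked in isolation, without Spark. The following is only a minimal sketch, not the code from the notebook: `iter_conll_lines` and `write_conll` are hypothetical helper names, and the `token pos _ label` column layout with `_` as a placeholder chunk column is an assumption about the intended file format. It writes the file line by line rather than joining everything into one string, and it raises an error when the `tokens`, `pos_tags`, and `labels` lists of a row differ in length, which `zip` would otherwise silently truncate.

```python
from typing import Iterable, Iterator, Mapping, Sequence


def iter_conll_lines(
    sentences: Iterable[Mapping[str, Sequence[str]]],
) -> Iterator[str]:
    """Yield CoNLL-style lines one at a time (hypothetical helper)."""
    for index, sentence in enumerate(sentences):
        tokens = sentence["tokens"]
        pos_tags = sentence["pos_tags"]
        labels = sentence["labels"]
        # zip() stops at the shortest input, so check the lengths explicitly.
        if not (len(tokens) == len(pos_tags) == len(labels)):
            raise ValueError(
                f"row {index}: {len(tokens)} tokens, "
                f"{len(pos_tags)} POS tags, {len(labels)} labels"
            )
        for token, pos_tag, label in zip(tokens, pos_tags, labels):
            # Assumed layout: token, POS tag, placeholder chunk column, label.
            yield f"{token} {pos_tag} _ {label}"
        # Blank line to separate sentences.
        yield ""


def write_conll(
    sentences: Iterable[Mapping[str, Sequence[str]]], path: str
) -> None:
    """Write the lines to `path` without building one large string."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in iter_conll_lines(sentences):
            handle.write(line + "\n")
```

Under these assumptions, `write_conll(rows_as_dicts, "annotations.conll")` would take the place of the `convert_to_conll` call and the `file.write` step. Whether this has any effect on the Py4JError itself is not established in this thread.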


This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions bot added the Stale label on Feb 18, 2025
github-actions bot closed this as not planned on Mar 4, 2025