May 2, 2023

Model Building with ChatGPT: Myth or Reality?


ChatGPT has captured the public’s attention with its myriad use cases, ranging from idea generation and marketing content creation to code debugging and data analysis.

However, amid the generative AI gold rush, several questions remain, especially about the risks and rewards of the technology. Alongside the real-world value of ChatGPT that left so many people in awe, the more provocative implications of AI advancements are yet to be uncovered.

For example, a large and looming question remains:

What will happen if ChatGPT puts power in the hands of those who are neither prepared for nor interested in using this technology responsibly?

After all, although ChatGPT wasn’t specifically designed to make programming accessible to the general public, it’s known to be able to generate code, and even malware when prompted by malicious actors on the dark web.

But how about Machine Learning? Can ChatGPT be leveraged to build a viable ML model and potentially serve as an ML engineer? Is it capable of contributing significantly to the future of Machine Learning?

Our goal was to seek some answers, and here’s the approach we took.

Using ChatGPT to build a classification model

After giving it some thought, we decided to run an experiment to investigate how well ChatGPT can handle the task of building a classification model for predicting loan default.

The dataset used for this purpose was the “German Credit” dataset, a well-known public dataset for loan default classification problems.

We started with a simple, general instruction, asking ChatGPT for code that trains a model on that dataset. We then gradually asked it to update the code to improve performance or to solve specific issues.

Note: Each block of code only shows the new or updated parts from the previously generated responses.

Prompt 1

Can you give me a snippet of python code that trains a classification model on the publicly available German Credit loan default dataset?

ChatGPT

Sure! Here’s a snippet of Python code that trains a classification model on the German Credit loan default dataset using scikit-learn:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'
columns = ['checking_account', 'duration', 'credit_history', 'purpose', 'credit_amount', 'savings_account', 'employment', 'installment_rate', 'personal_status', 'other_debtors', 'residence_since', 'property', 'age', 'other_installment_plans', 'housing', 'existing_credits', 'job', 'num_dependents', 'phone', 'foreign_worker', 'class']
data = pd.read_csv(url, sep=' ', header=None, names=columns)

# Encode categorical features
categorical_features = ['checking_account', 'credit_history', 'purpose', 'savings_account', 'employment', 'personal_status', 'other_debtors', 'property', 'other_installment_plans', 'housing', 'job', 'phone', 'foreign_worker']
le = LabelEncoder()
for feature in categorical_features:
   data[feature] = le.fit_transform(data[feature])

# Split the data into training and testing sets
X = data.drop('class', axis=1)
y = data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the testing set and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Result

The code ran without any problems and produced this output:

Accuracy: 0.73

The code provided by ChatGPT did a good job of training a simple DecisionTreeClassifier on the specified dataset. It also knew where to download the data from and what its features were. Great!

However, it’s worth mentioning that, even though encoding the categorical features with a LabelEncoder inside a for loop worked, LabelEncoder is intended for target labels. The proper approach would have been to use an OrdinalEncoder and encode all categorical features at once.
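
As a minimal sketch of that alternative, the snippet below encodes several categorical columns in a single call with OrdinalEncoder. The toy DataFrame and its values are made up for illustration; only the column names are borrowed from the code above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the dataset (values are illustrative, not real rows)
toy = pd.DataFrame({
    "checking_account": ["A11", "A12", "A14", "A11"],
    "housing": ["A152", "A151", "A152", "A153"],
    "duration": [6, 48, 12, 42],
})
categorical_features = ["checking_account", "housing"]

# One encoder handles every categorical column in a single call
encoder = OrdinalEncoder()
toy[categorical_features] = encoder.fit_transform(toy[categorical_features])

# encoder.categories_ keeps the learned categories, one array per column
print(toy)
print(encoder.categories_)
```

Unlike the loop, which refits the same LabelEncoder for each column and therefore keeps only the last mapping, the single encoder retains the categories of every column, so the same mapping can be reapplied to new data.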

Without sharing the results of the trained model, we asked ChatGPT to improve its performance, in the hope of getting a more sophisticated training approach. Here’s how it went.

Prompt 2

Can you improve the performance of the model?

ChatGPT

Yes, there are several ways to potentially improve the performance of the model. Here are a few suggestions:

  1. Feature engineering: Consider creating new features from existing ones or transforming existing features to better capture the relationships between them and the target variable. For example, you could create a new feature that combines the duration and installment_rate features to capture the total amount paid back per installment.
  2. Hyperparameter tuning: Try different hyperparameter values for the decision tree classifier and see if you can improve its performance. You can use scikit-learn’s GridSearchCV or RandomizedSearchCV to automate this process.
  3. Ensemble methods: Instead of using a single decision tree classifier, you could try using ensemble methods such as random forests or gradient boosting, which can often improve performance.
  4. Data preprocessing: Consider scaling the numerical features, handling missing values, or removing outliers to improve the quality of the data and the performance of the model.

Here’s an updated code snippet that includes some of these suggestions:

# Create new feature combining duration and installment_rate
data['total_paid'] = data['duration'] * data['installment_rate']

# Scale the numerical features
scaler = StandardScaler()
numerical_features = ['duration', 'credit_amount', 'installment_rate', 'residence_since', 'age', 'existing_credits','num_dependents', 'total_paid']
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [None, 5, 10, 20]
}
clf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
clf = grid_search.best_estimator_

# Make predictions on the testing set and calculate accuracy and confusion matrix
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)
print('Accuracy:', accuracy)
print('Confusion matrix:\n', conf_mat)

  5. Imbalanced classes: If the dataset has imbalanced classes, you can try using techniques such as oversampling or undersampling to balance the classes and improve the performance of the model.
  6. More advanced models: You could also try using more advanced models such as neural networks or support vector machines (SVMs) to see if they can improve the performance of the model.

These are just a few suggestions for improving the performance of the model. Depending on the specific problem and dataset, there may be other techniques that are more effective.

Result

As you can see above, ChatGPT offered many suggestions for how we could improve model performance, such as hyperparameter tuning, ensemble methods, data preprocessing, addressing class imbalance and using more advanced models.

This shows it has knowledge of the general approach to training a classifier.

Compared to the previously provided code, this time it added a StandardScaler for the numerical features, replaced the DecisionTreeClassifier with a RandomForestClassifier, and added a grid search to find better hyperparameters for the classifier.

This was the output when running the code:

Accuracy: 0.79
Confusion matrix:
  [[131  10]
  [ 32  27]]

As you can see, this newly trained classifier has a better performance than the first one when looking at the accuracy metric.
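
Accuracy alone hides how the errors are distributed between the classes. As a quick check, the sketch below recomputes accuracy and per-class recall from the confusion matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
conf_mat = np.array([[131, 10],
                     [32, 27]])

# Correct predictions sit on the diagonal
accuracy = np.trace(conf_mat) / conf_mat.sum()

# Recall per class: correct predictions divided by the number of true samples in that class
recall_per_class = np.diag(conf_mat) / conf_mat.sum(axis=1)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall per class: {np.round(recall_per_class, 2)}")
```

Under that convention, the second class is recovered in fewer than half of its 59 test samples, even though overall accuracy improved.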

After confirming that ChatGPT can provide code that performs the necessary data preprocessing and runs a grid search to find better hyperparameters, we asked it to change the code so that the data preprocessing and the inference are done inside a pipeline.

Prompt 3

Without doing a grid search, can you give me a snippet of code that does all the necessary preprocessing steps for this specific dataset and trains a model for classification? Also, can you do the preprocessing and the inference inside of a pipeline?

ChatGPT

Sure! Here’s an example code snippet that preprocesses the data and trains a random forest classifier using a pipeline:

categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numerical_transformer = StandardScaler()
preprocessor = ColumnTransformer(
   transformers=[
       ('cat', categorical_transformer, categorical_features),
       ('num', numerical_transformer, numerical_features)
   ])

# Train the classifier using a pipeline
clf = Pipeline(steps=[
   ('preprocessor', preprocessor),
   ('classifier', RandomForestClassifier(random_state=42))
])

clf.fit(X_train, y_train)

In this code, the preprocessor transformer applies one-hot encoding to the categorical features and scaling to the numerical features using the ColumnTransformer class.

The clf pipeline applies the preprocessor transformer and trains a random forest classifier. The fit() method is used to train the classifier on the training set, and the predict() method is used to make predictions on the testing set.

The accuracy and confusion matrix are then calculated using the accuracy_score() and confusion_matrix() functions from scikit-learn.

Result

The updated code moved the preprocessing of the numerical and categorical features into a ColumnTransformer, which was then combined with the classifier in a Pipeline, successfully completing the task.

Our next step was to compute the classification report for the resulting model, in order to see the actual performance on each class.

              precision    recall   f1-score   support
           1       0.81      0.93       0.87       141
           2       0.74      0.49       0.59        59

    accuracy                            0.80       200
   macro avg       0.78      0.71       0.73       200
weighted avg       0.79      0.80       0.79       200

Comparing the f1-scores of the two classes (0.87 versus 0.59), we noticed lower performance on the positive class (label 2), so we asked the chatbot to improve performance for that class to see what approach it would take.

Prompt 4

The performance for the positive class is lower than the performance for the negative class. How can the code be modified to mitigate that?

ChatGPT

One way to address the issue of imbalanced class distribution and improve the performance of the positive class is to use class weighting in the random forest classifier. This can be done by setting the class_weight parameter of the RandomForestClassifier to 'balanced'.

Here’s the modified code snippet:

clf = Pipeline(steps=[
   ('preprocessor', preprocessor),
   ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced'))
])

In this code, the RandomForestClassifier is initialized with the class_weight parameter set to 'balanced', which assigns weights inversely proportional to the class frequencies. This means that the positive class will have a higher weight than the negative class, which should help to improve the performance of the positive class.

The classification_report() function from scikit-learn is used to print the precision, recall, and F1-score for both classes.

Result

For this task, the only change to the previously provided code was setting the class_weight parameter of RandomForestClassifier to “balanced”, which gives a higher weight to the positive class when the training set contains fewer samples of it than of the negative class.
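
To make the “balanced” heuristic concrete, the sketch below computes the weights scikit-learn assigns, which follow n_samples / (n_classes * count_per_class). The 700/300 label split is an assumption chosen for illustration; it is not the exact training split used in the experiment.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 700 samples of class 1 and 300 of class 2 (assumed split)
y = np.array([1] * 700 + [2] * 300)
classes = np.unique(y)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same formula by hand; bincount index 0 is unused because labels start at 1
manual = len(y) / (len(classes) * np.bincount(y)[1:])

print(dict(zip(classes.tolist(), np.round(weights, 3).tolist())))
```

With these assumed counts, the minority class receives a weight a little over twice that of the majority class, so its misclassifications count more during training.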

Even though this is a sound strategy for an imbalanced dataset like the one we used, the f1-score for the positive class actually decreased, from 0.59 to 0.47.

              precision    recall   f1-score   support
           1       0.78      0.94       0.85       141
           2       0.70      0.36       0.47        59

    accuracy                            0.77       200
   macro avg       0.74      0.65       0.66       200
weighted avg       0.75      0.77       0.74       200

Hence, we tried asking for other ways of increasing performance without any oversampling techniques.

Prompt 5

The performance difference between the two classes is still high. What else can be done to solve that other than oversampling?

ChatGPT

If class weighting does not improve the performance of the positive class enough, there are several other techniques that can be used to address class imbalance without oversampling. Here are a few options:

  1. Using different performance metrics: Accuracy can be a misleading metric when dealing with imbalanced classes, as it tends to favor the majority class. Instead, metrics such as precision, recall, and F1-score should be used to evaluate the model’s performance on each class separately. This can provide a better understanding of how well the model is performing for each class and can help to identify where the model needs improvement.
  2. Threshold adjustment: The threshold used for classification can be adjusted to better balance the precision and recall of the positive class. By default, scikit-learn’s predict() method uses a threshold of 0.5, meaning that any predicted probability above 0.5 is classified as positive. However, this threshold can be adjusted to a higher or lower value, depending on the desired trade-off between precision and recall.
  3. Cost-sensitive learning: This approach involves assigning different costs to different types of errors. For example, misclassifying a positive instance as negative may be more costly than the opposite. By adjusting the costs associated with each type of error, the model can be trained to minimize the total cost of misclassification.
  4. Ensembling: Ensembling techniques such as bagging and boosting can be used to improve the performance of the model on the positive class. By combining multiple models, each trained on a different subset of the data or using a different algorithm, the model can capture a wider range of patterns in the data and achieve better performance.

Here’s an example of how to adjust the classification threshold:

y_pred_prob = clf.predict_proba(X_test)[:, 1]
y_pred = (y_pred_prob >= 0.4).astype(int) # Adjust threshold to 0.4

In this code, the predict_proba() method is used to get the predicted probabilities of the positive class. The threshold for classification is then set to 0.4, meaning that any predicted probability above 0.4 is classified as positive. This threshold can be adjusted to achieve the desired trade-off between precision and recall.

Result

The code was modified so that after training, the predicted probabilities given by the model for the test set are used together with a threshold of 0.4 to obtain new predictions for the data. The response also contained a correct explanation of how the value of the threshold can provide a trade-off between precision and recall.
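
The trade-off can be illustrated with a few hand-picked probabilities. The labels and scores below are invented for the example; lowering the threshold from 0.5 to 0.4 flags more samples as positive, which raises recall and, in this toy case, lowers precision.

```python
import numpy as np

# Invented ground truth (1 = positive) and predicted positive-class probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.45, 0.3, 0.55, 0.42, 0.35, 0.2, 0.1, 0.05])


def precision_recall(threshold):
    """Return (precision, recall) of the positive class at the given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)


for t in (0.5, 0.4):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```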

Although this was correct, we wanted the threshold adjustment to happen inside the prediction pipeline, and this is where ChatGPT began having problems providing code that would run without errors and achieve the desired outcome.

Prompt 6

Can you modify the code so that the threshold adjustment is inside of the pipeline?

ChatGPT

Sure! Here’s an example of how to adjust the classification threshold inside a pipeline:

In this code, the classification threshold is set to 0.4 inside the pipeline using a custom transformer. Here’s the code for the transformer:

from sklearn.base import BaseEstimator, TransformerMixin

class ThresholdAdjuster(BaseEstimator, TransformerMixin):
   def __init__(self, threshold):
       self.threshold = threshold

   def fit(self, X, y):
       return self

   def transform(self, X):
       y_pred_prob = self.clf.predict_proba(X)[:, 1]
       y_pred = (y_pred_prob >= self.threshold).astype(int)
       return y_pred.reshape(-1, 1)

This transformer takes the classifier (clf) as input and returns the predicted classes based on the specified threshold. The ThresholdAdjuster transformer can then be added to the pipeline like this:

clf = Pipeline(steps=[
   ('preprocessor', preprocessor),
   ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced')),
   ('threshold_adjuster', ThresholdAdjuster(threshold=0.4))
])

Result

This time around, an instance of a custom class was added to the pipeline after the RandomForestClassifier, but this code could not run without error: every intermediate step of a scikit-learn pipeline must implement both fit and transform, and RandomForestClassifier has no transform method, so no other component can follow it in the pipeline.
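
The constraint is easy to reproduce in isolation. The sketch below is an illustration on tiny synthetic data, not the experiment's code: it places a classifier before another step and captures the resulting TypeError. The step names and the use of StandardScaler as the final step are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic data, only so the pipeline can be fitted
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.array([1, 2] * 10)

forest = RandomForestClassifier(n_estimators=5, random_state=42)

# The classifier can be fitted, but it is not a transformer
print(hasattr(forest, "fit"), hasattr(forest, "transform"))

error_message = None
try:
    # A classifier as a non-final step: pipeline validation rejects it
    Pipeline(steps=[
        ("classifier", forest),
        ("final_step", StandardScaler()),
    ]).fit(X, y)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```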

However, it’s worth mentioning that although the provided code had an error, it’s still pretty impressive that it knew how to properly define a custom class by extending scikit-learn classes and overriding some of their methods in an attempt to achieve the desired result.

After several attempts to obtain a different output from ChatGPT by regenerating the response, the error persisted, so we tried to explain what the problem was in the hope of getting a working solution.

Prompt 7

The ThresholdAdjuster can’t be added in the pipeline after the RandomForestClassifier because the RandomForestClassifier does not have the methods fit and transform.

ChatGPT

You’re right, I apologize for the mistake. Here’s a modified version of the code where the ThresholdAdjuster is placed after the RandomForestClassifier in the pipeline:

# Classification pipeline with threshold adjustment
class ThresholdAdjuster(BaseEstimator, TransformerMixin):
   def __init__(self, threshold):
       self.threshold = threshold

   def fit(self, X, y=None):
       return self

   def predict(self, X):
       y_pred_prob = self.clf.predict_proba(X)[:, 1]
       y_pred = (y_pred_prob >= self.threshold).astype(int)
       return y_pred

clf = Pipeline(steps=[
   ('preprocessor', preprocessor),
   ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced')),
   ('threshold_adjuster', ThresholdAdjuster(threshold=0.4))
])

# Train the classifier
clf.fit(X_train, y_train)

In this modified code, the ThresholdAdjuster transformer is now defined inside the pipeline as a class, and it takes the threshold parameter as input. The predict method of the ThresholdAdjuster transformer is then used to adjust the threshold of the predicted classes. Finally, the pipeline is fitted and the threshold-adjusted predictions are made on the testing set.

Result

After acknowledging the mistake, the response continued by specifying that the ThresholdAdjuster will still be added in the pipeline after the RandomForestClassifier.

This time, the only difference between the initial code and the current one was renaming the transform method to predict, but the provided code still produced the same error as before.

Instead of just explaining why the code fails, we tried including the returned error message in the prompt to see if the bot would get a better understanding of the problem and how to fix it.

Did it do better this time around?

Prompt 8

When I tried running the code I got the following error: TypeError: “All intermediate steps should be transformers and implement fit and transform or be the string ‘passthrough’ ‘RandomForestClassifier(class_weight=‘balanced’, random_state=42)’ (type <class ‘sklearn.ensemble._forest.RandomForestClassifier’>) doesn’t”. How can the code be modified to fix this?

ChatGPT

This error message is occurring because RandomForestClassifier does not have a transform method, which is required by scikit-learn pipelines.

To fix this, we can wrap the RandomForestClassifier inside a FunctionTransformer which converts the predict method of the RandomForestClassifier to a transformer that implements the transform method. Here’s the updated code:

clf = Pipeline(steps=[
   ('preprocessor', preprocessor),
   ('classifier', FunctionTransformer(RandomForestClassifier(random_state=42, class_weight='balanced').predict)),
   ('threshold_adjuster', ThresholdAdjuster(threshold=0.4))
])

In this updated code, we use FunctionTransformer to transform the predict method of RandomForestClassifier into a transformer that can be used in the pipeline. We also move the ThresholdAdjuster transformer after the classifier transformer in the pipeline.

Result

This time around, the generated code produced a different error: “NotFittedError: This RandomForestClassifier instance is not fitted yet. Call ‘fit’ with appropriate arguments before using this estimator.”

This happened because the RandomForestClassifier in the pipeline was replaced with a FunctionTransformer wrapping the predict method of a RandomForestClassifier instance that is never fitted.
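
A minimal reproduction, on placeholder data and outside any pipeline, shows the mechanism: fitting the FunctionTransformer does not fit the forest whose predict method it wraps, so the first call to transform fails.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import FunctionTransformer

# Placeholder data; the values are irrelevant to the error
X = np.zeros((4, 2))
y = np.array([1, 2, 1, 2])

# The forest is only referenced through its bound predict method
forest = RandomForestClassifier(n_estimators=5, random_state=42)
wrapper = FunctionTransformer(forest.predict)

raised = False
try:
    # fit() on the wrapper never fits the forest; transform() then calls forest.predict
    wrapper.fit(X, y).transform(X)
except NotFittedError:
    raised = True

print(raised)
```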

Once again, we gave the chatbot the returned error to see if it could find a working solution.

Prompt 9

When I run the code now I get a different error: “NotFittedError: This RandomForestClassifier instance is not fitted yet. Call ‘fit’ with appropriate arguments before using this estimator.". How can the code be modified to fix this?

ChatGPT

This error can occur if the ThresholdAdjuster transformer is called before the RandomForestClassifier has been fitted.

To fix this, we can define a custom estimator that wraps the RandomForestClassifier and adds a predict_proba method to it. This will allow us to fit the entire pipeline, including the ThresholdAdjuster, before making predictions.

Here’s the updated code:

# Custom estimator for RandomForestClassifier with predict_proba method
class CustomRandomForestClassifier(RandomForestClassifier):
   def predict_proba(self, X):
       return super().predict_proba(X)[:, 1]

# Classification pipeline with threshold adjustment
class ThresholdAdjuster(BaseEstimator, TransformerMixin):
   def __init__(self, threshold):
       self.threshold = threshold

   def fit(self, X, y=None):
       self.clf = CustomRandomForestClassifier(random_state=42, class_weight='balanced')
       self.clf.fit(X, y)
       return self

   def predict(self, X):
       y_pred_prob = self.clf.predict_proba(X)[:, 1]
       y_pred = (y_pred_prob >= self.threshold).astype(int)
       return y_pred

clf = Pipeline(steps=[
   ('preprocessor', preprocessor),
   ('classifier', ThresholdAdjuster(threshold=0.4))
])


Result

This time, the solution was to first create a custom class CustomRandomForestClassifier that overrides the predict_proba method to return the predicted probabilities for the positive class only, then to modify the fit method of the ThresholdAdjuster to instantiate and fit an instance of the custom random forest classifier.

It also modified the predict method of ThresholdAdjuster to obtain the predicted probabilities by calling predict_proba on its instance of CustomRandomForestClassifier, which already returns the probabilities for the positive class only, and then tried to select the positive-class probabilities a second time. This raised the following error: “IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed”.

This error could have been avoided either by using an instance of RandomForestClassifier instead of the custom one, or by not selecting the positive-class probabilities a second time, that is, by replacing

y_pred_prob = self.clf.predict_proba(X)[:, 1]

with

y_pred_prob = self.clf.predict_proba(X)
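
The double selection can be reproduced with plain NumPy. The probability values below are illustrative and simply mimic the two-column shape that predict_proba returns for a binary problem.

```python
import numpy as np

# Illustrative predict_proba output for three samples: one column per class
proba = np.array([[0.8, 0.2],
                  [0.3, 0.7],
                  [0.6, 0.4]])

# First selection: a 1-D array holding the second column
positive = proba[:, 1]

error = None
try:
    # Second selection fails: a 1-D array has no second axis to index
    positive[:, 1]
except IndexError as exc:
    error = str(exc)

print(positive.shape, error)
```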

Another, more subtle error was the assumption that the labels of the negative and positive classes are [0, 1]. The provided code (with the changes mentioned above) produces the correct result only if the labels of the negative and positive classes coincide with the indices of those classes, which was not the case for this dataset, where the labels were “1” and “2”.

The way to solve this was either to encode the labels from the start so that they match the indices, or to change this line in the predict method of ThresholdAdjuster from

y_pred = (y_pred_prob >= self.threshold).astype(int)

to

y_pred = self.clf.classes_[(y_pred_prob >= self.threshold).astype(int)]
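
Putting these observations together, one possible way to keep the threshold inside the pipeline is to make the thresholding object the final step and let it own the classifier. The sketch below is not the code ChatGPT produced nor necessarily what we would ship: ThresholdClassifier is a hypothetical wrapper, the data is synthetic with labels 1 and 2, and a StandardScaler stands in for the ColumnTransformer preprocessor.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: fits a binary classifier, predicts with a custom threshold."""

    def __init__(self, estimator, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in stays untouched
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)

    def predict(self, X):
        # Column 1 holds the probability of classes_[1], treated here as the positive class
        is_positive = self.predict_proba(X)[:, 1] >= self.threshold
        # Map 0/1 indices back to the original labels
        return self.classes_[is_positive.astype(int)]


# Synthetic, imbalanced stand-in data with labels 1 and 2
X, y01 = make_classification(n_samples=400, weights=[0.7, 0.3], random_state=42)
y = y01 + 1

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", ThresholdClassifier(
        RandomForestClassifier(n_estimators=50, random_state=42, class_weight="balanced"),
        threshold=0.4,
    )),
])
pipe.fit(X, y)
y_pred = pipe.predict(X)
print(np.unique(y_pred))
```

Because the wrapper is the last step and exposes fit and predict, the pipeline validates, and mapping through classes_ returns the original labels rather than 0 and 1.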

Conclusion

Overall, ChatGPT is very good at providing basic code for training models. It has knowledge of publicly available datasets and code libraries, and it’s very good at describing different approaches and techniques used in the process of training a machine learning model in detail.

Because ChatGPT can create code templates, ML engineers can accelerate their iteration cycle significantly instead of writing every piece of code from scratch.

However, when faced with the task of providing a more complex implementation, ChatGPT struggles to give a correct solution or code that runs without errors. Another limitation lies in its inability to autonomously build upon its own prior ideas and deductions without guidance.

ChatGPT provides effective general approaches to address the problem at hand, but it lacks the ability to determine the correct specific steps to take from the outset, without user testing and feedback on its responses.

On top of this, there were some instances when it continued to produce the same error, even when the issue with the provided code was explained.

Obtaining significant insights is a challenging process for humans, as it involves building upon prior experiences and hard-won knowledge. ChatGPT has yet to develop this ability and depends on a “prompt engineer” for direction.

Our takeaway?

Despite being a valuable starting point for Machine Learning concepts and strategies, GPT-3.5 currently lacks the cognitive depth required for self-sufficient ML engineering.

Needless to say, one should always err on the side of caution when building ML models and remain committed to the responsible use of AI. Generative AI is evolving very fast, and GPT-4 is already considered much more capable of complex reasoning than its predecessor.

Keeping this in mind, we do believe that ChatGPT has a place in the future of Machine Learning and, as we move towards more sophisticated generative AI models, it is our responsibility to ensure accountability in a rapidly evolving technological landscape.

At Lumenova AI, we have made it our mission to empower companies to make Responsible AI a part of their DNA. In an age where opaque decision-making is no longer enough, we are committed to delivering value through our state-of-the-art AI Trust Platform that enables businesses to make AI ethical, fair, and transparent.
