# -*- coding: utf-8 -*-
"""Word Embedding and Text Classification.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1awWwqM79EEHfcHblS3i7N3aGZa4WC2dO
# Homework 2, MSBC.5190 Modern Artificial Intelligence S23
**Teammates: Jozie Wille, Lexi Dingle, Charlie Rudy**
**Teamname: XX**
Handout 03/10/2022 4pm, **due 03/24/2022 by 4pm**. Please submit through Canvas. Each team only needs to submit one copy.
Important information about submission:
- Write all code, text (answers), and figures in the notebook.
- Please make sure that the submitted notebook has been run and the cell outputs are visible.
- Please print the notebook as PDF and submit it together with the notebook. Your submission should contain two files: `homework2-teamname.ipynb` and `homework2-teamname.pdf`
The goals of the homework are threefold:
1. Explore word embedding
2. Understand contextual word embedding using BERT
3. Text classification with both traditional machine learning methods and deep learning methods
**A note about GPU**: Use a GPU runtime; otherwise the deep learning models in this homework will be quite slow to train.
First, import the packages or modules required for this homework.
"""
!pip install gensim~=3.8.3
################################################################################
# TODO: Fill in your codes #
# Import packages or modules #
################################################################################
import tensorflow as tf
import numpy as np
from tensorflow import keras
"""## Part I: Explore Word Embedding (15%)
Word embeddings are useful representations of words that capture information about word meaning and usage. They are used as a fundamental component for downstream NLP tasks, e.g., text classification. In this part, we will explore the embeddings produced by [GloVe (global vectors for word representation)](https://nlp.stanford.edu/projects/glove/). It is similar to Word2Vec but differs in its underlying methodology: in GloVe, word embeddings are learned from global word-word co-occurrence statistics. Both Word2Vec and GloVe tend to produce vector-space embeddings that perform similarly in downstream NLP tasks.
We first load the GloVe vectors
"""
import gensim.downloader as api
# download the model and return as object ready for use
glove_model = api.load('glove-wiki-gigaword-100')
#load the word vectors from the model
glove_word_vectors = glove_model.wv
"""Take a look at the vocabulary size and dimensionality of the embedding space"""
print('vocabulary size = ', len(glove_word_vectors.vocab))
print('embedding dimensionality = ', glove_word_vectors['happy'].shape)
"""
What exactly does an embedding look like?"""
# Check word embedding for 'happy'
# You can access the embedding of a word with glove_word_vectors[word] if word
# is in the vocabulary
glove_word_vectors['happy']
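"""As an optional sanity check (not required by the handout), a nearest-neighbour query shows that the embedding space groups semantically related words: gensim's `most_similar` ranks vocabulary words by cosine similarity to the query word. The choice of 'happy' and `topn=5` is arbitrary."""
# Optional: the five words closest to 'happy' in the GloVe space
print(glove_model.most_similar('happy', topn=5))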
"""With word embeddings learned from GloVe or Word2Vec, words with similar semantic meanings tend to have vectors that are close together. Please code and calculate the **cosine similarities** between words based on their embeddings (i.e., word vectors).
For each of the following words in occupation, compute its cosine similarity to 'woman' and to 'man', and check which gender it is more similar to.
*occupation = {homemaker, nurse, receptionist, librarian, socialite, hairdresser, nanny, bookkeeper, stylist, housekeeper, maestro, skipper, protege, philosopher, captain, architect, financier, warrior, broadcaster, magician}*
**Inline Question #1:**
- Fill in the table below with cosine similarities between words in occupation list and {woman, man}. Please show only two digits after decimal.
- Which words are more similar to 'woman' than to 'man'?
Words that are significantly more similar to 'woman' are homemaker, nurse, receptionist, socialite, hairdresser, and housekeeper.
- Which words are more similar to 'man' than to 'woman'?
Words that are more similar to 'man' than to 'woman' are maestro, captain, architect, financier, warrior, and magician.
- Do you see any issue here? What do you think might cause these issues?
There is an issue here because the similarities track gender stereotypes. We think these issues stem from the corpus the algorithm was trained on: if the text comes from a period when these stereotypes were heavily enforced, the learned vectors will be skewed in the same way.
**Your Answer:**
| `similarity`| woman | man |
|-------------|-----------|--------------|
| homemaker | 0.43 | 0.24 |
| nurse | 0.61 | 0.46 |
| receptionist| 0.33 | 0.19 |
| librarian | 0.34 | 0.23 |
| socialite | 0.42 | 0.27 |
| hairdresser | 0.39 | 0.26 |
| nanny | 0.36 | 0.29 |
| bookkeeper | 0.21 | 0.14 |
| stylist | 0.31 | 0.25 |
| housekeeper | 0.46 | 0.31 |
| maestro | -0.015 | 0.14 |
| skipper | 0.15 | 0.34 |
| protege | 0.12 | 0.20 |
| philosopher | 0.23 | 0.28 |
| captain | 0.31 | 0.53 |
| architect | 0.22 | 0.30 |
| financier | 0.14 | 0.26 |
| warrior | 0.39 | 0.51 |
| broadcaster | 0.23 | 0.25 |
| magician | 0.27 | 0.38 |
"""
################################################################################
# TODO: Fill in your codes                                                     #
################################################################################
# Cosine similarity between each occupation word and 'woman' / 'man'
occupations = ['homemaker', 'nurse', 'receptionist', 'librarian', 'socialite',
               'hairdresser', 'nanny', 'bookkeeper', 'stylist', 'housekeeper',
               'maestro', 'skipper', 'protege', 'philosopher', 'captain',
               'architect', 'financier', 'warrior', 'broadcaster', 'magician']

for occupation in occupations:
    sim_woman = glove_model.similarity(w1='woman', w2=occupation)
    sim_man = glove_model.similarity(w1='man', w2=occupation)
    print(f'{occupation:<12s}  woman: {sim_woman:.2f}  man: {sim_man:.2f}')
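"""As an extra illustration of the issue discussed above (a hedged sketch, not required by the handout): projecting each occupation vector onto the 'woman' minus 'man' direction gives a single signed score per word, positive when the word leans toward 'woman' and negative when it leans toward 'man'. This reuses the `occupations` list defined above."""
# Hedged sketch: project occupation vectors onto the (woman - man) direction
gender_direction = glove_model['woman'] - glove_model['man']
gender_direction = gender_direction / np.linalg.norm(gender_direction)
for occupation in occupations:
    v = glove_model[occupation]
    score = np.dot(v / np.linalg.norm(v), gender_direction)
    print(f'{occupation:<12s} {score:+.2f}')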
"""## Part II Understand contextual word embedding using BERT (15%)
A big difference between Word2Vec and BERT is that Word2Vec learns context-free word representations, i.e., the embedding for 'orange' is the same in "I love eating oranges" and in "The sky turned orange". BERT, on the other hand, produces contextual word representations, i.e., embeddings for the same word in different contexts should be different.
For example, let us compare the context-based embedding vectors for 'orange' in the following three sentences using Bert:
* "I love eating oranges"
* "My favorite fruits are oranges and apples"
* "The sky turned orange"
Same as in "Lab 4 Natural Language Processing", we use the BERT model and tokenizer from the Huggingface transformer library ([1](https://huggingface.co/course/chapter1/1), [2](https://huggingface.co/docs/transformers/quicktour))
"""
!pip install -q transformers
from transformers import BertTokenizer, TFBertModel
"""We use the 'bert-base-cased' from Huggingface as the underlying BERT model and the associated tokenizer."""
bert_model = TFBertModel.from_pretrained('bert-base-cased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
example_sentences = ["I love eating oranges",
"My favorite fruits are oranges and apples",
"The sky turned orange"]
"""Let us start by tokenizing the example sentences. """
# Check how Bert tokenize each sentence
# This helps us identify the location of 'orange' in the tokenized vector
for sen in example_sentences:
    print(bert_tokenizer.tokenize(sen))
"""Notice that the prefix '##' indicates that the token is a continuation of the previous one. This also helps us identify location of 'orange' in the tokenized vector, e.g., 'orange' is the 4th token in the first sentence. Note that here the tokenize() function just splits a text into words, and doesn't add a 'CLS' (classification token) or a 'SEP' (separation token) to the text.
Next, we use the tokenizer to transfer the example sentences to input that the Bert model expects.
"""
bert_inputs = bert_tokenizer(example_sentences,
padding=True,
return_tensors='tf')
bert_inputs
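"""To make the id sequences easier to read (a quick optional check), we can map the first sentence's input_ids back to tokens; this shows the special [CLS], [SEP], and [PAD] tokens the tokenizer added."""
# Optional: map ids back to tokens to see [CLS], [SEP], and [PAD]
print(bert_tokenizer.convert_ids_to_tokens(list(bert_inputs['input_ids'][0].numpy())))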
"""So there are actually three outputs: the input ids (starting with '101' for the '[CLS]' token), the token_type_ids which are usefull when one has distinct segments, and the attention masks which are used to mask out padding tokens.
Please refer to our Lab 4 for more details about input_ids, token_type_ids, and attention_masks.
More resources:
* https://huggingface.co/docs/transformers/preprocessing
* https://huggingface.co/docs/transformers/tokenizer_summary
Now, let us get the BERT encoding of our example sentences.
"""
bert_outputs = bert_model(bert_inputs)
print('shape of first output: \t\t', bert_outputs[0].shape)
print('shape of second output: \t', bert_outputs[1].shape)
"""There are two outputs here: one with dimensions [3, 10, 768] and one with [3, 768]. The first one [batch_size, sequence_length, embedding_size] is the output of the last layer of the Bert model and are the contextual embeddings of the words in the input sequence. The second output [batch_size, embedding_size] is the embedding of the first token of the sequence (i.e., classification token).
Note you can also get the first output through bert_output.last_hidden_state (see below, also check https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bert#transformers.TFBertModel)
We need the first output to get contextualized embeddings for 'orange' in each sentence.
"""
bert_outputs[0]
bert_outputs.last_hidden_state
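"""For reference (a hedged aside): the per-sentence [CLS] vectors can also be taken directly from the last hidden state. Note they are not identical to bert_outputs[1], which additionally passes the [CLS] vector through a dense + tanh pooler layer."""
# [CLS] vector of each sentence, taken from the last hidden state
cls_vectors = bert_outputs.last_hidden_state[:, 0, :]
print(cls_vectors.shape)  # (3, 768)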
"""Now, we get the embeddings of 'orange' in each sentence by simply finding the 'orange'-token positions in the embedding output and extract the proper components:"""
orange_1 = bert_outputs[0][0, 4]
orange_2 = bert_outputs[0][1, 5]
orange_3 = bert_outputs[0][2, 4]
oranges = [orange_1, orange_2, orange_3]
"""We calculate pair-wise cosine similarities:"""
def cosine_similarities(vecs):
    # Print one row per vector: cosine similarity to every vector in vecs
    for v_1 in vecs:
        similarities = ''
        for v_2 in vecs:
            similarities += ('\t' + str(np.dot(v_1, v_2) /
                             np.sqrt(np.dot(v_1, v_1) * np.dot(v_2, v_2)))[:4])
        print(similarities)
cosine_similarities(oranges)
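"""An equivalent check (a hedged alternative, not required): scikit-learn's cosine_similarity computes the full pairwise matrix at once, which is handy for verifying the hand-rolled function above."""
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between the three 'orange' embeddings
orange_matrix = np.stack([v.numpy() for v in oranges])
print(np.round(cosine_similarity(orange_matrix), 2))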
"""The similarity metrics make sense. The 'orange' in "The sky turned orange" is different from the rest.
Next, please compare the contextual embedding vectors of 'bank' in the following four sentences:
* "I need to bring my money to the bank today"
* "I will need to bring my money to the bank tomorrow"
* "I had to bank into a turn"
* "The bank teller was very nice"
**Inline Question #1:**
- Please calculate the pair-wise cosine similarities between 'bank' in the four sentences and fill in the table below. (Note: bank_i represents 'bank' in the i-th sentence.)
- Please explain the results. Does it make sense?
**Your Answer:**
| `similarity`| bank_1 | bank_2 | bank_3 | bank_4 |
|-------------|--------|--------|--------|--------|
| bank_1      | 1.00   | 0.99   | 0.59   | 0.86   |
| bank_2      | 0.99   | 1.00   | 0.59   | 0.87   |
| bank_3      | 0.59   | 0.59   | 1.00   | 0.62   |
| bank_4      | 0.86   | 0.87   | 0.62   | 1.00   |

The results make sense: sentences 1, 2, and 4 use 'bank' as a financial institution, so their embeddings are close to each other, while sentence 3 uses 'bank' as a verb (to bank into a turn), so its embedding is noticeably less similar to the other three.
"""
################################################################################
# TODO: Fill in your codes # #
################################################################################
bank_sentences = ["I need to bring my money to the bank today",
                  "I will need to bring my money to the bank tomorrow",
                  "I had to bank into a turn",
                  "The bank teller was very nice"]
for bank in bank_sentences:
    print(bert_tokenizer.tokenize(bank))
bert_inputs2 = bert_tokenizer(bank_sentences,
padding=True,
return_tensors='tf')
bert_inputs2
bert_outputs2 = bert_model(bert_inputs2)
print('shape of first output: \t\t', bert_outputs2[0].shape)
print('shape of second output: \t', bert_outputs2[1].shape)
bert_outputs2[0]
bert_outputs2.last_hidden_state
bank_1 = bert_outputs2[0][0, 9]
bank_2 = bert_outputs2[0][1, 10]
bank_3 = bert_outputs2[0][2, 4]
bank_4 = bert_outputs2[0][3, 2]
banks = [bank_1, bank_2, bank_3, bank_4]
cosine_similarities(banks)
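"""The token positions above were read off the printed tokenizations by hand. Below is a small hypothetical helper (a hedged sketch, not part of the handout) that finds the position automatically; it assumes the target word is not split into sub-word pieces, which holds for 'bank' with this tokenizer. The +1 accounts for the leading [CLS] token."""
def token_position(sentence, target, tokenizer=bert_tokenizer):
    # Position of `target` in the model input, offset by 1 for [CLS]
    return tokenizer.tokenize(sentence).index(target) + 1

for i, sen in enumerate(bank_sentences):
    print(f'bank_{i + 1} is at position {token_position(sen, "bank")}')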
"""## Part III Text classification
In this part, you will build text classifiers that try to infer whether tweets from [@realDonaldTrump](https://twitter.com/realDonaldTrump) were written by Trump himself or by a staff person.
This is an example of binary classification on a text dataset.
It is known that Donald Trump uses an Android phone, and it has been observed that some of his tweets come from Android while others come from other devices (most commonly iPhone). It is widely believed that Android tweets are written by Trump himself, while iPhone tweets are written by other staff. For more information, you can read this [blog post by David Robinson](http://varianceexplained.org/r/trump-tweets/), written prior to the 2016 election, which finds a number of differences in the style and timing of tweets published under these two devices. (Some tweets are written from other devices, but for simplicity the dataset for this assignment is restricted to these two.)
This is a classification task known as "authorship attribution", which is the task of inferring the author of a document when the authorship is unknown. We will see how accurately this can be done with linear classifiers using word features.
You might find it familiar: Yes! We are using the same data set as your homework 2 from MSBC 5180.
### Tasks
In this section, you will build two text classifiers: one with a traditional machine learning method that you studied in MSBC.5190 and one with a deep learning method.
* For the first classifier, you can use any non-deep learning based methods. You can use your solution to Homework 2 of MSBC 5180 here.
* For the second classifier, you may try the following methods
* Fine-tune BERT (similar to our lab 4 Fine-tune BERT for Sentiment Analysis)
* Use pre-trained word embedding (useful to check: https://keras.io/examples/nlp/pretrained_word_embeddings/)
* Train a deep neural network (e.g., RNN, Bi-LSTM) from scratch (similar to notebooks from our textbook: https://github.com/the-deep-learners/deep-learning-illustrated/blob/master/notebooks/bi_lstm_sentiment_classifier.ipynb)
You may want to split the current training data into training and validation sets to help with model selection (a minimal sketch appears after the data is loaded below).
### Load the Data Set
#### Sample code to load raw text ####
Please download `tweets.train.tsv` and `tweets.test.tsv` from Canvas (Module Assignment) and upload them to Google Colab (or place them in your Google Drive, as the code below assumes). Here we load the raw text data into text_train and text_test.
"""
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Mount Google Drive first so the data files below are accessible
from google.colab import drive
drive.mount('/content/drive')

# training set
df_train = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.train.tsv', sep='\t', header=None)
text_train = df_train.iloc[0:, 1].values.tolist()
Y_train = df_train.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])

# test set
df_test = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.test.tsv', sep='\t', header=None)
text_test = df_test.iloc[0:, 1].values.tolist()
Y_test = df_test.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])
"""Let us take a quick look of some training examples"""
text_train[:5]
y_train[:5]
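"""If you want a held-out validation set for model selection (as suggested in the Tasks section), here is a minimal sketch using scikit-learn's train_test_split; the 10% split, stratification, and random seed are arbitrary choices, not part of the handout."""
from sklearn.model_selection import train_test_split

# Hold out 10% of the training tweets for validation (stratified by label)
text_tr, text_val, y_tr, y_val = train_test_split(
    text_train, y_train, test_size=0.1, stratify=y_train, random_state=42)
print(len(text_tr), len(text_val))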
"""#### Sample code to preprocess data for BERT ####
The pre-processing step is similar to Lab 4.
Feel free to skip it if you want to preprocess the data differently or use methods other than BERT.
"""
# The longest tweet in the data is 75 tokens, so we use that as max_length
max_length = 75
x_train = bert_tokenizer(text_train,
                         max_length=max_length,
                         truncation=True,
                         padding='max_length',
                         return_tensors='tf')
y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])
x_test = bert_tokenizer(text_test,
                        max_length=max_length,
                        truncation=True,
                        padding='max_length',
                        return_tensors='tf')
y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])
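"""A quick optional check of the claim above about tweet lengths (a hedged sketch; assumes text_train and bert_tokenizer are already defined):"""
# Longest training tweet measured in BERT word pieces
lengths = [len(bert_tokenizer.tokenize(t)) for t in text_train]
print('max tokenized length =', max(lengths))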
"""### Your Solution 1: A traditional machine learning approach (30%)
Please implement your text classifier using a traditional machine learning method.
**Inline Question #1:**
- What machine learning model did you use?
- What are the features used in this model?
- What is the model's performance in the test data?
**Your Answer:**
We used a Multinomial Naive Bayes classifier for this text classification task.
The features are bag-of-words token counts: scikit-learn's CountVectorizer converts each tweet into a vector of word counts, which is the input to the Naive Bayes classifier.
The accuracy on the test data is 0.859.
"""
################################################################################
# TODO: Fill in your codes #
################################################################################
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# training set
df_train = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.train.tsv', sep='\t', header=None)
text_train = df_train.iloc[0:, 1].values.tolist()
Y_train = df_train.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])
# test set
df_test = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.test.tsv', sep='\t', header=None)
text_test = df_test.iloc[0:, 1].values.tolist()
Y_test = df_test.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])
# vectorize text data
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)
# train Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)
# make predictions on test set
y_pred = clf.predict(X_test)
# calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
"""### Your Solution 2: A deep learning apporach (30%)
Please implement your text classifier using a deep learning method
**Inline Question #1:**
- What deep learning model did you use?
- Please briefly explain the input, output, and layers (e.g., what does each layer do) of your model.
- What is the model's performance in the test data?
- Is it better or worse than Solution 1? What might be the cause?
**Your Answer:**
We trained an LSTM classifier from scratch in Keras. The input is each tweet encoded as a padded sequence of word indices; an Embedding layer maps each index to a 100-dimensional vector, an LSTM layer with 128 units (plus dropout) reads the sequence and summarizes it into a single vector, and a final Dense layer with a sigmoid activation outputs the probability that the tweet was posted from Android.
The model's accuracy on the test data was 0.82, which is lower than Solution 1's 0.859.
A potential cause is that a deep model trained from scratch on a relatively small set of short tweets tends to overfit, whereas the bag-of-words Naive Bayes model is a strong baseline for this kind of authorship attribution task.
"""
################################################################################
# TODO: Fill in your codes #
################################################################################
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.callbacks import EarlyStopping
import pandas as pd
import numpy as np
df_train = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.train.tsv', sep='\t', header=None)
df_test = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.test.tsv', sep='\t', header=None)
# Set hyperparameters
max_nb_words = 50000
max_sequence_length = 250
epochs = 5
batch_size = 64
embedding_dim = 100
tokenizer = Tokenizer(num_words=max_nb_words, lower=True, oov_token='<OOV>')
tokenizer.fit_on_texts(df_train.iloc[:,1].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X_train = tokenizer.texts_to_sequences(df_train.iloc[:,1].values)
X_train = pad_sequences(X_train, maxlen=max_sequence_length)
y_train = np.array([1 if v == 'Android' else 0 for v in df_train.iloc[:,0].values])
X_test = tokenizer.texts_to_sequences(df_test.iloc[:,1].values)
X_test = pad_sequences(X_test, maxlen=max_sequence_length)
y_test = np.array([1 if v == 'Android' else 0 for v in df_test.iloc[:,0].values])
# Model
model = Sequential()
model.add(Embedding(max_nb_words, embedding_dim, input_length=X_train.shape[1]))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# Train
early_stopping = EarlyStopping(monitor='val_loss', patience=2, verbose=0, mode='auto')
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1, callbacks=[early_stopping])
# Accuracy
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
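"""One variation suggested in the Tasks section is to initialize the Embedding layer with pre-trained word vectors instead of training it from scratch. Below is a minimal hedged sketch using the GloVe vectors loaded in Part I (it assumes glove_model, tokenizer, word_index, and the hyperparameters above are still in memory, and that embedding_dim matches the 100-dimensional GloVe model); `glove_lstm` is our own variable name."""
# Build an embedding matrix: row i holds the GloVe vector of the word with index i
num_words = min(max_nb_words, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i < num_words and word in glove_model:
        embedding_matrix[i] = glove_model[word]

# Same architecture as above, but with frozen, GloVe-initialized embeddings
glove_lstm = Sequential()
glove_lstm.add(Embedding(num_words, embedding_dim,
                         weights=[embedding_matrix],
                         input_length=max_sequence_length,
                         trainable=False))
glove_lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
glove_lstm.add(Dense(1, activation='sigmoid'))
glove_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
glove_lstm.summary()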
"""## Final note (10%)
Similar to Homework 1, 10% of the total grade is allocated based on model performance. Teams with higher performance scores (max of Solution 1 and Solution 2) get a higher grade.
"""
def create_classification_model(hidden_size=200,
                                train_layers=-1,
                                optimizer=tf.keras.optimizers.Adam()):
    """
    Build a simple classification model with BERT. Let's keep it simple and
    not add dropout, layer norms, etc.
    """
    input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32,
                                      name='input_ids_layer')
    token_type_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32,
                                           name='token_type_ids_layer')
    attention_mask = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32,
                                           name='attention_mask_layer')
    bert_inputs = {'input_ids': input_ids,
                   'token_type_ids': token_type_ids,
                   'attention_mask': attention_mask}

    # Restrict training to the train_layers outermost transformer layers
    # (train_layers == -1 means fine-tune everything)
    if train_layers != -1:
        retrain_layers = []
        for retrain_layer_number in range(train_layers):
            layer_code = '_' + str(11 - retrain_layer_number)
            retrain_layers.append(layer_code)
        for w in bert_model.weights:
            if not any([x in w.name for x in retrain_layers]):
                w._trainable = False

    bert_out = bert_model(bert_inputs)
    # Take the [CLS] vector (first position of the last hidden state)
    classification_token = tf.keras.layers.Lambda(
        lambda x: x[:, 0, :], name='get_first_vector')(bert_out[0])
    hidden = tf.keras.layers.Dense(
        hidden_size, name='hidden_layer')(classification_token)
    classification = tf.keras.layers.Dense(
        1, activation='sigmoid', name='classification_layer')(hidden)

    classification_model = tf.keras.Model(
        inputs=[input_ids, token_type_ids, attention_mask],
        outputs=[classification])
    classification_model.compile(
        optimizer=optimizer,
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
        metrics='accuracy')
    return classification_model
classification_model = create_classification_model()
classification_model.fit([x_train.input_ids,
x_train.token_type_ids,
x_train.attention_mask],
y_train,
validation_data=([x_test.input_ids,
x_test.token_type_ids,
x_test.attention_mask], y_test),
epochs=5,
batch_size=8)
classification_model.predict([x_train.input_ids,
x_train.token_type_ids,
x_train.attention_mask],
batch_size=8,
steps=2)
try:
    del classification_model
except:
    pass
try:
    del bert_model
except:
    pass
tf.keras.backend.clear_session()
bert_model = TFBertModel.from_pretrained('bert-base-cased')
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased",
num_labels=2)
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=tf.metrics.SparseCategoricalAccuracy(),
)
# Pass the tokenizer output as a dict so inputs are matched by name
# rather than by position
model.fit(dict(x_train),
          y_train,
          validation_data=(dict(x_test), y_test),
          epochs=3,
          batch_size=8)
model_alternative = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=1)
model_alternative.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    # num_labels=1 gives one raw logit per example, so compute the loss from
    # logits and threshold accuracy at 0 instead of 0.5
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)]
)
# Pass the tokenizer output as a dict so inputs are matched by name
model_alternative.fit(dict(x_train),
                      y_train,
                      validation_data=(dict(x_test), y_test),
                      epochs=3,
                      batch_size=8)
y_train