# -*- coding: utf-8 -*-
"""Word Embedding and Text Classification.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1awWwqM79EEHfcHblS3i7N3aGZa4WC2dO
# Homework 2, MSBC.5190 Modern Artificial Intelligence S23
**Teammates: Jozie Wille, Lexi Dingle, Charlie Rudy**
**Teamname: XX**
Handout 03/10/2022 4pm, **due 03/24/2022 by 4pm**. Please submit through Canvas. Each team only needs to submit one copy.
Important information about submission:
- Write all code, text (answers), and figures in the notebook.
- Please make sure that the submitted notebook has been run and the cell outputs are visible.
- Please print the notebook as PDF and submit it together with the notebook. Your submission should contain two files: `homework2-teamname.ipynb` and `homework2-teamname.pdf`
The goals of the homework are threefold:
1. Explore word embedding
2. Understand contextual word embedding using BERT
3. Text classification with both traditional machine learning methods and deep learning methods
**A note about GPU**: Use a GPU runtime; otherwise the deep learning models in this homework will be quite slow to train.
First, import the packages or modules required for this homework.
"""
!pip install gensim~=3.8.3
################################################################################
# TODO: Fill in your codes #
# Import packages or modules #
################################################################################
import tensorflow as tf
import numpy as np
from tensorflow import keras
"""## Part I: Explore Word Embedding (15%)
Word embeddings are useful representations of words that capture information about word meaning and usage. They are used as a fundamental component for downstream NLP tasks, e.g., text classification. In this part, we will explore the embeddings produced by [GloVe (global vectors for word representation)](https://nlp.stanford.edu/projects/glove/). It is similar to Word2Vec but differs in its underlying methodology: in GloVe, word embeddings are learned from global word-word co-occurrence statistics. Both Word2Vec and GloVe tend to produce vector-space embeddings that perform similarly in downstream NLP tasks.
We first load the GloVe vectors
"""
import gensim.downloader as api
# download the model and return as object ready for use
glove_model = api.load('glove-wiki-gigaword-100')
#load the word vectors from the model
glove_word_vectors = glove_model.wv
"""Take a look at the vocabulary size and dimensionality of the embedding space"""
print('vocabulary size = ', len(glove_word_vectors.vocab))
print('embedding dimensionality = ', glove_word_vectors['happy'].shape)
"""
What exactly does an embedding look like?"""
# Check word embedding for 'happy'
# You can access the embedding of a word with glove_word_vectors[word] if word
# is in the vocabulary
glove_word_vectors['happy']
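"""As an optional sanity check (not required by the handout), a nearest-neighbour query shows that the embedding space groups semantically related words: gensim's `most_similar` ranks vocabulary words by cosine similarity to the query word. The choice of 'happy' and `topn=5` is arbitrary."""
# Optional: the five words closest to 'happy' in the GloVe space
print(glove_model.most_similar('happy', topn=5))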
"""With word embeddings learned from GloVe or Word2Vec, words with similar semantic meanings tend to have vectors that are close together. Please code and calculate the **cosine similarities** between words based on their embeddings (i.e., word vectors).
For each of the following words in occupation, compute its cosine similarity to 'woman' and to 'man', and check which gender it is more similar to.
*occupation = {homemaker, nurse, receptionist, librarian, socialite, hairdresser, nanny, bookkeeper, stylist, housekeeper, maestro, skipper, protege, philosopher, captain, architect, financier, warrior, broadcaster, magician}*
**Inline Question #1:**
- Fill in the table below with cosine similarities between words in occupation list and {woman, man}. Please show only two digits after decimal.
- Which words are more similar to 'woman' than to 'man'?
Words that are significantly more similar to 'woman' are homemaker, nurse, receptionist, socialite, hairdresser, and housekeeper.
- Which words are more similar to 'man' than to 'woman'?
Words that are more similar to 'man' than to 'woman' are maestro, captain, architect, financier, warrior, and magician.
- Do you see any issue here? What do you think might cause these issues?
There is an issue here because the similarities track gender stereotypes. We think these issues stem from the corpus the algorithm was trained on: if the text comes from a period when these stereotypes were heavily enforced, the learned vectors will be skewed in the same way.
**Your Answer:**
| `similarity`| woman | man |
|-------------|-----------|--------------|
| homemaker | 0.43 | 0.24 |
| nurse | 0.61 | 0.46 |
| receptionist| 0.33 | 0.19 |
| librarian | 0.34 | 0.23 |
| socialite | 0.42 | 0.27 |
| hairdresser | 0.39 | 0.26 |
| nanny | 0.36 | 0.29 |
| bookkeeper | 0.21 | 0.14 |
| stylist | 0.31 | 0.25 |
| housekeeper | 0.46 | 0.31 |
| maestro | -0.015 | 0.14 |
| skipper | 0.15 | 0.34 |
| protege | 0.12 | 0.20 |
| philosopher | 0.23 | 0.28 |
| captain | 0.31 | 0.53 |
| architect | 0.22 | 0.30 |
| financier | 0.14 | 0.26 |
| warrior | 0.39 | 0.51 |
| broadcaster | 0.23 | 0.25 |
| magician | 0.27 | 0.38 |
"""
################################################################################
# TODO: Fill in your codes                                                     #
################################################################################
# Cosine similarity between each occupation word and 'woman' / 'man'
occupations = ['homemaker', 'nurse', 'receptionist', 'librarian', 'socialite',
               'hairdresser', 'nanny', 'bookkeeper', 'stylist', 'housekeeper',
               'maestro', 'skipper', 'protege', 'philosopher', 'captain',
               'architect', 'financier', 'warrior', 'broadcaster', 'magician']

for occupation in occupations:
    sim_woman = glove_model.similarity(w1='woman', w2=occupation)
    sim_man = glove_model.similarity(w1='man', w2=occupation)
    print(f'{occupation:<12s}  woman: {sim_woman:.2f}  man: {sim_man:.2f}')
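"""As an extra illustration of the issue discussed above (a hedged sketch, not required by the handout): projecting each occupation vector onto the 'woman' minus 'man' direction gives a single signed score per word, positive when the word leans toward 'woman' and negative when it leans toward 'man'. This reuses the `occupations` list defined above."""
# Hedged sketch: project occupation vectors onto the (woman - man) direction
gender_direction = glove_model['woman'] - glove_model['man']
gender_direction = gender_direction / np.linalg.norm(gender_direction)
for occupation in occupations:
    v = glove_model[occupation]
    score = np.dot(v / np.linalg.norm(v), gender_direction)
    print(f'{occupation:<12s} {score:+.2f}')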
"""## Part II Understand contextual word embedding using BERT (15%)
A big difference between Word2Vec and BERT is that Word2Vec learns context-free word representations, i.e., the embedding for 'orange' is the same in "I love eating oranges" and in "The sky turned orange". BERT, on the other hand, produces contextual word representations, i.e., embeddings for the same word in different contexts should be different.
For example, let us compare the context-based embedding vectors for 'orange' in the following three sentences using Bert:
* "I love eating oranges"
* "My favorite fruits are oranges and apples"
* "The sky turned orange"
Same as in "Lab 4 Natural Language Processing", we use the BERT model and tokenizer from the Huggingface transformer library ([1](https://huggingface.co/course/chapter1/1), [2](https://huggingface.co/docs/transformers/quicktour))
"""
!pip install -q transformers
from transformers import BertTokenizer, TFBertModel
"""We use the 'bert-base-cased' from Huggingface as the underlying BERT model and the associated tokenizer."""
bert_model = TFBertModel.from_pretrained('bert-base-cased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
example_sentences = ["I love eating oranges",
"My favorite fruits are oranges and apples",
"The sky turned orange"]
"""Let us start by tokenizing the example sentences. """
# Check how Bert tokenize each sentence
# This helps us identify the location of 'orange' in the tokenized vector
for sen in example_sentences:
    print(bert_tokenizer.tokenize(sen))
"""Notice that the prefix '##' indicates that the token is a continuation of the previous one. This also helps us identify location of 'orange' in the tokenized vector, e.g., 'orange' is the 4th token in the first sentence. Note that here the tokenize() function just splits a text into words, and doesn't add a 'CLS' (classification token) or a 'SEP' (separation token) to the text.
Next, we use the tokenizer to transfer the example sentences to input that the Bert model expects.
"""
bert_inputs = bert_tokenizer(example_sentences,
padding=True,
return_tensors='tf')
bert_inputs
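"""To make the id sequences easier to read (a quick optional check), we can map the first sentence's input_ids back to tokens; this shows the special [CLS], [SEP], and [PAD] tokens the tokenizer added."""
# Optional: map ids back to tokens to see [CLS], [SEP], and [PAD]
print(bert_tokenizer.convert_ids_to_tokens(list(bert_inputs['input_ids'][0].numpy())))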
"""So there are actually three outputs: the input ids (starting with '101' for the '[CLS]' token), the token_type_ids which are usefull when one has distinct segments, and the attention masks which are used to mask out padding tokens.
Please refer to our Lab 4 for more details about input_ids, token_type_ids, and attention_masks.
More resources:
* https://huggingface.co/docs/transformers/preprocessing
* https://huggingface.co/docs/transformers/tokenizer_summary
Now, let us get the BERT encoding of our example sentences.
"""
bert_outputs = bert_model(bert_inputs)
print('shape of first output: \t\t', bert_outputs[0].shape)
print('shape of second output: \t', bert_outputs[1].shape)
"""There are two outputs here: one with dimensions [3, 10, 768] and one with [3, 768]. The first one [batch_size, sequence_length, embedding_size] is the output of the last layer of the Bert model and are the contextual embeddings of the words in the input sequence. The second output [batch_size, embedding_size] is the embedding of the first token of the sequence (i.e., classification token).
Note you can also get the first output through bert_output.last_hidden_state (see below, also check https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/bert#transformers.TFBertModel)
We need the first output to get contextualized embeddings for 'orange' in each sentence.
"""
bert_outputs[0]
bert_outputs.last_hidden_state
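"""For reference (a hedged aside): the per-sentence [CLS] vectors can also be taken directly from the last hidden state. Note they are not identical to bert_outputs[1], which additionally passes the [CLS] vector through a dense + tanh pooler layer."""
# [CLS] vector of each sentence, taken from the last hidden state
cls_vectors = bert_outputs.last_hidden_state[:, 0, :]
print(cls_vectors.shape)  # (3, 768)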
"""Now, we get the embeddings of 'orange' in each sentence by simply finding the 'orange'-token positions in the embedding output and extract the proper components:"""
orange_1 = bert_outputs[0][0, 4]
orange_2 = bert_outputs[0][1, 5]
orange_3 = bert_outputs[0][2, 4]
oranges = [orange_1, orange_2, orange_3]
"""We calculate pair-wise cosine similarities:"""
def cosine_similarities(vecs):
    # Print one row per vector: cosine similarity to every vector in vecs
    for v_1 in vecs:
        similarities = ''
        for v_2 in vecs:
            similarities += ('\t' + str(np.dot(v_1, v_2) /
                             np.sqrt(np.dot(v_1, v_1) * np.dot(v_2, v_2)))[:4])
        print(similarities)
cosine_similarities(oranges)
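"""An equivalent check (a hedged alternative, not required): scikit-learn's cosine_similarity computes the full pairwise matrix at once, which is handy for verifying the hand-rolled function above."""
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarities between the three 'orange' embeddings
orange_matrix = np.stack([v.numpy() for v in oranges])
print(np.round(cosine_similarity(orange_matrix), 2))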
"""The similarity metrics make sense. The 'orange' in "The sky turned orange" is different from the rest.
Next, please compare the contextual embedding vectors of 'bank' in the following four sentences:
* "I need to bring my money to the bank today"
* "I will need to bring my money to the bank tomorrow"
* "I had to bank into a turn"
* "The bank teller was very nice"
**Inline Question #1:**
- Please calculate the pair-wise cosine similarities between 'bank' in the four sentences and fill in the table below. (Note: bank_i represents 'bank' in the i-th sentence.)
- Please explain the results. Does it make sense?
**Your Answer:**
| `similarity`| bank_1 | bank_2 | bank_3 | bank_4 |
|-------------|--------|--------|--------|--------|
| bank_1      | 1.00   | 0.99   | 0.59   | 0.86   |
| bank_2      | 0.99   | 1.00   | 0.59   | 0.87   |
| bank_3      | 0.59   | 0.59   | 1.00   | 0.62   |
| bank_4      | 0.86   | 0.87   | 0.62   | 1.00   |

The results make sense: sentences 1, 2, and 4 use 'bank' as a financial institution, so their embeddings are close to each other, while sentence 3 uses 'bank' as a verb (to bank into a turn), so its embedding is noticeably less similar to the other three.
"""
################################################################################
# TODO: Fill in your codes # #
################################################################################
bank_sentences = ["I need to bring my money to the bank today",
                  "I will need to bring my money to the bank tomorrow",
                  "I had to bank into a turn",
                  "The bank teller was very nice"]
for bank in bank_sentences:
    print(bert_tokenizer.tokenize(bank))
bert_inputs2 = bert_tokenizer(bank_sentences,
padding=True,
return_tensors='tf')
bert_inputs2
bert_outputs2 = bert_model(bert_inputs2)
print('shape of first output: \t\t', bert_outputs2[0].shape)
print('shape of second output: \t', bert_outputs2[1].shape)
bert_outputs2[0]
bert_outputs2.last_hidden_state
bank_1 = bert_outputs2[0][0, 9]
bank_2 = bert_outputs2[0][1, 10]
bank_3 = bert_outputs2[0][2, 4]
bank_4 = bert_outputs2[0][3, 2]
banks = [bank_1, bank_2, bank_3, bank_4]
cosine_similarities(banks)
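"""The token positions above were read off the printed tokenizations by hand. Below is a small hypothetical helper (a hedged sketch, not part of the handout) that finds the position automatically; it assumes the target word is not split into sub-word pieces, which holds for 'bank' with this tokenizer. The +1 accounts for the leading [CLS] token."""
def token_position(sentence, target, tokenizer=bert_tokenizer):
    # Position of `target` in the model input, offset by 1 for [CLS]
    return tokenizer.tokenize(sentence).index(target) + 1

for i, sen in enumerate(bank_sentences):
    print(f'bank_{i + 1} is at position {token_position(sen, "bank")}')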
"""## Part III Text classification
In this part, you will build text classifiers that try to infer whether tweets from [@realDonaldTrump](https://twitter.com/realDonaldTrump) were written by Trump himself or by a staff person.
This is an example of binary classification on a text dataset.
It is known that Donald Trump uses an Android phone, and it has been observed that some of his tweets come from Android while others come from other devices (most commonly iPhone). It is widely believed that Android tweets are written by Trump himself, while iPhone tweets are written by other staff. For more information, you can read this [blog post by David Robinson](http://varianceexplained.org/r/trump-tweets/), written prior to the 2016 election, which finds a number of differences in the style and timing of tweets published under these two devices. (Some tweets are written from other devices, but for simplicity the dataset for this assignment is restricted to these two.)
This is a classification task known as "authorship attribution", which is the task of inferring the author of a document when the authorship is unknown. We will see how accurately this can be done with linear classifiers using word features.
You might find it familiar: Yes! We are using the same data set as your homework 2 from MSBC 5180.
### Tasks
In this section, you will build two text classifiers: one with a traditional machine learning method that you studied in MSBC.5190 and one with a deep learning method.
* For the first classifier, you can use any non-deep learning based methods. You can use your solution to Homework 2 of MSBC 5180 here.
* For the second classifier, you may try the following methods
* Fine-tune BERT (similar to our lab 4 Fine-tune BERT for Sentiment Analysis)
* Use pre-trained word embedding (useful to check: https://keras.io/examples/nlp/pretrained_word_embeddings/)
* Train a deep neural network (e.g., RNN, Bi-LSTM) from scratch (similar to notebooks from our textbook: https://github.com/the-deep-learners/deep-learning-illustrated/blob/master/notebooks/bi_lstm_sentiment_classifier.ipynb)
You may want to split the current training data into training and validation sets to help with model selection (a minimal sketch appears after the data is loaded below).
### Load the Data Set
#### Sample code to load raw text ####
Please download `tweets.train.tsv` and `tweets.test.tsv` from Canvas (Module Assignment) and upload them to Google Colab (or place them in your Google Drive, as the code below assumes). Here we load the raw text data into text_train and text_test.
"""
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Mount Google Drive first so the data files below are accessible
from google.colab import drive
drive.mount('/content/drive')

# training set
df_train = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.train.tsv', sep='\t', header=None)
text_train = df_train.iloc[0:, 1].values.tolist()
Y_train = df_train.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])

# test set
df_test = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.test.tsv', sep='\t', header=None)
text_test = df_test.iloc[0:, 1].values.tolist()
Y_test = df_test.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])
"""Let us take a quick look of some training examples"""
text_train[:5]
y_train[:5]
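"""If you want a held-out validation set for model selection (as suggested in the Tasks section), here is a minimal sketch using scikit-learn's train_test_split; the 10% split, stratification, and random seed are arbitrary choices, not part of the handout."""
from sklearn.model_selection import train_test_split

# Hold out 10% of the training tweets for validation (stratified by label)
text_tr, text_val, y_tr, y_val = train_test_split(
    text_train, y_train, test_size=0.1, stratify=y_train, random_state=42)
print(len(text_tr), len(text_val))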
"""#### Sample code to preprocess data for BERT ####
The pre-processing step is similar to Lab 4.
Feel free to skip it if you want to preprocess the data differently or use methods other than BERT.
"""
# The longest tweet in the data is 75 tokens, so we use that as max_length
max_length = 75
x_train = bert_tokenizer(text_train,
                         max_length=max_length,
                         truncation=True,
                         padding='max_length',
                         return_tensors='tf')
y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])
x_test = bert_tokenizer(text_test,
                        max_length=max_length,
                        truncation=True,
                        padding='max_length',
                        return_tensors='tf')
y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])
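"""A quick optional check of the claim above about tweet lengths (a hedged sketch; assumes text_train and bert_tokenizer are already defined):"""
# Longest training tweet measured in BERT word pieces
lengths = [len(bert_tokenizer.tokenize(t)) for t in text_train]
print('max tokenized length =', max(lengths))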
"""### Your Solution 1: A traditional machine learning approach (30%)
Please implement your text classifier using a traditional machine learning method.
**Inline Question #1:**
- What machine learning model did you use?
- What are the features used in this model?
- What is the model's performance in the test data?
**Your Answer:**
We used a Multinomial Naive Bayes classifier for this text classification task.
The features are bag-of-words token counts: scikit-learn's CountVectorizer converts each tweet into a vector of word counts, which is the input to the Naive Bayes classifier.
The accuracy on the test data is 0.859.
"""
################################################################################
# TODO: Fill in your codes #
################################################################################
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# training set
df_train = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.train.tsv', sep='\t', header=None)
text_train = df_train.iloc[0:, 1].values.tolist()
Y_train = df_train.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_train = np.array([1 if v == 'Android' else 0 for v in Y_train])
# test set
df_test = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.test.tsv', sep='\t', header=None)
text_test = df_test.iloc[0:, 1].values.tolist()
Y_test = df_test.iloc[0:, 0].values
# convert to binary labels (0 and 1)
y_test = np.array([1 if v == 'Android' else 0 for v in Y_test])
# vectorize text data
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)
# train Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)
# make predictions on test set
y_pred = clf.predict(X_test)
# calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
"""### Your Solution 2: A deep learning apporach (30%)
Please implement your text classifier using a deep learning method
**Inline Question #1:**
- What deep learning model did you use?
- Please briefly explain the input, output, and layers (e.g., what does each layer do) of your model.
- What is the model's performance in the test data?
- Is it better or worse than Solution 1? What might be the cause?
**Your Answer:**
We trained an LSTM classifier from scratch in Keras. The input is each tweet encoded as a padded sequence of word indices; an Embedding layer maps each index to a 100-dimensional vector, an LSTM layer with 128 units (plus dropout) reads the sequence and summarizes it into a single vector, and a final Dense layer with a sigmoid activation outputs the probability that the tweet was posted from Android.
The model's accuracy on the test data was 0.82, which is lower than Solution 1's 0.859.
A potential cause is that a deep model trained from scratch on a relatively small set of short tweets tends to overfit, whereas the bag-of-words Naive Bayes model is a strong baseline for this kind of authorship attribution task.
"""
################################################################################
# TODO: Fill in your codes #
################################################################################
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.callbacks import EarlyStopping
import pandas as pd
import numpy as np
df_train = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.train.tsv', sep='\t', header=None)
df_test = pd.read_csv('/content/drive/MyDrive/ModernAI/tweets.test.tsv', sep='\t', header=None)
# Set hyperparameters
max_nb_words = 50000
max_sequence_length = 250
epochs = 5
batch_size = 64
embedding_dim = 100
tokenizer = Tokenizer(num_words=max_nb_words, lower=True, oov_token='<OOV>')
tokenizer.fit_on_texts(df_train.iloc[:,1].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X_train = tokenizer.texts_to_sequences(df_train.iloc[:,1].values)
X_train = pad_sequences(X_train, maxlen=max_sequence_length)
y_train = np.array([1 if v == 'Android' else 0 for v in df_train.iloc[:,0].values])
X_test = tokenizer.texts_to_sequences(df_test.iloc[:,1].values)
X_test = pad_sequences(X_test, maxlen=max_sequence_length)
y_test = np.array([1 if v == 'Android' else 0 for v in df_test.iloc[:,0].values])
# Model
model = Sequential()
model.add(Embedding(max_nb_words, embedding_dim, input_length=X_train.shape[1]))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
# Train
early_stopping = EarlyStopping(monitor='val_loss', patience=2, verbose=0, mode='auto')
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1, callbacks=[early_stopping])
# Accuracy
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
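"""One variation suggested in the Tasks section is to initialize the Embedding layer with pre-trained word vectors instead of training it from scratch. Below is a minimal hedged sketch using the GloVe vectors loaded in Part I (it assumes glove_model, tokenizer, word_index, and the hyperparameters above are still in memory, and that embedding_dim matches the 100-dimensional GloVe model); `glove_lstm` is our own variable name."""
# Build an embedding matrix: row i holds the GloVe vector of the word with index i
num_words = min(max_nb_words, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i < num_words and word in glove_model:
        embedding_matrix[i] = glove_model[word]

# Same architecture as above, but with frozen, GloVe-initialized embeddings
glove_lstm = Sequential()
glove_lstm.add(Embedding(num_words, embedding_dim,
                         weights=[embedding_matrix],
                         input_length=max_sequence_length,
                         trainable=False))
glove_lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
glove_lstm.add(Dense(1, activation='sigmoid'))
glove_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
glove_lstm.summary()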
"""## Final note (10%)
Similar to Homework 1, 10% of the total grade is allocated based on model performance. Teams with higher performance scores (max of Solution 1 and Solution 2) get a higher grade.
"""
def create_classification_model(hidden_size=200,
                                train_layers=-1,
                                optimizer=tf.keras.optimizers.Adam()):
    """
    Build a simple classification model with BERT. Let's keep it simple and
    not add dropout, layer norms, etc.
    """
    input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32,
                                      name='input_ids_layer')
    token_type_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32,
                                           name='token_type_ids_layer')
    attention_mask = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32,
                                           name='attention_mask_layer')
    bert_inputs = {'input_ids': input_ids,
                   'token_type_ids': token_type_ids,
                   'attention_mask': attention_mask}

    # Restrict training to the train_layers outermost transformer layers
    # (train_layers == -1 means fine-tune everything)
    if train_layers != -1:
        retrain_layers = []
        for retrain_layer_number in range(train_layers):
            layer_code = '_' + str(11 - retrain_layer_number)
            retrain_layers.append(layer_code)
        for w in bert_model.weights:
            if not any([x in w.name for x in retrain_layers]):
                w._trainable = False

    bert_out = bert_model(bert_inputs)
    # Take the [CLS] vector (first position of the last hidden state)
    classification_token = tf.keras.layers.Lambda(
        lambda x: x[:, 0, :], name='get_first_vector')(bert_out[0])
    hidden = tf.keras.layers.Dense(
        hidden_size, name='hidden_layer')(classification_token)
    classification = tf.keras.layers.Dense(
        1, activation='sigmoid', name='classification_layer')(hidden)

    classification_model = tf.keras.Model(
        inputs=[input_ids, token_type_ids, attention_mask],
        outputs=[classification])
    classification_model.compile(
        optimizer=optimizer,
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
        metrics='accuracy')
    return classification_model
classification_model = create_classification_model()
classification_model.fit([x_train.input_ids,
x_train.token_type_ids,
x_train.attention_mask],
y_train,
validation_data=([x_test.input_ids,
x_test.token_type_ids,
x_test.attention_mask], y_test),
epochs=5,
batch_size=8)
classification_model.predict([x_train.input_ids,
x_train.token_type_ids,
x_train.attention_mask],
batch_size=8,
steps=2)
try:
    del classification_model
except:
    pass
try:
    del bert_model
except:
    pass
tf.keras.backend.clear_session()
bert_model = TFBertModel.from_pretrained('bert-base-cased')
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased",
num_labels=2)
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=tf.metrics.SparseCategoricalAccuracy(),
)
# Pass the tokenizer output as a dict so inputs are matched by name
# rather than by position
model.fit(dict(x_train),
          y_train,
          validation_data=(dict(x_test), y_test),
          epochs=3,
          batch_size=8)
model_alternative = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=1)
model_alternative.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    # num_labels=1 gives one raw logit per example, so compute the loss from
    # logits and threshold accuracy at 0 instead of 0.5
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)]
)
# Pass the tokenizer output as a dict so inputs are matched by name
model_alternative.fit(dict(x_train),
                      y_train,
                      validation_data=(dict(x_test), y_test),
                      epochs=3,
                      batch_size=8)
y_train