Unstructured data

Unstructured data is data that does not follow a predefined schema, such as the rows and columns of a table. Typical examples:

  1. images
  2. text
  3. audio
  4. video

Regardless of the type, we process all of it as tensors (multidimensional arrays). This makes ML models and neural networks a natural choice for analyzing unstructured data.
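
To make the tensor view concrete, the sketch below builds NumPy arrays whose shapes stand in for the data types listed above. All sizes (28 x 28, 224 x 224, 16 kHz, 10 frames) and the token indices are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
gray_image = np.zeros((28, 28))          # height x width
rgb_image = np.zeros((224, 224, 3))      # height x width x channels
audio = np.zeros(16_000)                 # 1 second of mono audio at 16 kHz
video = np.zeros((10, 64, 64, 3))        # frames x height x width x channels
token_ids = np.array([12, 845, 3, 77])   # a sentence as a sequence of word indices

tensors = {
    "gray_image": gray_image,
    "rgb_image": rgb_image,
    "audio": audio,
    "video": video,
    "token_ids": token_ids,
}
for name, tensor in tensors.items():
    print(f"{name}: ndim={tensor.ndim}, shape={tensor.shape}")
```

Only the number of axes and their meaning change between data types; the container stays the same.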

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", palette="husl")

# 2-D picture, 28 x 28 pixels
picture_2d = np.random.uniform(size=(28, 28))
picture_2d[:5, :5]  # preview the top-left 5 x 5 corner
array([[0.75586744, 0.84583516, 0.31174781, 0.56986022, 0.54521732],
       [0.4509245 , 0.12609184, 0.5827946 , 0.4015805 , 0.90850987],
       [0.59563304, 0.52120978, 0.7381324 , 0.73338458, 0.69635556],
       [0.91483631, 0.03408766, 0.58924087, 0.74936144, 0.45536234],
       [0.24416936, 0.95129512, 0.31760295, 0.87146642, 0.59493202]])
plt.imshow(picture_2d, interpolation='nearest')

Working with images - PyTorch

import urllib.request
url = 'https://pytorch.tips/coffee'
fpath = 'coffee.jpg'
# download to disk
urllib.request.urlretrieve(url, fpath)
('coffee.jpg', <http.client.HTTPMessage at 0xffff578a6c10>)
import matplotlib.pyplot as plt
from PIL import Image # pillow library
img = Image.open('coffee.jpg')

A pretrained model for image classification

import torch
from torchvision import transforms

We slightly change the properties of the image

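transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # gives the 224 x 224 size reported below
    transforms.ToTensor(),       # PIL image -> float tensor with values in [0, 1]
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225])])
img_tensor = transform(img)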
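
The `mean` and `std` lists hold one value per colour channel. Assuming they are meant for per-channel normalization (the behaviour of torchvision's `transforms.Normalize`), the arithmetic can be sketched in NumPy. The 3 x 4 x 4 random image is a hypothetical stand-in for a real picture.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Hypothetical image in channels-first layout (3, H, W) with values in [0, 1]
rng = np.random.default_rng(0)
image = rng.uniform(size=(3, 4, 4))

# Broadcast the per-channel statistics over height and width
normalized = (image - mean[:, None, None]) / std[:, None, None]

# Inverting the operation recovers the original pixel values
restored = normalized * std[:, None, None] + mean[:, None, None]
print(normalized.shape, np.allclose(restored, image))
```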

Let's check the dimensions

print(type(img_tensor), img_tensor.shape)
<class 'torch.Tensor'> torch.Size([3, 224, 224])
# add a batch dimension (an extra leading axis that indexes images in a batch)
batch = img_tensor.unsqueeze(0)
print(batch.shape)
torch.Size([1, 3, 224, 224])

Let's load a pretrained model

from torchvision import models
# `pretrained=True` is deprecated since torchvision 0.13; the weights enum below is its equivalent
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /home/jovyan/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100%|██████████| 233M/233M [00:09<00:00, 24.6MB/s] 

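Let's write device-agnostic code that you can run on both GPU and CPU

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # the model must live on the same device as its input
model.eval()              # inference mode: disables dropout
y = model(batch.to(device))
print(y.shape)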
torch.Size([1, 1000])
y_max, index = torch.max(y,1)
print(index, y_max)
tensor([967]) tensor([22.8618], grad_fn=<MaxBackward0>)
url = 'https://pytorch.tips/imagenet-labels'
fpath = 'imagenet_class_labels.txt'
urllib.request.urlretrieve(url, fpath)
('imagenet_class_labels.txt', <http.client.HTTPMessage at 0xffff433d4150>)
with open('imagenet_class_labels.txt') as f:
    classes = [line.strip() for line in f.readlines()]
967: 'espresso',
prob = torch.nn.functional.softmax(y, dim=1)[0] *100
tensor(87.9955, grad_fn=<MaxBackward1>)
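
The model returns raw scores (logits), and softmax turns them into probabilities. A minimal NumPy sketch of that step follows; the four logits are made-up values, not outputs of AlexNet.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for a 4-class problem
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)
best = int(np.argmax(probs))
print(best, probs.round(4), probs.sum())
```

Softmax preserves the ordering of the scores, so the index of the largest logit and of the largest probability is the same. It only rescales the values to make them readable as percentages.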

More images

import tensorflow as tf
from tensorflow import keras

fashion_mnist = keras.datasets.fashion_mnist # 60,000 training images, 28 x 28
(x_train_f, y_train_f),(x_test,y_test) = fashion_mnist.load_data()
import numpy as np
indexes = np.random.randint(0, x_train_f.shape[0], size=25)
images = x_train_f[indexes]
for i in range(len(indexes)):
    plt.subplot(5, 5,i+1)
    image = images[i]
    plt.imshow(image, cmap='gray')


x_train_f.shape, y_train_f.shape
((60000, 28, 28), (60000,))
x_valid, x_train = x_train_f[:5000]/255.0, x_train_f[5000:]/255.0
y_valid, y_train = y_train_f[:5000], y_train_f[5000:]

An example neural network model (without convolutions) - do you think this is a good solution?

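model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))  # 28 x 28 image -> 784-element vector
model.add(keras.layers.Dense(128, activation=tf.nn.relu))
model.add(keras.layers.Dense(10, activation=tf.nn.softmax))
model.summary()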
Model: "sequential"
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
 dense (Dense)               (None, 128)               100480    
 dense_1 (Dense)             (None, 10)                1290      
Total params: 101770 (397.54 KB)
Trainable params: 101770 (397.54 KB)
Non-trainable params: 0 (0.00 Byte)
model.layers # dostęp do warstw modelu
[<keras.src.layers.reshaping.flatten.Flatten at 0xfffed877fa90>,
 <keras.src.layers.core.dense.Dense at 0xffffa0703510>,
 <keras.src.layers.core.dense.Dense at 0xffff433add90>]
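model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Note: this call trains on the unscaled full set (x_train_f, y_train_f) but validates
# on the scaled x_valid, which is consistent with the low val_accuracy in the log below;
# pass x_train, y_train instead to train on the scaled split.
history = model.fit(x_train_f, y_train_f, epochs=5, validation_data = (x_valid,y_valid))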
Epoch 1/5
1875/1875 [==============================] - 4s 2ms/step - loss: 2.9354 - accuracy: 0.7053 - val_loss: 2.2067 - val_accuracy: 0.1926
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.6528 - accuracy: 0.7774 - val_loss: 2.2600 - val_accuracy: 0.1902
Epoch 3/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5747 - accuracy: 0.8042 - val_loss: 2.2981 - val_accuracy: 0.1186
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5401 - accuracy: 0.8168 - val_loss: 2.3316 - val_accuracy: 0.1016
Epoch 5/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.5136 - accuracy: 0.8221 - val_loss: 2.3343 - val_accuracy: 0.1118
import pandas as pd
import matplotlib.pyplot as plt


313/313 [==============================] - 0s 764us/step - loss: 0.5851 - accuracy: 0.8057
[0.585058331489563, 0.8057000041007996]
x_new = x_test[:3]
y_pr = model.predict(x_new)
1/1 [==============================] - 0s 83ms/step
array([[0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 4.880e-02,
        0.000e+00, 1.430e-01, 0.000e+00, 8.082e-01],
       [1.000e-02, 0.000e+00, 3.642e-01, 2.000e-04, 5.717e-01, 0.000e+00,
        5.390e-02, 0.000e+00, 0.000e+00, 0.000e+00],
       [0.000e+00, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00,
        0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]], dtype=float32)

What other networks and layers can we use to analyze unstructured data?

Find the answer to this question in the Keras library documentation.
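
One family of layers you will come across there is convolutional layers (for example `Conv2D`). As a rough sketch of the operation such a layer applies, the NumPy code below slides a small kernel over an image. The 6 x 6 image, the edge-detecting kernel and the helper name `conv2d_valid` are assumptions made for illustration; a real layer learns its kernels during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 6 x 6 image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A simple vertical-edge detector
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
print(feature_map)
```

The feature map is non-zero only where the window straddles the dark-to-bright edge. A dense layer on flattened pixels cannot express this kind of local, position-independent pattern as compactly.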

JSON format

Create and manage JSON documents together with a MongoDB database. The database is available as a separate microservice in Docker. Before connecting, check how the MongoDB service is configured in the docker-compose.yml file (user and password).

import json
person = '{"name": "Alice", "languages": ["English", "French"]}'
person_dict = json.loads(person)

{'name': 'Alice', 'languages': ['English', 'French']}
%%file test.json
{"name": "Alice", "languages": ["English", "French"]}
Writing test.json
with open('test.json') as f:
    data = json.load(f)

{'name': 'Alice', 'languages': ['English', 'French']}
with open('person.json', 'w') as json_file:
    json.dump(person_dict, json_file)
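
The four `json` calls used above come in pairs: `loads`/`dumps` work on strings, `load`/`dump` on files. A short round-trip sketch, assuming a temporary directory so that no files from the notebook are overwritten:

```python
import json
import os
import tempfile

person_dict = {"name": "Alice", "languages": ["English", "French"]}

# Serialize to a string; indent and sort_keys make the output easier to read
text = json.dumps(person_dict, indent=2, sort_keys=True)
print(text)

# Write to a temporary file and read it back
path = os.path.join(tempfile.mkdtemp(), "person.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(person_dict, f)
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == person_dict)
```
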
# we use the pymongo library to connect
!pip install pymongo -q --user
from pymongo import MongoClient
uri = "mongodb://root:admin@mongo"
client = MongoClient(uri)
db = client['school']
students = db.students
new_students = [
    {'name': 'John', 'surname': 'Smith', 'group': '1A', 'age': 22, 'skills': ['drawing', 'skiing']},
    {'name': 'Mike', 'surname': 'Jones', 'group': '1B', 'age': 24, 'skills': ['chess', 'swimming']},
    {'name': 'Diana', 'surname': 'Williams', 'group': '2A', 'age': 28, 'skills': ['curling', 'swimming']},
    {'name': 'Samantha', 'surname': 'Brown', 'group': '1B', 'age': 21, 'skills': ['guitar', 'singing']}
]
students.insert_many(new_students)
InsertManyResult([ObjectId('66362867602f731cf8df3a3a'), ObjectId('66362867602f731cf8df3a3b'), ObjectId('66362867602f731cf8df3a3c'), ObjectId('66362867602f731cf8df3a3d')], acknowledged=True)
{'_id': ObjectId('66362867602f731cf8df3a3a'),
 'name': 'John',
 'surname': 'Smith',
 'group': '1A',
 'age': 22,
 'skills': ['drawing', 'skiing']}

Find other methods that implement the equivalent of SELECT * FROM table WHERE ...
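
In MongoDB the WHERE clause is written as a filter document. With pymongo the same dictionaries would be passed to `students.find(...)`, for example `students.find({"age": {"$gt": 22}})`. Because that needs a running database, the sketch below only imitates the matching rules in pure Python for a small subset of operators. The `matches` and `find` helpers are hypothetical and are not how MongoDB is implemented.

```python
def matches(doc, query):
    """Check a document against a small subset of MongoDB's filter syntax."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 22}
            for op, arg in cond.items():
                if op == "$gt" and not (value is not None and value > arg):
                    return False
                if op == "$lt" and not (value is not None and value < arg):
                    return False
                if op == "$in" and value not in arg:
                    return False
        elif isinstance(value, list):  # equality against an array field
            if cond not in value:
                return False
        elif value != cond:
            return False
    return True

def find(docs, query):
    """Return the names of all documents that satisfy the query."""
    return [d["name"] for d in docs if matches(d, query)]

students_docs = [
    {"name": "John", "group": "1A", "age": 22, "skills": ["drawing", "skiing"]},
    {"name": "Mike", "group": "1B", "age": 24, "skills": ["chess", "swimming"]},
    {"name": "Diana", "group": "2A", "age": 28, "skills": ["curling", "swimming"]},
    {"name": "Samantha", "group": "1B", "age": 21, "skills": ["guitar", "singing"]},
]

print(find(students_docs, {"group": "1B"}))             # WHERE group = '1B'
print(find(students_docs, {"age": {"$gt": 22}}))        # WHERE age > 22
print(find(students_docs, {"skills": "swimming"}))      # array contains 'swimming'
print(find(students_docs, {"group": {"$in": ["1A", "2A"]}, "age": {"$lt": 25}}))
```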

Text and the BoW (bag-of-words) model

import pandas as pd
df_train = pd.read_csv("train.csv")
df_train = df_train.drop("index", axis=1)
                                                text  label
0  When we started watching this series on cable,...      1
1  Steve Biko was a black activist who tried to r...      1
2  My short comment for this flick is go pick it ...      1
3  As a serious horror fan, I get that certain ma...      0
4  Robert Cummings, Laraine Day and Jean Muir sta...      1
[17452 17548]
# BoW model - vectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(lowercase=True, max_features=10_000, stop_words="english")
cv.fit(df_train["text"])  # learn the vocabulary from the training texts
CountVectorizer(max_features=10000, stop_words='english')
# the vocabulary (word -> column index), i.e. our features
cv.vocabulary_
{'started': 8515,
 'watching': 9725,
 'series': 7957,
 'cable': 1320,
 'idea': 4488,
 'hate': 4191,
 'character': 1544,
 'hold': 4339,
 'beautifully': 892,
 'developed': 2574,
 'understand': 9375,
 'react': 7196,
 'frustration': 3737,
 'fear': 3439,
 'greed': 4020,
 'temptation': 8974,
 'way': 9736,
 'viewer': 9574,
 'experiencing': 3280,
 'christopher': 1656,
 'learning': 5199,
 'br': 1151,
 'abuse': 188,
 'physically': 6608,
 'emotionally': 3046,
 'just': 4963,
 'read': 7199,
 'newspaper': 6088,
 'women': 9880,
 'tolerate': 9134,
 'behavior': 915,
 'dream': 2831,
 'house': 4418,
 'endless': 3074,
 'supply': 8779,
 'expensive': 3276,
 'things': 9036,
 'sure': 8791,
 'loving': 5426,
 'faithful': 3371,
 'husband': 4465,
 'maybe': 5640,
 'watch': 9719,
 'doesn': 2754,
 'matter': 5630,
 'times': 9104,
 'episode': 3140,
 'missed': 5813,
 'episodes': 3141,
 'sequence': 7950,
 'season': 7869,
 'late': 5151,
 'night': 6101,
 'commercials': 1874,
 'language': 5133,
 'reruns': 7427,
 'movie': 5938,
 'network': 6077,
 've': 9529,
 'totally': 9171,
 'spoiled': 8437,
 'love': 5420,
 'neck': 6044,
 'favorite': 3431,
 'johnny': 4906,
 'boy': 1144,
 'entered': 3112,
 'family': 3386,
 'sign': 8134,
 'life': 5270,
 'ends': 3076,
 'collected': 1816,
 'dvd': 2910,
 'collection': 1818,
 'steve': 8566,
 'biko': 984,
 'black': 1014,
 'tried': 9269,
 'resist': 7440,
 'white': 9800,
 'minority': 5793,
 'south': 8365,
 'africa': 322,
 'gandhi': 3787,
 'british': 1209,
 'empire': 3054,
 'india': 4604,
 'richard': 7523,
 'attenborough': 701,
 'film': 3509,
 'freedom': 3707,
 'donald': 2773,
 'woods': 9894,
 'liberal': 5260,
 'editor': 2967,
 'trying': 9302,
 'tell': 8966,
 'story': 8605,
 'jarring': 4858,
 'point': 6709,
 'view': 9572,
 'switch': 8855,
 'dies': 2609,
 'prison': 6908,
 'hands': 4136,
 'african': 323,
 'police': 6725,
 'played': 6677,
 'kevin': 5007,
 'kline': 5057,
 'choose': 1632,
 'right': 7544,
 'thing': 9035,
 'flee': 3575,
 'country': 2133,
 'books': 1102,
 'allow': 405,
 'wife': 9816,
 'penelope': 6526,
 'pressure': 6873,
 'forgetting': 3650,
 'case': 1443,
 'vain': 9499,
 'begins': 908,
 'changing': 1535,
 'friendship': 3723,
 'standard': 8495,
 'numbers': 6182,
 'escape': 3162,
 'border': 1109,
 'yarn': 9958,
 'death': 2390,
 'oscar': 6320,
 'nominated': 6124,
 'denzel': 2497,
 'washington': 9713,
 'good': 3942,
 'fourth': 3675,
 'wrong': 9944,
 'tries': 9271,
 'depict': 2505,
 'struggles': 8653,
 'focusing': 3610,
 'trials': 9261,
 'half': 4117,
 'served': 7963,
 'topic': 9155,
 'better': 965,
 'rise': 7561,
 'instead': 4695,
 'beginning': 907,
 'actor': 241,
 'leading': 5190,
 'role': 7609,
 'hour': 4416,
 'wasn': 9714,
 'exactly': 3220,
 'big': 975,
 'box': 1140,
 'office': 6244,
 'tremendous': 9255,
 'flop': 3594,
 'politics': 6736,
 'aside': 639,
 'entertains': 3121,
 'sends': 7922,
 'message': 5722,
 'albeit': 372,
 'pg': 6580,
 'fashion': 3412,
 'stars': 8512,
 'short': 8087,
 'comment': 1866,
 'flick': 3583,
 'pick': 6613,
 'chances': 1529,
 'going': 3930,
 'positively': 6779,
 'surprised': 8801,
 'diversity': 2732,
 'elements': 3010,
 'superbly': 8768,
 'explored': 3304,
 'criminal': 2209,
 'thriller': 9066,
 'claiming': 1697,
 'pushing': 7062,
 'room': 7630,
 'possible': 6788,
 'wont': 9890,
 'push': 7059,
 'nerves': 6071,
 'edge': 2957,
 'thumbs': 9081,
 'horror': 4403,
 'fan': 3388,
 'certain': 1509,
 'marketing': 5572,
 'used': 9480,
 'sell': 7910,
 'movies': 5940,
 'especially': 3166,
 'really': 7222,
 'bad': 789,
 'ones': 6268,
 'wouldn': 9922,
 'assumed': 667,
 'ripping': 7559,
 'cannibal': 1365,
 'zombi': 9993,
 'jungle': 4957,
 'holocaust': 4352,
 'unfortunately': 9405,
 'completely': 1916,
 'hardcore': 4160,
 'realized': 7219,
 'saw': 7777,
 'odd': 6229,
 'actual': 246,
 'minor': 5792,
 'warning': 9701,
 'notice': 6156,
 'daring': 2344,
 'catch': 1461,
 'group': 4057,
 'scientists': 7822,
 'pretty': 6880,
 'led': 5207,
 'sea': 7858,
 'captain': 1383,
 'penchant': 6525,
 'beach': 876,
 'search': 7864,
 'mutated': 5982,
 'native': 6021,
 'killing': 5029,
 'villagers': 9586,
 'nuclear': 6178,
 'bomb': 1084,
 'supposedly': 8788,
 'island': 4814,
 'radiation': 7113,
 'turned': 9315,
 'man': 5521,
 'rapist': 7170,
 'killer': 5027,
 'writer': 9939,
 'george': 3845,
 'succeeds': 8717,
 'keeping': 4994,
 'clothes': 1767,
 'sex': 7984,
 'scenes': 7802,
 'whacked': 9787,
 'walk': 9661,
 'nude': 6179,
 'strange': 8613,
 'asks': 643,
 'rape': 7164,
 'turns': 9318,
 'chicks': 1610,
 'slapping': 8222,
 'naturally': 6024,
 'scene': 7800,
 'chick': 1608,
 'toss': 9168,
 'finger': 3530,
 'know': 5071,
 'rest': 7458,
 'insane': 4667,
 'oh': 6251,
 'kidding': 5017,
 'ton': 9140,
 'like': 5287,
 'pays': 6507,
 'guys': 4095,
 'tag': 8887,
 'team': 8940,
 'taking': 8894,
 'use': 9479,
 'cuts': 2293,
 'refuses': 7296,
 'advances': 295,
 'starts': 8519,
 'crying': 2254,
 'gentleman': 3839,
 'reluctantly': 7349,
 'lets': 5248,
 'pleasure': 6688,
 'crew': 2204,
 'members': 5690,
 'honestly': 4368,
 'waiting': 9655,
 'pizza': 6649,
 'guy': 4094,
 'ask': 640,
 'pay': 6503,
 'happens': 4152,
 'conduct': 1953,
 'research': 7431,
 'wait': 9653,
 'thought': 9050,
 'zombie': 9994,
 'enter': 3111,
 'mark': 5568,
 'time': 9101,
 'plenty': 6690,
 'hitting': 4330,
 'fast': 3416,
 'forward': 3670,
 'splatter': 8430,
 'porn': 6756,
 'don': 2772,
 'think': 9037,
 'does': 2753,
 'justice': 4964,
 'guess': 4072,
 'woman': 9879,
 'talking': 8905,
 'say': 7779,
 'plot': 6694,
 'hairy': 4114,
 'funny': 3759,
 'worked': 9900,
 'decent': 2404,
 'atomic': 687,
 'bombing': 1086,
 'bitter': 1011,
 'shakes': 8006,
 'head': 4213,
 'walks': 9666,
 'away': 758,
 'couple': 2138,
 'makes': 5510,
 'wonder': 9882,
 'disgusted': 2693,
 'feel': 3454,
 'sound': 8355,
 'quality': 7073,
 'guessed': 4073,
 'production': 6929,
 'shot': 8092,
 'including': 4587,
 'erotic': 3157,
 'nights': 6106,
 'living': 5344,
 'dead': 2377,
 'sports': 8450,
 'cast': 1453,
 'said': 7716,
 'wanted': 9685,
 'vacation': 9495,
 'paycheck': 6504,
 'suddenly': 8728,
 'weird': 9767,
 'speaking': 8388,
 'italian': 4824,
 'recorded': 7255,
 'english': 3092,
 'dialogue': 2591,
 'people': 6532,
 'clearly': 1724,
 'hear': 4223,
 'background': 781,
 'yes': 9970,
 'wonderful': 9884,
 'slightly': 8248,
 'amusing': 454,
 'score': 7827,
 'couldn': 2124,
 'save': 7772,
 'sfx': 7992,
 'minimal': 5785,
 'best': 959,
 'consisted': 2006,
 'blood': 1050,
 'violent': 9596,
 'bright': 1199,
 'label': 5099,
 'cover': 2150,
 'ploy': 6698,
 'presented': 6865,
 'widescreen': 9810,
 '85': 153,
 'aspect': 645,
 'ratio': 7181,
 'watched': 9721,
 'region': 7304,
 'rated': 7176,
 'version': 9549,
 'running': 7683,
 'released': 7334,
 '2005': 104,
 'exploitation': 3298,
 'digital': 2623,
 'apparently': 538,
 'doubt': 2790,
 'different': 2616,
 'shouldn': 8097,
 '25': 116,
 '00': 0,
 'copy': 2094,
 'recommend': 7250,
 'pretend': 6876,
 'exist': 3256,
 'quote': 7098,
 'civilians': 1691,
 'luck': 5441,
 'monsters': 5879,
 'extras': 3331,
 'original': 6311,
 'trailer': 9204,
 'shots': 8094,
 'kills': 5032,
 'make': 5507,
 'look': 5387,
 'interesting': 4733,
 'trailers': 9205,
 'ss': 8467,
 'hell': 4251,
 'camp': 1346,
 'informative': 4642,
 'interview': 4750,
 'line': 5308,
 'lame': 5119,
 'porno': 6757,
 'weaker': 9741,
 'real': 7207,
 'rating': 7179,
 '10': 3,
 'molly': 5856,
 'www': 9954,
 'com': 1836,
 'robert': 7588,
 'cummings': 2268,
 'day': 2371,
 'jean': 4866,
 'star': 8503,
 'beautiful': 891,
 '1940': 31,
 'starring': 8511,
 'billie': 986,
 'burke': 1282,
 '15': 12,
 'minutes': 5797,
 'looks': 5390,
 'playboy': 6676,
 'desire': 2541,
 'sisters': 8181,
 'katherine': 4982,
 'helen': 4248,
 'likes': 5291,
 'fix': 3554,
 'cars': 1435,
 'blonde': 1048,
 'social': 8302,
 'butterfly': 1307,
 'arrives': 612,
 'town': 9182,
 'believing': 930,
 'party': 6464,
 'decides': 2409,
 'attend': 702,
 'given': 3892,
 'friend': 3720,
 'mother': 5916,
 'dress': 2837,
 'connect': 1976,
 'sees': 7899,
 'dinner': 2634,
 'left': 5210,
 'club': 1772,
 'terribly': 8995,
 'drunk': 2869,
 'ride': 7533,
 'car': 1391,
 'won': 9881,
 'let': 5245,
 'drive': 2851,
 'walking': 9665,
 'awhile': 764,
 'breaking': 1175,
 'shoe': 8077,
 'gets': 3859,
 'drives': 2856,
 'passes': 6472,
 'takes': 8893,
 'wheel': 9791,
 'accidentally': 206,
 'remember': 7364,
 'blame': 1021,
 'sister': 8180,
 'shoes': 8078,
 'plus': 6701,
 'manner': 5540,
 'realize': 7218,
 'isn': 4816,
 'telling': 8967,
 'truth': 9298,
 'convicted': 2071,
 'goes': 3929,
 'marries': 5580,
 'leaves': 5204,
 'america': 439,
 'list': 5325,
 'playing': 6681,
 'taylor': 8932,
 'mgm': 5737,
 'handsome': 4137,
 'amiable': 442,
 'dazzling': 2375,
 'actress': 243,
 'constantly': 2014,
 'didn': 2603,
 'great': 4013,
 'face': 3343,
 'voice': 9626,
 'determined': 2569,
 'sympathetic': 8871,
 'lovely': 5422,
 'lousy': 5418,
 'highly': 4296,
 'recommended': 7252,
 'little': 5337,
 'gem': 3819,
 'dark': 2345,
 'overlooked': 6359,
 'known': 5074,
 'early': 2927,
 '80': 151,
 'deserves': 2535,
 'audience': 717,
 'damn': 2320,
 'shame': 8013,
 'seen': 7898,
 'compared': 1893,
 'gotten': 3965,
 'bigger': 977,
 'years': 9964,
 'notably': 6150,
 'comparisons': 1897,
 'bit': 1003,
 'similar': 8147,
 'slipped': 8252,
 'acceptance': 198,
 'remake': 7358,
 'breathe': 1181,
 'new': 6082,
 'unless': 9426,
 'drained': 2816,
 'remakes': 7359,
 'days': 2373,
 'work': 9899,
 'lesser': 5240,
 'films': 3516,
 'awful': 761,
 'ghost': 3863,
 'ship': 8064,
 'opening': 6275,
 'falling': 3377,
 'utter': 9492,
 'crap': 2169,
 'happen': 4148,
 'fall': 3375,
 'lot': 5410,
 'haven': 4201,
 'bring': 1204,
 'course': 2143,
 'got': 3961,
 'eyes': 3338,
 'anyways': 526,
 'fans': 3392,
 'cause': 1474,
 'creepy': 2203,
 'setting': 7972,
 'fairly': 3367,
 'acting': 234,
 'campy': 1354,
 'want': 9684,
 'nudity': 6180,
 'gore': 3954,
 'sorry': 8347,
 'nonetheless': 6128,
 'solid': 8315,
 'enjoy': 3099,
 'grave': 4006,
 'robber': 7583,
 'sitting': 8188,
 'cell': 1491,
 'awaiting': 750,
 'execution': 3251,
 'visited': 9612,
 'monk': 5868,
 'wishing': 9860,
 'words': 9897,
 'horrible': 4397,
 'lead': 5187,
 'reluctant': 7348,
 'tongue': 9144,
 'drink': 2847,
 'young': 9975,
 'soon': 8334,
 'undead': 9366,
 'bump': 1275,
 'york': 9974,
 'filmed': 3510,
 'brought': 1228,
 'spirit': 8424,
 'andy': 467,
 'milligan': 5769,
 'lurking': 5457,
 'comedies': 1850,
 'come': 1845,
 'rate': 7175,
 'dominic': 2770,
 'plays': 6682,
 'arthur': 619,
 'blake': 1020,
 'ron': 7626,
 'father': 3423,
 'statement': 8523,
 'getting': 3860,
 'involved': 4786,
 'tale': 8896,
 'men': 5695,
 'having': 4202,
 'grand': 3985,
 'old': 6256,
 'shows': 8109,
 'equally': 3145,
 'music': 5972,
 'jeff': 4869,
 'grace': 3972,
 'excellent': 3230,
 'effects': 2979,
 'perfect': 6539,
 'sort': 8348,
 'silliness': 8143,
 'deal': 2381,
 'fun': 3747,
 'trouble': 9288,
 'throws': 9076,
 'net': 6075,
 'wide': 9807,
 'result': 7465,
 'needed': 6047,
 'alien': 389,
 'body': 1075,
 'mix': 5830,
 'theaters': 9018,
 'later': 5153,
 'll': 5347,
 'worth': 9918,
 'liked': 5289,
 'script': 7851,
 'changed': 1533,
 'reason': 7225,
 'rodney': 7603,
 'dangerfield': 2334,
 'jackie': 4837,
 'mason': 5605,
 'did': 2602,
 'alot': 413,
 'kept': 5004,
 'flaw': 3571,
 'dan': 2324,
 'murray': 5968,
 'carl': 1411,
 'quit': 7096,
 'job': 4898,
 'assistant': 659,
 'joined': 4909,
 'military': 5764,
 'warner': 9700,
 'bros': 1224,
 'ii': 4509,
 'try': 9301,
 'seeing': 7893,
 'possibly': 6789,
 'disappointed': 2662,
 'fact': 3349,
 'director': 2646,
 'cube': 2260,
 'comedy': 1851,
 'imdb': 4531,
 'spell': 8405,
 'word': 9896,
 'reminiscent': 7373,
 'builds': 1263,
 'slowly': 8258,
 'gradually': 3976,
 'explanation': 3291,
 'mainly': 5500,
 'set': 7970,
 'respects': 7452,
 'probably': 6915,
 'commented': 1870,
 'masterpiece': 5614,
 'spanish': 8378,
 'cinema': 1673,
 'masters': 5616,
 'piece': 6623,
 'long': 5383,
 'ago': 338,
 'midnight': 5753,
 'cowboy': 2158,
 'les': 5236,
 'du': 2872,
 'realistic': 7213,
 'non': 6127,
 'spot': 8451,
 'trainspotting': 9210,
 'hard': 4159,
 'place': 6650,
 'humour': 4446,
 'obviously': 6214,
 'dramatic': 2820,
 'sense': 7926,
 'diamond': 2593,
 'resurrection': 7470,
 'neo': 6064,
 'realism': 7211,
 'mixed': 5831,
 'ken': 4999,
 'discover': 2677,
 'modern': 5849,
 'tv': 9320,
 'classic': 1709,
 'bob': 1072,
 'girlfriend': 3889,
 'named': 6003,
 'alicia': 388,
 'married': 5579,
 'bud': 1247,
 'owen': 6375,
 'works': 9904,
 'jealous': 4864,
 'hanging': 4140,
 'hangs': 4141,
 'secretary': 7881,
 'heather': 4237,
 'accident': 204,
 'prone': 6970,
 'kind': 5035,
 'lonely': 5380,
 'wishes': 9859,
 'friends': 3722,
 'end': 3069,
 'looked': 5388,
 'finally': 3521,
 'went': 9776,
 'driving': 2857,
 'wedding': 9759,
 'making': 5512,
 'tiny': 9110,
 'stuck': 8658,
 'middle': 5750,
 'happened': 4149,
 'poor': 6744,
 'ended': 3071,
 'guide': 4078,
 'fox': 3676,
 'twice': 9323,
 'putting': 7064,
 'air': 356,
 'loved': 5421,
 'cool': 2082,
 'glasses': 3901,
 'hilarious': 4298,
 'miss': 5812,
 'reading': 7203,
 'book': 1101,
 'ending': 3072,
 'missing': 5817,
 'sad': 7707,
 'treatment': 9250,
 'subject': 8692,
 'quite': 7097,
 'controversial': 2055,
 'comments': 1872,
 'distinction': 2712,
 'based': 845,
 'believe': 926,
 'portrayed': 6768,
 'basically': 850,
 'sequels': 7949,
 '30': 122,
 'values': 9507,
 'plan': 6658,
 'outer': 6331,
 'space': 8370,
 'level': 5253,
 'glen': 3902,
 'glenda': 3903,
 'ed': 2954,
 'wood': 9892,
 'religious': 7347,
 'scary': 7796,
 'add': 257,
 'slightest': 8247,
 'actually': 247,
 'close': 1758,
 'future': 3766,
 'scarier': 7791,
 'reasons': 7229,
 'code': 1792,
 'thief': 9033,
 'explain': 3287,
 'east': 2938,
 'effect': 2976,
 'happening': 4150,
 'forget': 3647,
 'stories': 8603,
 'told': 9131,
 'god': 3921,
 'frightening': 3727,
 'wild': 9818,
 'rebels': 7234,
 'frustrating': 3736,
 'deals': 2385,
 'race': 7104,
 'driver': 2854,
 'bikers': 982,
 'called': 1331,
 'satan': 7759,
 'angels': 473,
 'hang': 4139,
 'decide': 2406,
 'rob': 7581,
 'bank': 818,
 'cops': 2093,
 'report': 7406,
 'dated': 2360,
 'carry': 1433,
 'significantly': 8139,
 'crude': 2247,
 'stupid': 8679,
 'band': 814,
 'stage': 8477,
 'performing': 6549,
 'regular': 7308,
 'generic': 3830,
 'care': 1396,
 'taken': 8891,
 'filmmaker': 3512,
 'logic': 5369,
 'direction': 2643,
 'actors': 242,
 'parts': 6463,
 'major': 5505,
 'indifferent': 4612,
 'unpredictable': 9439,
 'comes': 1852,
 'florida': 3595,
 'ho': 4334,
 'worthy': 9921,
 'mystery': 5992,
 'science': 7819,
 'theater': 9017,
 '3000': 124,
 'status': 8532,
 'commentary': 1868,
 'characters': 1550,
 'screen': 7841,
 'saying': 7780,
 'pack': 6390,
 'low': 5428,
 'expectations': 3270,
 'came': 1338,
 'months': 5887,
 'tragedy': 9200,
 'open': 6273,
 'wounds': 9925,
 'thank': 9012,
 'bravery': 1167,
 'offered': 6240,
 'closure': 1765,
 'consider': 1998,
 'hidden': 4285,
 'frontier': 3731,
 'somewhat': 8326,
 'small': 8265,
 'met': 5727,
 'counting': 2130,
 'conventions': 2062,
 '2001': 100,
 'continue': 2036,
 'impressed': 4559,
 'self': 7908,
 'studio': 8663,
 'pictures': 6620,
 'fancy': 3390,
 'writers': 9940,
 'walter': 9673,
 'aka': 363,
 'mr': 5942,
 'manage': 5522,
 'create': 2183,
 'replacing': 7403,
 'ghastly': 3861,
 'experiment': 3281,
 'enterprise': 3114,
 'successful': 8719,
 'arc': 571,
 'introduction': 4765,
 'trek': 9254,
 'openly': 6276,
 'gay': 3811,
 'corey': 2098,
 'introduced': 4762,
 'second': 7876,
 'soul': 8353,
 'mate': 5622,
 'meets': 5674,
 'officer': 6245,
 'recent': 7241,
 'lines': 5312,
 'spoiler': 8438,
 'causing': 1477,
 'change': 1532,
 'conflict': 1964,
 'relationship': 7325,
 'uncertain': 9355,
 'shown': 8108,
 'chat': 1570,
 'endure': 3077,
 'gene': 3823,
 'created': 2184,
 'intention': 4723,
 'flashy': 3566,
 'battles': 868,
 'popular': 6752,
 'previous': 6886,
 'stated': 8522,
 'wish': 9857,
 'higher': 4291,
 'suffice': 8734,
 'tradition': 9195,
 'seven': 7978,
 'generation': 3828,
 'willing': 9829,
 'bet': 960,
 'final': 3519,
 'debut': 2398,
 '1958': 50,
 'enjoyed': 3101,
 'leave': 5203,
 'sons': 8332,
 'harriet': 4176,
 'dick': 2599,
 'van': 9510,
 'lucy': 5444,
 'enjoying': 3102,
 'donna': 2774,
 'reed': 7273,
 'stone': 8590,
 'intelligent': 4714,
 'mannered': 5541,
 'problem': 6916,
 'solving': 8321,
 'stay': 8533,
 'home': 4356,
 'mom': 5857,
 'june': 4956,
 'contrast': 2045,
 'ms': 5944,
 'dad': 2304,
 'boxing': 1143,
 'teaching': 8939,
 'son': 8327,
 'defend': 2430,
 'larger': 5141,
 'bully': 1271,
 'mothers': 5917,
 'neighborhood': 6056,
 'grew': 4030,
 'idealistic': 4490,
 'standards': 8496,
 'refreshing': 7291,
 'manners': 5543,
 'decision': 2411,
 'today': 9124,
 'accepted': 199,
 'indifference': 4611,
 'neighbors': 6057,
 'imagine': 4528,
 'mary': 5599,
 'parents': 6443,
 'okay': 6254,
 'leaving': 5205,
 'dog': 2755,
 'outside': 6345,
 'acceptable': 197,
 'shut': 8114,
 'supermarket': 8775,
 'cinematography': 1678,
 'highlights': 4295,
 'true': 9292,
 'account': 216,
 '1950s': 43,
X_train = cv.transform(df_train["text"])
# to dense matrix
feat_vec = np.array(X_train[0].todense())[0]
array([9926,   67,    5,    0,    1,    0,    1])
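
To see what the vectorizer computes, here is a pure-Python sketch of the bag-of-words idea on a made-up three-document corpus. It is a simplification: unlike `CountVectorizer` it has no stop-word list, no `max_features` limit and a cruder tokenizer. The `tokenize` and `vectorize` helpers are hypothetical names.

```python
import re
from collections import Counter

# Hypothetical mini-corpus
docs = ["good movie good plot", "bad movie", "good acting bad plot"]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Vocabulary: each distinct word gets a column index (alphabetical order)
all_words = sorted({word for doc in docs for word in tokenize(doc)})
vocab = {word: idx for idx, word in enumerate(all_words)}

def vectorize(text):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in all_words]

matrix = [vectorize(doc) for doc in docs]
print(vocab)
print(matrix)
```

Each row is one document and each column one word. Word order is discarded, which is why the model is called a "bag" of words.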

Pipeline objects in modeling

import pandas as pd
import numpy as np
# example of structured data
df = pd.read_csv("students.csv")
      sex  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score  target
0  female         group B            bachelor's degree      standard                     none          72             72             74       0
1  female         group C                 some college      standard                completed          69             90             88       1
2  female         group B              master's degree      standard                     none          90             95             93       0
3    male         group A           associate's degree  free/reduced                     none          47             57             44       1
4    male         group C                 some college      standard                     none          76             78             75       0
len(df), list(df.columns)
  'parental level of education',
  'test preparation course',
  'math score',
  'reading score',
  'writing score',
X = df.drop(columns=['target'])
y = df['target']
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# INSTEAD OF PROCESSING THE DATA RIGHT AWAY: first prepare the steps - a pipeline

numeric_features = ['math score','reading score','writing score']
categorical_features = ['sex','race/ethnicity','parental level of education','lunch','test preparation course']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

categorical_transformer = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(transformers=[
    ("num_trans", numeric_transformer, numeric_features),
    ("cat_trans", categorical_transformer, categorical_features)
])

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline(steps=[
    ("preproc", preprocessor),
    ("model", LogisticRegression())
])
from sklearn import set_config
                                                  ['math score',
                                                   'reading score',
                                                   'writing score']),
                                                  ['sex', 'race/ethnicity',
                                                   'parental level of '
                                                   'test preparation '
                ('model', LogisticRegression())])
REMEMBER - a pipeline object is an ordinary Python object, so just like a model object it can be saved to a pickle file.

from sklearn.model_selection import train_test_split
X_tr, X_test, y_tr, y_test = train_test_split(X,y,
test_size=0.2, random_state=42)

pipeline.fit(X_tr, y_tr)

score = pipeline.score(X_test, y_test)
import joblib
joblib.dump(pipeline, 'your_pipeline.pkl')
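
The saved file is only useful if it can be loaded back. The sketch below shows the full round trip with `joblib.dump` and `joblib.load` on a deliberately tiny pipeline. The toy data, the step names and the temporary file path are assumptions for illustration, not the students dataset.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, two well-separated classes
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("model", LogisticRegression())])
pipe.fit(X_toy, y_toy)

path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
joblib.dump(pipe, path)

# Later (for example in a serving script) the fitted pipeline is restored in one call
restored = joblib.load(path)
print(restored.predict(np.array([[1.5], [10.5]])))
```

Because preprocessing and model travel together in one object, new raw data can be passed straight to `predict` without repeating the scaling by hand.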

THIS IS WHERE THE MAGIC OF OBJECT-ORIENTED PYTHON BEGINS - do not write and rerun the same code for every set of parameters; let Python do it for you

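# Only the keys and the values 'mean', 2, 1 and 0.1 are confirmed by the outputs below;
# the remaining values are assumptions chosen to give the 30 candidates reported there.
param_grid = [
    {"preproc__num_trans__imputer__strategy": ["mean", "median", "most_frequent"],
     "model__n_estimators": [2, 5, 10, 50, 100],
     "model__min_samples_leaf": [1, 0.1],
     "model": [RandomForestClassifier()]}
]
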
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(pipeline, param_grid,
cv=2, verbose=1, n_jobs=-1)

grid_search.fit(X_tr, y_tr)

Fitting 2 folds for each of 30 candidates, totalling 60 fits
{'model': RandomForestClassifier(min_samples_leaf=0.1, n_estimators=2),
 'model__min_samples_leaf': 0.1,
 'model__n_estimators': 2,
 'preproc__num_trans__imputer__strategy': 'mean'}
grid_search.score(X_test, y_test), grid_search.score(X_tr, y_tr)
(0.45, 0.569620253164557)
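
The "30 candidates, 60 fits" line in the log is simply the size of the Cartesian product of the parameter lists multiplied by the number of folds. The sketch below reproduces that arithmetic with `itertools.product`. The grid values are hypothetical; only the `step__substep__param` naming scheme mirrors the pipeline above.

```python
from itertools import product

# Hypothetical grid: the keys mirror the pipeline step names, the values are assumptions
param_grid = {
    "preproc__num_trans__imputer__strategy": ["mean", "median", "most_frequent"],
    "model__n_estimators": [2, 5, 10, 50, 100],
    "model__min_samples_leaf": [1, 0.1],
}

keys = list(param_grid)
candidates = [dict(zip(keys, combo)) for combo in product(*param_grid.values())]

n_folds = 2
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
print(candidates[0])
```

The double underscore walks down the nesting: `preproc__num_trans__imputer__strategy` addresses the `strategy` argument of the `imputer` step inside `num_trans` inside `preproc`.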

Now a small modification - we know that we do not want a variable like this in the model, because it has only one value. But how do you find such variables if you have 3 million columns?
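
One possible way to answer that question without inspecting columns by hand is to count distinct values per column. This is a sketch on a made-up three-row frame, not the students dataset.

```python
import pandas as pd

# Hypothetical frame with one constant column
df_toy = pd.DataFrame({
    "math score": [72, 69, 90],
    "lunch": ["standard", "standard", "free/reduced"],
    "bad_feature": [1, 1, 1],
})

# nunique() counts distinct values per column; constant columns have exactly one
n_unique = df_toy.nunique()
constant_columns = n_unique[n_unique == 1].index.tolist()
print(constant_columns)

df_clean = df_toy.drop(columns=constant_columns)
print(list(df_clean.columns))
```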

df['bad_feature'] = 1
X = df.drop(columns=['target'])
y = df['target']
X_tr, X_test, y_tr, y_test = train_test_split(X,y,
test_size=0.2, random_state=42)
numeric_features = ['math score','reading score','writing score', 'bad_feature']
# find a way to split the variables into numeric and non-numeric automatically
grid_search = GridSearchCV(pipeline, param_grid,
cv=2, verbose=1, n_jobs=-1)

grid_search.fit(X_tr, y_tr)

Fitting 2 folds for each of 30 candidates, totalling 60 fits
{'model': RandomForestClassifier(n_estimators=2),
 'model__min_samples_leaf': 1,
 'model__n_estimators': 2,
 'preproc__num_trans__imputer__strategy': 'mean'}
grid_search.score(X_tr, y_tr), grid_search.score(X_test, y_test)
(0.8734177215189873, 0.45)
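
The code comment above asks for an automatic split into numeric and non-numeric variables. One option, sketched here on a hypothetical two-row frame, is pandas' `select_dtypes`.

```python
import pandas as pd

# Hypothetical frame with two text columns and two numeric columns
df_toy = pd.DataFrame({
    "sex": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math score": [72, 47],
    "reading score": [72.0, 57.0],
})

numeric_features = df_toy.select_dtypes(include="number").columns.tolist()
categorical_features = df_toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_features)
print(categorical_features)
```

The two lists can then be passed to a `ColumnTransformer` in place of hand-written column names.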


# your own transformer class

from sklearn.base import BaseEstimator, TransformerMixin

class DelOneValueFeature(BaseEstimator, TransformerMixin):
    """Drop columns that contain a single unique value."""
    def __init__(self):
        self.one_value_features = []
    def fit(self, X, y=None):
        self.one_value_features = []  # reset so that refitting does not accumulate columns
        for feature in X.columns:
            unique = X[feature].unique()
            if len(unique)==1:
                self.one_value_features.append(feature)
        return self
    def transform(self, X, y=None):
        if not self.one_value_features:
            return X
        return X.drop(columns=self.one_value_features)
pipeline2 = Pipeline([
    ("del_one_value", DelOneValueFeature()),  # step name is illustrative
    ("preprocesser", preprocessor),
    ("classifier", LogisticRegression())])
pipeline2.fit(X_tr, y_tr)
score2 = pipeline2.score(X_test, y_test)

AND THAT'S IT :)

Now see how a simple class can make life easier when working with neural network models

# example of unstructured data

import tensorflow as tf
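class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        accuracy = logs.get('accuracy')
        # the message below reads: "reached 95% - stop training"
        if accuracy is not None and accuracy > 0.95:
            print("\n osiągnięto 95% - zakończ trenowanie")
            self.model.stop_training = True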
callbacks = myCallback()
mnist = tf.keras.datasets.fashion_mnist
(tr_im, tr_lab),(te_im, te_lab) = mnist.load_data()
tr_im = tr_im/255
te_im = te_im/255

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),  # 28 x 28 image -> 784-element vector
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=['accuracy'])
model.fit(tr_im, tr_lab, epochs=40, callbacks=[callbacks])
Epoch 1/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4943 - accuracy: 0.8260
Epoch 2/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3734 - accuracy: 0.8651
Epoch 3/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3371 - accuracy: 0.8765
Epoch 4/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3115 - accuracy: 0.8851
Epoch 5/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2950 - accuracy: 0.8916
Epoch 6/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2792 - accuracy: 0.8969
Epoch 7/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2687 - accuracy: 0.8995
Epoch 8/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2562 - accuracy: 0.9045
Epoch 9/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2480 - accuracy: 0.9070
Epoch 10/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2378 - accuracy: 0.9112
Epoch 11/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2281 - accuracy: 0.9157
Epoch 12/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2222 - accuracy: 0.9168
Epoch 13/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2157 - accuracy: 0.9189
Epoch 14/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2116 - accuracy: 0.9205
Epoch 15/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.2036 - accuracy: 0.9237
Epoch 16/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1970 - accuracy: 0.9266
Epoch 17/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1928 - accuracy: 0.9280
Epoch 18/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1865 - accuracy: 0.9304
Epoch 19/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1820 - accuracy: 0.9314
Epoch 20/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1790 - accuracy: 0.9327
Epoch 21/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1750 - accuracy: 0.9335
Epoch 22/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1700 - accuracy: 0.9366
Epoch 23/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1650 - accuracy: 0.9382
Epoch 24/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1625 - accuracy: 0.9381
Epoch 25/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1587 - accuracy: 0.9399
Epoch 26/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1549 - accuracy: 0.9416
Epoch 27/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1515 - accuracy: 0.9437
Epoch 28/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1499 - accuracy: 0.9447
Epoch 29/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1445 - accuracy: 0.9457
Epoch 30/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1423 - accuracy: 0.9468
Epoch 31/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1401 - accuracy: 0.9468
Epoch 32/40
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1368 - accuracy: 0.9486
Epoch 33/40
1861/1875 [============================>.] - ETA: 0s - loss: 0.1334 - accuracy: 0.9505
 osiągnięto 95% - zakończ trenowanie
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1333 - accuracy: 0.9506
<keras.src.callbacks.History at 0xfffe90eb8410>
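
The mechanism behind the callback is small enough to sketch without any framework: the training loop calls a hook after every epoch and checks a flag the hook may set. `ThresholdStopper` is a hypothetical stand-in for the Keras callback. The first four accuracy values are copied from epochs 30-33 of the log above and the last two are made up to show that they are never reached.

```python
class ThresholdStopper:
    """Hypothetical, framework-free analogue of the callback above."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.stop_training = False

    def on_epoch_end(self, epoch, logs=None):
        accuracy = (logs or {}).get("accuracy")
        if accuracy is not None and accuracy > self.threshold:
            self.stop_training = True

# Epochs 30-33 from the log above, followed by two made-up values
accuracies = [0.9468, 0.9468, 0.9486, 0.9506, 0.9600, 0.9700]

stopper = ThresholdStopper(threshold=0.95)
epochs_run = 0
for epoch, acc in enumerate(accuracies):
    epochs_run += 1  # one "training epoch"
    stopper.on_epoch_end(epoch, {"accuracy": acc})
    if stopper.stop_training:
        break

print(epochs_run)
```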