Using the Custom Model in ThanoSQL¶

Tutorial Difficulty: ★★☆☆☆
10 min read
Languages: SQL (50%), Python (50%)
File location: tutorial_en/thanosql_ml/udm_tutorial.ipynb
References: Beans Dataset

Tutorial Introduction¶

The feature covered in this tutorial works seamlessly in paid versions of ThanoSQL.

ThanoSQL lets you upload models you have created to the ThanoSQL workspace and database and use them for prediction.

In This Tutorial

👉 This tutorial uses the Beans dataset, which consists of leaf images taken in the field in different districts of Uganda by the Makerere AI lab in collaboration with the National Crops Resources Research Institute (NaCRRI), the national body in charge of agricultural research in Uganda.

#. Prepare the Model and Dataset Using Python¶

Prepare Dataset¶

Download and Unzip Data¶

In [1]:

            
import os
from shutil import unpack_archive
from urllib.request import urlretrieve

url = "https://storage.googleapis.com/ibeans"

for split in ["train", "validation", "test"]:
    urlretrieve(f"{url}/{split}.zip", f"{split}.zip")
    unpack_archive(f"{split}.zip", ".")
    os.remove(f"{split}.zip")
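Before building the datasets, it can help to confirm that the archives unpacked into the layout ImageFolder expects, that is, one sub-directory per class inside each split directory. The sketch below is a minimal check under that assumption; the helper name count_images_per_class and the accepted file extensions are illustrative choices, not part of the original tutorial.

```python
from pathlib import Path

# Assumed image extensions; adjust if the archives use others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(split_dir):
    """Return {class_name: image_count} for a directory laid out as split/class/image."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for path in class_dir.iterdir()
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


if __name__ == "__main__":
    # Only report splits that actually exist in the working directory.
    for split in ["train", "validation", "test"]:
        if Path(split).is_dir():
            print(split, count_images_per_class(split))
```

A split that reports no classes, or a class with zero images, usually means the archive was unpacked into a different directory than the one the notebook runs in.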

Install Necessary Packages¶

In [ ]:

            
!pip install torch torchvision

Create a Training Dataset¶

The following code block is adapted from this link and has been modified for this tutorial's needs.

In [ ]:

            
from torch.utils.data import DataLoader
from torchvision import transforms as T
from torchvision.datasets import ImageFolder

data_transforms = {
    "train": T.Compose(
        [
            T.RandomResizedCrop(224),
            T.RandomHorizontalFlip(),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    ),
    "validation": T.Compose(
        [
            T.Resize(224),
            T.CenterCrop(224),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    ),
}

image_datasets = {
    split: ImageFolder(split, data_transforms[split])
    for split in ["train", "validation"]
}
dataloaders = {
    split: DataLoader(image_datasets[split], batch_size=8, shuffle=split == "train")
    for split in ["train", "validation"]
}
dataset_sizes = {split: len(image_datasets[split]) for split in ["train", "validation"]}
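The validation pipeline first applies T.Resize(224), which (when given a single integer) rescales the shorter side to 224 while keeping the aspect ratio, and then T.CenterCrop(224), which cuts out the central 224 x 224 window. The sketch below reproduces that size arithmetic in plain Python. The helper names and the 480 x 640 input are hypothetical, and the rounding follows my understanding of torchvision's behavior, so results may differ by a pixel from the library.

```python
def resize_shorter_side(height, width, size):
    """Scale so the shorter side equals `size`, keeping the aspect ratio."""
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size


def center_crop_box(height, width, crop):
    """Return (top, left, bottom, right) of a centered crop x crop window."""
    top = int(round((height - crop) / 2.0))
    left = int(round((width - crop) / 2.0))
    return top, left, top + crop, left + crop


# Hypothetical 480 x 640 (height x width) input image.
resized = resize_shorter_side(480, 640, 224)
box = center_crop_box(*resized, 224)
print(resized)  # (224, 298)
print(box)      # (0, 37, 224, 261)
```

Square inputs are simply scaled to 224 x 224 and the crop keeps the whole image; only non-square inputs lose pixels at the edges of the longer side.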

Prepare the Model¶

Create a Model Training Function¶

In [3]:

            
import time
import copy
import torch


def train_model(model, criterion, optimizer, num_epochs=3):
    start_time = time.time()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    best_model_weights = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print(f"Epoch {epoch}/{num_epochs - 1}")
        print("-" * 10)

        # Every epoch goes through a training and validation phase
        for phase in ["train", "validation"]:
            if phase == "train":
                model.train()
            else:
                model.eval()

            running_loss = 0.0
            running_corrects = 0

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                optimizer.zero_grad()

                # Forward propagation 
                with torch.set_grad_enabled(phase == "train"):
                    outputs = model(inputs)
                    preds = torch.argmax(outputs, dim=1)
                    loss = criterion(outputs, labels)

                    # Backward propagation during training phase only 
                    if phase == "train":
                        loss.backward()
                        optimizer.step()

                # Statistics 
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects / dataset_sizes[phase]

            print(f"{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}")

            # Save if the model accuracy is higher than the previous accuracy 
            if phase == "validation" and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_weights = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - start_time
    print(f"Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s")
    print(f"Best val Acc: {best_acc:4f}")

    model.load_state_dict(best_model_weights)
    return model
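In the statistics step, the batch loss is multiplied by the batch size before being accumulated. CrossEntropyLoss returns the mean loss over a batch by default, so this multiplication recovers the summed loss, and dividing by the dataset size at the end yields a true per-sample average even when the last batch is smaller. The sketch below illustrates the difference with made-up numbers; the function name and the loss values are hypothetical.

```python
def epoch_average(batch_means, batch_sizes):
    """Average per-sample loss over an epoch from per-batch mean losses."""
    running_loss = 0.0
    for mean_loss, size in zip(batch_means, batch_sizes):
        running_loss += mean_loss * size  # undo the per-batch averaging
    return running_loss / sum(batch_sizes)


# Hypothetical epoch: two full batches of 8 and a final batch of 3.
batch_means = [0.50, 0.25, 1.00]
batch_sizes = [8, 8, 3]

weighted = epoch_average(batch_means, batch_sizes)
naive = sum(batch_means) / len(batch_means)  # ignores batch sizes
print(f"{weighted:.4f}")  # 0.4737
print(f"{naive:.4f}")     # 0.5833
```

The naive average over-weights the small final batch, which is why the training function tracks a size-weighted running total instead.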

Load the Model¶

This tutorial uses MobileViTv2 because it achieves high accuracy for a lightweight model.

In [ ]:

            
model = torch.hub.load("rwightman/pytorch-image-models", "mobilevitv2_050", pretrained=True, num_classes=3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

Train and Save a Model¶

In [5]:

            
trained_model = train_model(model, criterion, optimizer, num_epochs=1)

Epoch 0/0
----------
train Loss: 0.5641 Acc: 0.8008
validation Loss: 0.2618 Acc: 0.8947

Training complete in 1m 6s
Best val Acc: 0.894737

In [6]:

            
torch.save(trained_model, "trained_model.pth")

Create a Dataframe to Insert into ThanoSQL¶

In [7]:

            
import numpy as np
import pandas as pd

test_dataset = ImageFolder("test", data_transforms["validation"])

data = np.stack([img.numpy() for img, _ in test_dataset])
df = pd.DataFrame(pd.Series(data.tolist()), columns=["image"])  # the column name must be "image"
df.to_pickle("test_data.pkl")

0. Prepare Dataset¶

As mentioned in the ThanoSQL Workspace, you must create an API token and run the cell below before you can execute ThanoSQL queries.

In [ ]:

            
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

Prepare Dataset¶

In [9]:

            
%%thanosql
COPY beans_test 
OPTIONS (if_exists='replace')
FROM 'test_data.pkl'

Success

Query Details

"COPY" specifies the name of the dataset to be saved as a database table.
"OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines what happens when the table already exists: raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
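The three "if_exists" behaviors can be pictured with a toy in-memory model. The sketch below is only an illustration of the semantics described above; copy_into and the tables dictionary are hypothetical and have nothing to do with how ThanoSQL implements the COPY clause.

```python
def copy_into(tables, name, rows, if_exists="fail"):
    """Toy in-memory model of the three if_exists behaviors (not ThanoSQL code)."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown if_exists value: {if_exists!r}")
    if name not in tables:
        tables[name] = list(rows)       # no conflict: create the table
    elif if_exists == "fail":
        raise ValueError(f"table {name!r} already exists")
    elif if_exists == "replace":
        tables[name] = list(rows)       # discard the old rows
    else:
        tables[name].extend(rows)       # keep the old rows and add the new ones
    return tables


tables = {}
copy_into(tables, "beans_test", [1, 2])                     # creates the table
copy_into(tables, "beans_test", [3], if_exists="append")    # rows are now [1, 2, 3]
copy_into(tables, "beans_test", [9], if_exists="replace")   # rows are now [9]
print(tables["beans_test"])  # [9]
```

With the default 'fail', rerunning the same COPY query would raise an error, which is why the tutorial query sets if_exists='replace'.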

1. Check Dataset¶

For this tutorial, we use the beans_test table located in the ThanoSQL workspace database. Run the query below to check the contents of the table.

In [10]:

            
%%thanosql
SELECT *
FROM beans_test
LIMIT 5

Out[10]:

	image
0	[[[-0.028684020042419434, -0.04580877348780632...
1	[[[-0.0629335269331932, -0.0629335269331932, -...
2	[[[1.9577873945236206, 1.8721636533737183, 1.7...
3	[[[0.21106265485286713, 0.0569397434592247, -0...
4	[[[-1.3815395832061768, -1.432913899421692, -1...

Understanding the Data Table

The beans_test table contains the following information.

image: the preprocessed image (resized, center-cropped, and normalized), stored as a nested numeric array converted from NumPy
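The values in the image column fall outside the usual [0, 1] pixel range because the test images went through the same T.Normalize step as the validation images, which subtracts a per-channel mean and divides by a per-channel standard deviation. The sketch below shows that arithmetic and its inverse for a single pixel; the helper names are illustrative, while the mean and standard deviation values are the ones used earlier in this tutorial.

```python
# Per-channel statistics passed to T.Normalize in this tutorial (RGB order).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize(pixel, channel):
    """Map a [0, 1] pixel value to its normalized value for a channel."""
    return (pixel - MEAN[channel]) / STD[channel]


def denormalize(value, channel):
    """Invert normalize() to recover the original [0, 1] pixel value."""
    return value * STD[channel] + MEAN[channel]


# Range of possible values in the red channel (index 0).
print(f"{normalize(0.0, 0):.4f}")  # -2.1179
print(f"{normalize(1.0, 0):.4f}")  # 2.2489

# Round trip for a hypothetical mid-gray pixel.
print(f"{denormalize(normalize(0.5, 0), 0):.4f}")  # 0.5000
```

Applying denormalize to every value of a stored array recovers pixel intensities that can be displayed as an ordinary image.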

2. Upload Custom Model¶

To upload a model created using Python in the previous step, run the following query and save the model as beans_mobilevit.

In [11]:

            
%%thanosql
UPLOAD MODEL beans_mobilevit
OPTIONS (
    framework='pytorch',
    overwrite=True
    )
FROM 'trained_model.pth'

Success

Query Details

"UPLOAD MODEL" uploads the model to the ThanoSQL workspace under the name beans_mobilevit.
"OPTIONS" specifies the option values to be used for the UPLOAD MODEL clause.
- "framework": specifies the model framework (str, default: 'pytorch')
- "overwrite": determines whether to overwrite a model if it already exists. If set as True, the old model is replaced with the new model (bool, optional, True|False, default: False)

Currently, ThanoSQL supports only PyTorch models for the UPLOAD MODEL clause.

3. Predict Using a Custom Model¶

To predict the class of each bean leaf image using the custom model, run the following query.

In [12]:

            
%%thanosql
PREDICT USING beans_mobilevit
OPTIONS (
    result_col='predicted'
    )
AS (
    SELECT *
    FROM beans_test
    ORDER BY RANDOM()
    LIMIT 5
    )

Out[12]:

	image	predicted
0	[[[-0.7650483846664429, -0.7821731567382812, -...	[-2.5088071823120117, -0.03282929211854935, 2....
1	[[[1.4097952842712402, 1.3926706314086914, 1.3...	[-1.7204804420471191, -1.7354539632797241, 3.5...
2	[[[-1.1760425567626953, -1.1931673288345337, -...	[-0.5441469550132751, 2.5831964015960693, -2.0...
3	[[[-1.2445416450500488, -1.278791069984436, -1...	[-1.5955406427383423, -2.174574613571167, 3.78...
4	[[[0.4165596663951874, 0.33093592524528503, 0....	[-1.5648517608642578, -1.0658249855041504, 2.6...

Query Details

"PREDICT USING" predicts the outcome using the beans_mobilevit model.
"OPTIONS" specifies the option values to be used for prediction.
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')

In [13]:

            
pred_df = _  # `_` holds the output of the most recently executed cell
# The query above set result_col='predicted', so the logits are in that column.
# Convert each logit vector to a class index, then to a class name.
pred_df["predicted"] = pred_df["predicted"].apply(np.argmax)
pred_df["predicted"] = pred_df["predicted"].apply(test_dataset.classes.__getitem__)
pred_df

Out[13]:

	image	predicted
0	[[[-0.7650483846664429, -0.7821731567382812, -...	healthy
1	[[[1.4097952842712402, 1.3926706314086914, 1.3...	healthy
2	[[[-1.1760425567626953, -1.1931673288345337, -...	bean_rust
3	[[[-1.2445416450500488, -1.278791069984436, -1...	healthy
4	[[[0.4165596663951874, 0.33093592524528503, 0....	healthy
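The model returns one raw score (logit) per class, and the post-processing step picks the index of the largest score and maps it to a class name. The sketch below shows the same idea in plain Python and additionally converts the logits to probabilities with a softmax. The logits are made up for illustration, and the class list assumes the dataset folders are named angular_leaf_spot, bean_rust, and healthy, which ImageFolder would index in alphabetical order.

```python
import math

# Assumed class folder names, in the alphabetical order ImageFolder assigns.
CLASSES = ["angular_leaf_spot", "bean_rust", "healthy"]


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_class(logits):
    """Return (class_name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[index], probs[index]


# Hypothetical logits, not taken from the query output above.
label, prob = top_class([-1.0, 2.0, 0.0])
print(label)          # bean_rust
print(f"{prob:.2f}")  # 0.84
```

Softmax does not change which class wins, so the argmax of the logits is enough for the label; the probability is only useful when a confidence value is also needed.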

4. In Conclusion¶

In this tutorial, we uploaded a custom model to ThanoSQL and used it to predict the classes of bean leaf images. You can refer back to this tutorial to upload your own custom model and use it within ThanoSQL.

Inquiries About Deploying a Model for Your Own Service

If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊

For inquiries regarding building a user-defined model: contact@smartmind.team

Last update: 2023-08-31