Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. Archil makes it simple to run Jupyter notebooks in a completely serverless manner, storing all data and notebooks directly on Amazon S3 while maintaining interactive performance through intelligent caching.

Persistent Storage

All notebooks, datasets, and trained models are automatically stored in S3 and persist across compute sessions.

Cost Efficiency

Pay only for high-speed storage while your workspace is active; when idle, data remains in S3 at standard storage costs.

Scalability

Scale compute resources up or down without data migration or storage limits.

Collaboration

Multiple team members can access the same datasets and notebooks by mounting the same Archil disk.

This guide walks you through setting up a serverless Jupyter environment for data science workflows, including training a PyTorch model on the MNIST dataset with all data stored in S3.

Create an Archil disk

First, follow the Archil Getting Started Guide to create an Archil disk that will serve as your data science workspace.

Mount your Archil disk

Mount your Archil disk to create your data science workspace:
# Create mount directory
sudo mkdir -p /mnt/archil

# Mount Archil disk
sudo archil mount dsk-DISKID /mnt/archil --region aws-us-east-1 --auth-token TOKEN

# Create data science workspace directories
sudo mkdir -p /mnt/archil/datascience/notebooks
sudo mkdir -p /mnt/archil/datascience/datasets
sudo mkdir -p /mnt/archil/datascience/models
sudo mkdir -p /mnt/archil/datascience/venv
sudo chown -R $USER:$USER /mnt/archil/datascience
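
Before moving on, it is worth sanity-checking that the disk is actually mounted at the expected path (the exact filesystem details shown will vary by environment):
# Verify the Archil disk is mounted
df -h /mnt/archil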

Set up Python virtual environment on Archil disk

Create a virtual environment directly on your Archil disk to ensure all dependencies persist in S3:
# Create virtual environment on Archil disk
cd /mnt/archil/datascience
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate

# Upgrade pip and install data science packages
pip install --upgrade pip
pip install jupyter torch torchvision matplotlib numpy pandas

# Verify installation
python -c "import torch; print(f'PyTorch version: {torch.__version__}')"
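
Because the environment lives on the Archil disk, a fresh compute instance can reuse it without reinstalling anything, assuming it mounts the disk at the same path and runs a compatible Python version (virtual environments embed absolute paths and the interpreter version):
# On a new instance: mount the disk, then reuse the persisted environment
source /mnt/archil/datascience/venv/bin/activate
python -c "import torch; print(torch.__version__)"  # packages persist from S3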

Configure Jupyter to use the virtual environment

Set up Jupyter to use your virtual environment and Archil workspace:
# Generate Jupyter config (while venv is activated)
jupyter notebook --generate-config

# Create Jupyter configuration
cat > ~/.jupyter/jupyter_notebook_config.py << 'EOF'
c.NotebookApp.notebook_dir = '/mnt/archil/datascience/notebooks'
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False
c.NotebookApp.allow_root = True
EOF
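
Note that Jupyter Notebook 7 and later run on Jupyter Server, which reads c.ServerApp options from ~/.jupyter/jupyter_server_config.py rather than the classic c.NotebookApp settings shown above. If you are on a newer release, the equivalent configuration would look roughly like this:
# ~/.jupyter/jupyter_server_config.py (Jupyter Notebook 7+ / Jupyter Server)
c.ServerApp.root_dir = '/mnt/archil/datascience/notebooks'
c.ServerApp.ip = '0.0.0.0'
c.ServerApp.port = 8888
c.ServerApp.open_browser = False
c.ServerApp.allow_root = True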

Download the MNIST training notebook

Instead of creating the notebook from scratch, download our pre-built tutorial notebook:
# Navigate to notebooks directory
cd /mnt/archil/datascience/notebooks

# Download the MNIST PyTorch tutorial notebook
curl -O https://s3.amazonaws.com/archil-client/docs/artifacts/guides/data-science/mnist-pytorch-tutorial.ipynb

# Verify the download
ls -la mnist-pytorch-tutorial.ipynb

Start Jupyter and run the tutorial

Start the Jupyter notebook server with your virtual environment activated:
# Make sure you're in the Archil disk and virtual environment is activated
cd /mnt/archil/datascience
source venv/bin/activate

# Start Jupyter notebook
jupyter notebook --notebook-dir=/mnt/archil/datascience/notebooks
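If your compute instance is remote, one option is to forward the port over SSH rather than exposing Jupyter publicly (the user and hostname below are placeholders for your own instance):
# Forward the remote Jupyter port to your local machine
ssh -L 8888:localhost:8888 user@your-instance-ip
# Then browse to http://localhost:8888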
Open your browser and navigate to the Jupyter interface. You’ll see the mnist-pytorch-tutorial.ipynb notebook ready to run. The notebook includes:
  • Data loading: Downloads MNIST dataset directly to your Archil disk
  • Model definition: Simple neural network for digit classification
  • Training loop: Complete training pipeline with progress tracking
  • Evaluation: Model accuracy assessment on test data
  • Persistence: Automatic model saving to S3 via Archil
  • Visualization: Training progress and prediction examples
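
For reference, the persistence step boils down to a standard PyTorch checkpoint written to the models directory on the Archil disk; a minimal sketch (variable names are illustrative and may differ slightly from the notebook):
# Save a checkpoint to the Archil disk; Archil persists it to S3
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'accuracy': test_accuracy,
}, '../models/mnist_model.pth')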

Advanced workflows

Loading pre-trained models

import torch

# Load a previously saved checkpoint from the models directory on the Archil disk.
# The model and optimizer must be instantiated with the same architecture first;
# pass map_location='cpu' to torch.load if the checkpoint was saved on a GPU
# instance and you are now on a CPU-only one.
checkpoint = torch.load('../models/mnist_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
print(f'Loaded model with accuracy: {checkpoint["accuracy"]:.2f}%')

Working with larger datasets

import os

import torch
from PIL import Image

# For larger datasets, you can stream data directly from S3.
# The Archil cache will intelligently manage frequently accessed data.

# Example: custom dataset class for large image datasets
class LargeImageDataset(torch.utils.data.Dataset):
    def __init__(self, data_dir, transform=None):
        self.data_dir = data_dir  # Points to an Archil-mounted directory
        self.transform = transform
        self.image_files = sorted(os.listdir(data_dir))

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Files are loaded on demand from S3 via the Archil cache
        img_path = os.path.join(self.data_dir, self.image_files[idx])
        image = Image.open(img_path).convert('RGB')
        if self.transform:
            image = self.transform(image)
        return image
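
A typical way to wire this into training (the dataset path here is hypothetical; point it at wherever your images live on the disk):
from torchvision import transforms

dataset = LargeImageDataset(
    '/mnt/archil/datascience/datasets/images',
    transform=transforms.ToTensor(),
)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)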

Monitoring and optimization

You can keep your workspace healthy by periodically checking how much space your datasets, models, and virtual environment consume on the disk, and by pruning dependencies you no longer need.
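For example:
# See how much each part of the workspace consumes
du -sh /mnt/archil/datascience/*

# Review installed packages in the persisted environment
source /mnt/archil/datascience/venv/bin/activate
pip list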

Cleanup

When you’re done with your session, you can safely stop Jupyter:
# Stop Jupyter
# Ctrl+C in the terminal where Jupyter is running
Your data remains safely stored in S3 and can be accessed again by mounting the same disk in future sessions.

Next steps

This tutorial demonstrated the basics of serverless Jupyter notebooks with PyTorch. You can extend this setup for more complex workflows:
  • Distributed training across multiple compute instances
  • Hyperparameter tuning with automated experiment tracking
  • Model serving by deploying trained models from S3
  • Data pipelines that process large datasets stored in object storage
All of these extensions keep the serverless benefits of Archil’s intelligent caching and S3 integration.