# Jupyter Notebooks

> Learn how to run Jupyter notebooks with PyTorch and MNIST training directly on Amazon S3 using Archil

[Jupyter Notebook](https://jupyter.org/) is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. Archil makes it simple to run Jupyter notebooks in a completely serverless manner, storing all data and notebooks directly on Amazon S3 while maintaining interactive performance through intelligent caching.

<CardGroup cols={2}>
  <Card title="Persistent Storage" icon="database">
    All notebooks, datasets, and trained models are automatically stored in S3 and persist across compute sessions
  </Card>

  <Card title="Cost Efficiency" icon="dollar-sign">
    Pay only for high-speed storage when active. Data remains in S3 at standard costs when idle
  </Card>

  <Card title="Scalability" icon="arrows-up-down">
    Scale compute resources up or down without data migration or storage limits
  </Card>

  <Card title="Collaboration" icon="users">
    Multiple team members can access the same datasets and notebooks by mounting the same Archil disk
  </Card>
</CardGroup>

This guide will walk you through setting up a serverless Jupyter environment for data science workflows, including training a PyTorch model on the MNIST dataset with all data stored in S3.

## Create an Archil disk

First, follow the Archil [Getting Started Guide](/getting-started/quickstart) to create an Archil disk that will serve as your data science workspace.

## Mount your Archil disk

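Mount your Archil disk and create the workspace directories. Replace `<token>` and `<disk-name>` with your own mount token and disk name, and adjust `--region` if your disk is not in `aws-us-east-1`: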

```bash theme={null}
# Create mount directory
sudo mkdir -p /mnt/archil

# Mount Archil disk
export ARCHIL_MOUNT_TOKEN="<token>"
sudo --preserve-env=ARCHIL_MOUNT_TOKEN archil mount <disk-name> /mnt/archil --region aws-us-east-1

# Create datascience workspace directories
sudo mkdir -p /mnt/archil/datascience/notebooks
sudo mkdir -p /mnt/archil/datascience/datasets
sudo mkdir -p /mnt/archil/datascience/models
sudo mkdir -p /mnt/archil/datascience/venv
sudo chown -R $USER:$USER /mnt/archil/datascience
```
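Before installing anything, it can be worth confirming that the workspace layout exists and is writable by your user. The following is a minimal sketch, assuming the directory names from the commands above; `check_workspace` is a hypothetical helper, not part of the Archil tooling.

```python
import os

# Workspace subdirectories created in the mount step above.
SUBDIRS = ("notebooks", "datasets", "models", "venv")


def check_workspace(root):
    """Return a list of problems found under `root` (an empty list means OK)."""
    problems = []
    for name in SUBDIRS:
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            problems.append(f"missing directory: {path}")
        elif not os.access(path, os.W_OK):
            problems.append(f"not writable: {path}")
    return problems


if __name__ == "__main__":
    # Paths below are the ones used in this guide.
    if not os.path.ismount("/mnt/archil"):
        print("warning: /mnt/archil is not a mount point")
    for line in check_workspace("/mnt/archil/datascience") or ["workspace looks ready"]:
        print(line)
```

An empty result means all four directories are present and writable, which is what the `chown` step is meant to guarantee.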

## Set up Python virtual environment on Archil disk

Create a virtual environment directly on your Archil disk to ensure all dependencies persist in S3:

```bash theme={null}
# Create virtual environment on Archil disk
cd /mnt/archil/datascience
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate

# Upgrade pip and install data science packages
pip install --upgrade pip
pip install jupyter torch torchvision matplotlib numpy pandas

# Verify installation
python -c "import torch; print(f'PyTorch version: {torch.__version__}')"
```
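If you want a slightly broader check than the one-liner above, the sketch below reports whether the interpreter is running inside a virtual environment and which of the installed packages cannot be found. It uses only the standard library; the helper names are hypothetical, and `notebook` is used as the import name on the assumption that `pip install jupyter` pulls it in.

```python
import importlib.util
import sys


def running_in_venv(prefix=None, base_prefix=None):
    """True when the interpreter prefix differs from the base installation."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


def missing_packages(names):
    """Return the top-level module names in `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("interpreter prefix:", sys.prefix)
    print("inside a virtual environment:", running_in_venv())
    # Import names corresponding to the `pip install` line above.
    wanted = ["notebook", "torch", "torchvision", "matplotlib", "numpy", "pandas"]
    print("missing packages:", missing_packages(wanted) or "none")
```

When the environment on the Archil disk is active, the printed prefix should point at `/mnt/archil/datascience/venv`.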

## Configure Jupyter to use the virtual environment

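Configure Jupyter to serve notebooks from the Archil workspace. The settings below listen on all network interfaces on port 8888 and do not open a local browser. Because `0.0.0.0` makes the server reachable from other hosts, restrict access to that port with your firewall or security group rules: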

```bash theme={null}
# Generate Jupyter config (while venv is activated)
jupyter notebook --generate-config

# Replace the generated config with the settings for this workspace
cat > ~/.jupyter/jupyter_notebook_config.py << 'EOF'
c.NotebookApp.notebook_dir = '/mnt/archil/datascience/notebooks'
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False
c.NotebookApp.allow_root = True
EOF
```

## Download the MNIST training notebook

Instead of creating the notebook from scratch, download our pre-built tutorial notebook:

```bash theme={null}
# Navigate to notebooks directory
cd /mnt/archil/datascience/notebooks

# Download the MNIST PyTorch tutorial notebook
curl -O https://s3.amazonaws.com/archil-client/docs/artifacts/guides/data-science/mnist-pytorch-tutorial.ipynb

# Verify the download
ls -la mnist-pytorch-tutorial.ipynb
```
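`ls` only confirms that a file exists. If the download failed, the file may contain an error page instead of a notebook. As a rough additional check, the sketch below parses the file as JSON and looks for the top-level keys that version 4 notebook files carry; `summarize_notebook` is a hypothetical helper written for this guide.

```python
import json


def summarize_notebook(path):
    """Parse an .ipynb file and return (nbformat_major, number_of_cells)."""
    with open(path, encoding="utf-8") as handle:
        # json.JSONDecodeError (a ValueError subclass) signals a bad download.
        notebook = json.load(handle)
    if not isinstance(notebook, dict) or "cells" not in notebook or "nbformat" not in notebook:
        raise ValueError(f"{path} does not look like a Jupyter notebook")
    return notebook["nbformat"], len(notebook["cells"])


if __name__ == "__main__":
    path = "/mnt/archil/datascience/notebooks/mnist-pytorch-tutorial.ipynb"
    try:
        version, cells = summarize_notebook(path)
        print(f"nbformat {version}, {cells} cells")
    except (OSError, ValueError) as error:
        print(f"notebook check failed: {error}")
```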

## Start Jupyter and run the tutorial

Start the Jupyter notebook server from your virtual environment:

```bash theme={null}
# Make sure you're on the Archil disk and the virtual environment is activated
cd /mnt/archil/datascience
source venv/bin/activate

# Start Jupyter notebook
jupyter notebook --notebook-dir=/mnt/archil/datascience/notebooks
```

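Open your browser and navigate to the Jupyter interface on port 8888 of the machine running the server, using the URL with the access token that Jupyter prints in the terminal at startup. You'll see the `mnist-pytorch-tutorial.ipynb` notebook ready to run.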

The notebook includes:

* **Data loading**: Downloads the MNIST dataset directly to your Archil disk
* **Model definition**: Simple neural network for digit classification
* **Training loop**: Complete training pipeline with progress tracking
* **Evaluation**: Model accuracy assessment on test data
* **Persistence**: Automatic model saving to S3 via Archil
* **Visualization**: Training progress and prediction examples

## Advanced workflows

### Loading pre-trained models

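```python theme={null}
import torch

# Load a previously saved checkpoint (path is relative to the notebooks directory).
# Assumes `model` and `optimizer` are already constructed as in the tutorial notebook.
checkpoint = torch.load('../models/mnist_model.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
print(f'Loaded model with accuracy: {checkpoint["accuracy"]:.2f}%')
```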
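The loading code reads three keys from the checkpoint: `model_state_dict`, `optimizer_state_dict`, and `accuracy`. The sketch below shows one way a checkpoint with that shape could be written and read back. It is an illustration, not the notebook's own saving code: the `nn.Linear` model is a stand-in, `save_checkpoint` is a hypothetical helper, the accuracy value is a placeholder, and a temporary directory takes the place of `/mnt/archil/datascience/models`.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, accuracy):
    """Write a checkpoint using the keys that the loading snippet reads."""
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "accuracy": accuracy,
        },
        path,
    )


# Stand-in model and optimizer; the tutorial notebook defines its own.
model = nn.Linear(28 * 28, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# A temporary directory is used here so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as model_dir:
    path = os.path.join(model_dir, "mnist_model.pth")
    save_checkpoint(path, model, optimizer, accuracy=97.5)  # placeholder value
    checkpoint = torch.load(path, map_location="cpu")

print(sorted(checkpoint))
```

Saving state dictionaries rather than the whole model object keeps the file loadable even if the class definition moves, as long as the architecture matches.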

### Working with larger datasets

```python expandable theme={null}
# For larger datasets, you can stream data directly from S3
# The Archil cache will intelligently manage frequently accessed data
import os

import torch
from PIL import Image  # Pillow is installed as a torchvision dependency

# Example: Custom dataset class for large image datasets
class LargeImageDataset(torch.utils.data.Dataset):
    def __init__(self, data_dir, transform=None):
        self.data_dir = data_dir  # Points to Archil-mounted directory
        self.transform = transform
        # Sort so the index-to-file mapping is stable across runs
        self.image_files = sorted(os.listdir(data_dir))

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Files are loaded on-demand from S3 via Archil cache
        img_path = os.path.join(self.data_dir, self.image_files[idx])
        # One possible loading step (an assumption): decode with Pillow as RGB
        image = Image.open(img_path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        return image
```

## Monitoring and optimization

Monitor the workspace by checking how much space each directory under `/mnt/archil/datascience` uses, and keep the virtual environment lean by uninstalling packages you no longer need, since everything on the disk is persisted to S3.
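A rough way to see where space goes is to total file sizes per workspace directory. The sketch below uses only the standard library; `directory_size` is a hypothetical helper. What `shutil.disk_usage` reports for an Archil mount is not covered in this guide, so treat that figure as informational.

```python
import os
import shutil


def directory_size(path):
    """Total size in bytes of regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


if __name__ == "__main__":
    workspace = "/mnt/archil/datascience"  # path used throughout this guide
    if os.path.isdir(workspace):
        for entry in sorted(os.listdir(workspace)):
            sub = os.path.join(workspace, entry)
            if os.path.isdir(sub):
                print(f"{entry:<12}{directory_size(sub) / 1e6:10.1f} MB")
        usage = shutil.disk_usage(workspace)
        print(f"filesystem reports {usage.used / 1e9:.2f} GB used")
    else:
        print(f"{workspace} not found; is the disk mounted?")
```

Walking the `venv` directory touches many small files, so the first run may take noticeably longer than later ones.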

## Cleanup

When you're done with your session, you can safely stop Jupyter:

```bash theme={null}
# Stop Jupyter
# Ctrl+C in the terminal where Jupyter is running
```

Your data remains safely stored in S3 and can be accessed again by mounting the same disk in future sessions.

## Next steps

This tutorial demonstrated the basics of serverless Jupyter notebooks with PyTorch. You can extend this setup for more complex workflows:

* **Distributed training** across multiple compute instances
* **Hyperparameter tuning** with automated experiment tracking
* **Model serving** by deploying trained models from S3
* **Data pipelines** that process large datasets stored in object storage

Each of these workflows keeps the serverless benefits of Archil's intelligent caching and S3 integration.
