Learn how to run Jupyter notebooks with PyTorch and MNIST training directly on Amazon S3 using Archil
Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text. Archil makes it simple to run Jupyter notebooks in a completely serverless manner, storing all data and notebooks directly on Amazon S3 while maintaining interactive performance through intelligent caching.
Persistent Storage
All notebooks, datasets, and trained models are automatically stored in S3, persisting across compute sessions
Cost Efficiency
Pay only for high-speed storage when active. Data remains in S3 at standard costs when idle
Scalability
Scale compute resources up or down without data migration or storage limits
Collaboration
Multiple team members can access the same datasets and notebooks by mounting the same Archil disk
This guide will walk you through setting up a serverless Jupyter environment for data science workflows, including training a PyTorch model on the MNIST dataset with all data stored in S3.
Start the Jupyter notebook server with your virtual environment activated:
```bash
# Make sure you're in the Archil disk and the virtual environment is activated
cd /mnt/archil/datascience
source venv/bin/activate

# Start Jupyter notebook
jupyter notebook --notebook-dir=/mnt/archil/datascience/notebooks
```
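If your compute instance is remote, you can bind Jupyter to all interfaces and tunnel the port over SSH instead; a minimal sketch, where the user and host names are placeholders:

```bash
# Run headless on the remote instance
jupyter notebook --notebook-dir=/mnt/archil/datascience/notebooks --no-browser --ip=0.0.0.0

# From your local machine, forward the default Jupyter port
ssh -L 8888:localhost:8888 user@your-instance
```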
Open your browser and navigate to the Jupyter interface. You'll see the mnist-pytorch-tutorial.ipynb notebook ready to run. The notebook includes (a condensed sketch of these cells follows the list):
Data loading: Downloads MNIST dataset directly to your Archil disk
Model definition: Simple neural network for digit classification
Training loop: Complete training pipeline with progress tracking
Evaluation: Model accuracy assessment on test data
Persistence: Automatic model saving to S3 via Archil
Visualization: Training progress and prediction examples
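Condensed into a single listing, the core cells look roughly like this. This is a sketch rather than the notebook verbatim: it assumes torchvision is installed, uses ../data and ../models paths relative to the notebooks directory, and omits the visualization cells; the checkpoint format matches the loading snippet shown below.

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

# Data loading: MNIST downloads onto the Archil disk, so it persists in S3
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
train_set = datasets.MNIST('../data', train=True, download=True, transform=transform)
test_set = datasets.MNIST('../data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=1000)

# Model definition: simple fully connected network for digit classification
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return self.fc2(F.relu(self.fc1(x)))

model = Net()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Training loop with simple progress tracking
for epoch in range(3):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print(f'epoch {epoch} batch {batch_idx} loss {loss.item():.4f}')

# Evaluation: accuracy on held-out test data
model.eval()
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        correct += (model(data).argmax(dim=1) == target).sum().item()
accuracy = 100.0 * correct / len(test_set)
print(f'Test accuracy: {accuracy:.2f}%')

# Persistence: saving under ../models lands on the Archil disk, hence in S3
os.makedirs('../models', exist_ok=True)
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'accuracy': accuracy,
}, '../models/mnist_model.pth')
```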
```python
# Load a previously saved model
checkpoint = torch.load('../models/mnist_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
print(f'Loaded model with accuracy: {checkpoint["accuracy"]:.2f}%')
```
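After restoring, it can be worth sanity-checking the model on a batch of test data; a short sketch, assuming the test_loader from the training sketch above is still defined:

```python
# Quick sanity check: run the restored model on one test batch
model.eval()
with torch.no_grad():
    data, target = next(iter(test_loader))
    pred = model(data).argmax(dim=1)
    print(f'Batch accuracy after restore: {(pred == target).float().mean().item():.2%}')
```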
```python
# For larger datasets, you can stream data directly from S3.
# The Archil cache will intelligently manage frequently accessed data.
import os

from PIL import Image
import torch

# Example: Custom dataset class for large image datasets
class LargeImageDataset(torch.utils.data.Dataset):
    def __init__(self, data_dir, transform=None):
        self.data_dir = data_dir  # Points to Archil-mounted directory
        self.transform = transform
        self.image_files = os.listdir(data_dir)

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Files are loaded on-demand from S3 via the Archil cache
        img_path = os.path.join(self.data_dir, self.image_files[idx])
        # Load and process the image (one common approach)
        image = Image.open(img_path).convert('RGB')
        if self.transform:
            image = self.transform(image)
        return image
```
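To use it, wrap the dataset in a standard DataLoader; a sketch, where the image directory path is a hypothetical example:

```python
from torchvision import transforms

dataset = LargeImageDataset(
    '/mnt/archil/datascience/data/images',  # hypothetical image directory on the mounted disk
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]),
)
# Multiple workers read files in parallel through the Archil cache
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
```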
To keep the workspace healthy, periodically check disk usage on the Archil mount and prune virtual environment dependencies you no longer need.
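A couple of standard commands cover the basics:

```bash
# See how much space each part of the workspace is using
du -sh /mnt/archil/datascience/*

# Review installed packages in the virtual environment
source /mnt/archil/datascience/venv/bin/activate
pip list
```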