This guide deploys on Linux (Ubuntu 22.04 LTS). All software used is open source, which keeps the stack stable and under your own control over the long term.
First, update the system and install the core dependencies. Run the following commands:
```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv nginx redis-server \
    postgresql postgresql-contrib libpq-dev poppler-utils \
    tesseract-ocr tesseract-ocr-chi-sim libreoffice pandoc
```
Once installation finishes, start PostgreSQL and Redis and enable them at boot:
```bash
sudo systemctl enable postgresql
sudo systemctl start postgresql
sudo systemctl enable redis-server
sudo systemctl start redis-server
```
Switch to the postgres user to create the database and a dedicated database user:
```bash
sudo -u postgres psql
```
At the PostgreSQL interactive prompt, run the following SQL commands in order:
```sql
CREATE USER archive_admin WITH PASSWORD 'YourStrongPassword123!';
CREATE DATABASE military_prison_archive OWNER archive_admin;
\c military_prison_archive

-- Core archive table
CREATE TABLE IF NOT EXISTS archive_documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    archive_number VARCHAR(50) UNIQUE NOT NULL,
    document_type VARCHAR(20) NOT NULL CHECK (document_type IN ('military', 'prison')),
    title TEXT NOT NULL,
    original_filename TEXT,
    storage_path TEXT NOT NULL,
    file_hash VARCHAR(64) UNIQUE NOT NULL,
    metadata JSONB DEFAULT '{}',
    security_level INTEGER NOT NULL CHECK (security_level BETWEEN 1 AND 5),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    indexed_at TIMESTAMP WITH TIME ZONE
);

-- Full-text search index.
-- A stock PostgreSQL install has no 'chinese' text search configuration, so
-- 'simple' is used here. Chinese word segmentation needs a parser extension
-- such as zhparser.
CREATE INDEX idx_fts_search ON archive_documents
    USING GIN (to_tsvector('simple', title));

-- Audit log table
CREATE TABLE audit_logs (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(50),
    action VARCHAR(100) NOT NULL,
    document_id UUID REFERENCES archive_documents(id) ON DELETE SET NULL,
    ip_address INET,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
\q
```
Create the project directory and initialize a virtual environment:
```bash
mkdir -p /opt/archive-system && cd /opt/archive-system
python3 -m venv venv
source venv/bin/activate
```
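Create a requirements.txt file with the following contents:
```txt
Django==4.2.7
djangorestframework==3.14.0
psycopg2-binary==2.9.7
celery==5.3.1
django-celery-results==2.5.1
Pillow==10.0.0
pdf2image==1.16.3
python-magic==0.4.27
cryptography==41.0.4
django-guardian==2.4.0
# Used later in this guide (OCR task, Redis broker, application server); left unpinned here
pytesseract
redis
gunicorn
```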
Install the dependencies:
```bash
pip install -r requirements.txt
```
Create the Django project and app:
```bash
django-admin startproject archive_core .
python manage.py startapp archive_processor
```
Edit /opt/archive-system/archive_core/settings.py and configure the DATABASES section as follows:
```python
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'military_prison_archive',
        'USER': 'archive_admin',
        'PASSWORD': 'YourStrongPassword123!',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
```
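Append the following settings to the end of the file. The directories referenced here (/var/archive/media, /var/archive/static, /var/log/archive_system) must exist and be writable by the service user before the application starts:
```python
# Register the third-party and project apps used in this guide
INSTALLED_APPS += [
    'rest_framework',
    'django_celery_results',
    'archive_processor',
]

# File storage
MEDIA_ROOT = '/var/archive/media'
MEDIA_URL = '/media/'
STATIC_ROOT = '/var/archive/static'
STATIC_URL = '/static/'

# Celery
CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'django-db'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'

# Security
SECURE_HASH_ALGORITHM = 'sha256'
ALLOWED_FILE_EXTENSIONS = ['.pdf', '.tif', '.tiff', '.jpg', '.jpeg', '.png', '.doc', '.docx']
MAX_FILE_SIZE = 524288000  # 500 MB

# Logging
LOGGING = {
    'version': 1,
    'handlers': {
        'file': {
            'level': 'INFO',
            'class': 'logging.FileHandler',
            'filename': '/var/log/archive_system/archive.log',
        },
    },
    'loggers': {
        'archive_processor': {
            'handlers': ['file'],
            'level': 'INFO',
        },
    },
}
```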
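The `ALLOWED_FILE_EXTENSIONS` and `MAX_FILE_SIZE` settings only take effect if something checks uploads against them. The following is a minimal sketch of such a check, not part of the original project; the helper name `validate_upload` is hypothetical and the two constants are copied from the settings so the example runs on its own.

```python
import os

# Values copied from the settings above so the sketch is self-contained
ALLOWED_FILE_EXTENSIONS = ['.pdf', '.tif', '.tiff', '.jpg', '.jpeg', '.png', '.doc', '.docx']
MAX_FILE_SIZE = 524288000  # 500 MB


def validate_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the upload is acceptable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_FILE_EXTENSIONS:
        problems.append(f"extension {ext or '(none)'} is not allowed")
    if size_bytes > MAX_FILE_SIZE:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_SIZE}")
    return problems


print(validate_upload('scan_001.TIF', 1024))          # no problems: extension check is case-insensitive
print(validate_upload('notes.txt', 600 * 1024 ** 2))  # wrong extension and too large
```

In a Django project, a check like this would typically be called from a serializer or form validator with the uploaded file's `name` and `size`.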
Edit /opt/archive-system/archive_processor/models.py:
```python
import hashlib
import uuid

from django.core.validators import FileExtensionValidator
from django.db import models


class ArchiveDocument(models.Model):
    DOCUMENT_TYPES = [
        ('military', 'Military archive'),
        ('prison', 'Prison archive'),
    ]

    # UUID primary key, matching the SQL schema and the API responses
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    archive_number = models.CharField(max_length=50, unique=True)
    document_type = models.CharField(max_length=20, choices=DOCUMENT_TYPES)
    title = models.TextField()
    original_file = models.FileField(
        upload_to='uploads/%Y/%m/%d/',
        validators=[FileExtensionValidator(
            allowed_extensions=['pdf', 'tif', 'tiff', 'jpg', 'jpeg', 'png', 'doc', 'docx']
        )],
    )
    storage_path = models.TextField()
    file_hash = models.CharField(max_length=64, unique=True)
    metadata = models.JSONField(default=dict)
    security_level = models.IntegerField(choices=[(i, f'Level {i}') for i in range(1, 6)])
    created_at = models.DateTimeField(auto_now_add=True)
    indexed_at = models.DateTimeField(null=True, blank=True)

    def calculate_file_hash(self):
        """Compute the SHA-256 hash of the file in 4 KB chunks."""
        sha256_hash = hashlib.sha256()
        self.original_file.open('rb')
        for chunk in self.original_file.chunks(4096):
            sha256_hash.update(chunk)
        # Rewind instead of closing so Django can still write the upload to storage
        self.original_file.seek(0)
        return sha256_hash.hexdigest()

    def save(self, *args, **kwargs):
        if self.original_file:
            self.file_hash = self.calculate_file_hash()
        super().save(*args, **kwargs)
        # The final path is only known once the file has been written to storage
        if self.original_file and self.storage_path != self.original_file.path:
            self.storage_path = self.original_file.path
            super().save(update_fields=['storage_path'])
```
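The chunked hashing used by `calculate_file_hash` can be tried outside Django. The sketch below is a standalone illustration using only the standard library; the function name `sha256_of_stream` is hypothetical. It shows that reading in 4096-byte blocks yields the same digest as hashing the whole content at once, which is what makes the unique `file_hash` column usable for duplicate detection.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=4096):
    """Hash a binary stream in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    for block in iter(lambda: stream.read(chunk_size), b""):
        digest.update(block)
    return digest.hexdigest()


data = b"archive page " * 1000

# Chunked and one-shot hashing agree
print(sha256_of_stream(io.BytesIO(data)) == hashlib.sha256(data).hexdigest())

# Two uploads with identical bytes produce the same hash, so the second one
# would violate the unique constraint on file_hash
print(sha256_of_stream(io.BytesIO(data)) == sha256_of_stream(io.BytesIO(data)))
```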
Create /opt/archive-system/archive_processor/tasks.py:
```python
import logging
import os
import tempfile

import pytesseract
from celery import shared_task
from pdf2image import convert_from_path

from .models import ArchiveDocument

logger = logging.getLogger(__name__)


@shared_task
def extract_text_from_document(document_id):
    """Extract text content from an archive file."""
    try:
        document = ArchiveDocument.objects.get(id=document_id)
        file_path = document.storage_path
        file_ext = os.path.splitext(file_path)[1].lower()
        extracted_text = ""

        if file_ext in ['.tif', '.tiff', '.jpg', '.jpeg', '.png']:
            # Image files: OCR directly
            extracted_text = pytesseract.image_to_string(file_path, lang='chi_sim')
        elif file_ext == '.pdf':
            # PDF files: render each page to an image, then OCR it
            with tempfile.TemporaryDirectory() as temp_dir:
                images = convert_from_path(file_path, dpi=300)
                for i, image in enumerate(images):
                    image_path = os.path.join(temp_dir, f'page_{i}.jpg')
                    image.save(image_path, 'JPEG')
                    page_text = pytesseract.image_to_string(image_path, lang='chi_sim')
                    extracted_text += f"Page {i + 1}\n{page_text}\n"

        # Update the document metadata
        document.metadata.update({
            'extracted_text': extracted_text[:10000],  # cap the stored text
            'text_length': len(extracted_text),
            'processed': True,
        })
        document.save()
        return {'status': 'success', 'document_id': str(document_id)}
    except Exception as e:
        logger.exception("Text extraction failed for document %s", document_id)
        return {'status': 'error', 'error': str(e)}
```
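The metadata update at the end of the task stores at most 10,000 characters of text but records the full length. The following standalone sketch isolates that step so it can be checked without Tesseract installed; the function name `build_text_metadata` and the exact page header format are assumptions made for illustration.

```python
def build_text_metadata(page_texts, max_chars=10000):
    """Join per-page OCR output and build the metadata fields stored by the task."""
    full_text = "".join(
        f"Page {number}\n{text}\n"
        for number, text in enumerate(page_texts, start=1)
    )
    return {
        'extracted_text': full_text[:max_chars],  # stored text is capped
        'text_length': len(full_text),            # length before truncation
        'processed': True,
    }


meta = build_text_metadata(["a" * 8000, "b" * 8000])
print(len(meta['extracted_text']))  # capped at max_chars
print(meta['text_length'])          # full length, including page headers
```

Because `text_length` can exceed the length of `extracted_text`, a client can compare the two to tell whether the stored text was truncated.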

Create /opt/archive-system/archive_processor/views.py:
```python
from rest_framework import status, viewsets
from rest_framework.decorators import action
from rest_framework.response import Response

from .models import ArchiveDocument
from .serializers import ArchiveDocumentSerializer
from .tasks import extract_text_from_document


class ArchiveDocumentViewSet(viewsets.ModelViewSet):
    queryset = ArchiveDocument.objects.all()
    serializer_class = ArchiveDocumentSerializer

    def create(self, request, *args, **kwargs):
        """Handle an archive file upload."""
        serializer = self.get_serializer(data=request.data)
        if serializer.is_valid():
            # Save the document record
            document = serializer.save()
            # Start text extraction asynchronously
            extract_text_from_document.delay(str(document.id))
            return Response({
                'status': 'success',
                'document_id': str(document.id),
                'archive_number': document.archive_number,
                'message': 'Archive uploaded successfully; processing in progress',
            }, status=status.HTTP_201_CREATED)
        return Response(serializer.errors, status=status.HTTP_400_BAD_REQUEST)

    @action(detail=True, methods=['get'])
    def search_text(self, request, pk=None):
        """Search the extracted text of a document for a keyword."""
        document = self.get_object()
        keyword = request.query_params.get('q', '')
        if not keyword:
            return Response({'error': 'Please provide a search keyword'}, status=400)
        if 'extracted_text' not in document.metadata:
            return Response({'error': 'Text extraction has not finished for this document'}, status=400)
        text_content = document.metadata.get('extracted_text', '')
        # Simple substring search (production deployments should use Elasticsearch)
        occurrences = text_content.lower().count(keyword.lower())
        return Response({
            'document_id': str(document.id),
            'keyword': keyword,
            'occurrences': occurrences,
            'title': document.title,
        })
```
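The `search_text` action only reports how many times the keyword occurs. If a little surrounding context is also useful, the same case-insensitive, non-overlapping search can be written as below. This is a sketch of one possible extension, not the view's actual behaviour; `find_keyword` and its `context` parameter are hypothetical names.

```python
import re


def find_keyword(text, keyword, context=20):
    """Case-insensitive search returning the match count and a snippet around each match."""
    if not keyword:
        return {'occurrences': 0, 'snippets': []}
    # re.escape treats the keyword as literal text, not as a pattern
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    snippets = [
        text[max(0, m.start() - context):m.end() + context]
        for m in pattern.finditer(text)
    ]
    return {'occurrences': len(snippets), 'snippets': snippets}


sample = "Retired in 2020; the retired officer returned."
result = find_keyword(sample, "retired", context=4)
print(result['occurrences'])
print(result['snippets'])
```

Matching on the original string with `re.IGNORECASE` keeps the snippet offsets aligned with the stored text, which lowercasing both strings first does not guarantee for every Unicode character.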
Create the configuration file /etc/nginx/sites-available/archive_system:
```nginx
server {
    listen 80;
    server_name archive.your-domain.com;

    location /static/ {
        alias /var/archive/static/;
        expires 30d;
    }

    location /media/ {
        alias /var/archive/media/;
        internal;
    }

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Upload size limit
        client_max_body_size 500M;
        proxy_read_timeout 300s;
    }
}
```
Enable the site and reload Nginx:
```bash
sudo ln -s /etc/nginx/sites-available/archive_system /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
```
Create the Gunicorn service file for the Django application at /etc/systemd/system/archive-api.service:
```ini
[Unit]
Description=Archive System API Service
After=network.target postgresql.service

[Service]
User=www-data
Group=www-data
WorkingDirectory=/opt/archive-system
Environment="PATH=/opt/archive-system/venv/bin"
ExecStart=/opt/archive-system/venv/bin/gunicorn archive_core.wsgi:application \
    --bind 127.0.0.1:8000 \
    --workers 3 \
    --worker-class sync \
    --timeout 120 \
    --access-logfile /var/log/archive_system/gunicorn_access.log \
    --error-logfile /var/log/archive_system/gunicorn_error.log

[Install]
WantedBy=multi-user.target
```
Create the Celery service file at /etc/systemd/system/archive-celery.service:
```ini
[Unit]
Description=Archive System Celery Service
After=network.target redis-server.service

[Service]
User=www-data
Group=www-data
WorkingDirectory=/opt/archive-system
Environment="PATH=/opt/archive-system/venv/bin"
ExecStart=/opt/archive-system/venv/bin/celery -A archive_core worker \
    --loglevel=info \
    --logfile=/var/log/archive_system/celery.log

[Install]
WantedBy=multi-user.target
```
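The `-A archive_core` argument assumes that the project package exposes a Celery application, which none of the earlier steps create. A minimal sketch of such a module follows, based on the layout described in Celery's Django integration guide; the file location /opt/archive-system/archive_core/celery.py is an assumption.

```python
# /opt/archive-system/archive_core/celery.py (assumed location)
import os

from celery import Celery

# Make the Django settings available to the worker process
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'archive_core.settings')

app = Celery('archive_core')

# Read every setting that starts with CELERY_ from settings.py
app.config_from_object('django.conf:settings', namespace='CELERY')

# Find tasks.py modules in all installed apps, including archive_processor
app.autodiscover_tasks()
```

For `@shared_task` functions to bind to this application when Django starts, the same guide suggests importing it in archive_core/__init__.py with `from .celery import app as celery_app`.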
Start the services and enable them at boot:
```bash
sudo systemctl daemon-reload
sudo systemctl start archive-api
sudo systemctl enable archive-api
sudo systemctl start archive-celery
sudo systemctl enable archive-celery
```
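Run the database migrations and create a superuser:
```bash
cd /opt/archive-system
source venv/bin/activate
python manage.py makemigrations
python manage.py migrate
python manage.py createsuperuser
```
Follow the prompts to enter the administrator username, email address, and password.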
Upload an archive file with curl:
```bash
curl -X POST http://archive.your-domain.com/api/documents/ \
    -H "Authorization: Token your-admin-token" \
    -F "archive_number=ML20230001" \
    -F "document_type=military" \
    -F "title=2023 annual personnel records of a unit" \
    -F "security_level=3" \
    -F "original_file=@/path/to/your/document.pdf"
```
Example of a successful response:
```json
{
    "status": "success",
    "document_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "archive_number": "ML20230001",
    "message": "Archive uploaded successfully; processing in progress"
}
```
```bash
# Fetch a single document record by ID
curl -X GET http://archive.your-domain.com/api/documents/a1b2c3d4-e5f6-7890-abcd-ef1234567890/ \
    -H "Authorization: Token your-admin-token"
```
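```bash
# Search the extracted text of a document for a keyword
# (退役 means "discharged from service"; the OCR output is Chinese)
curl -X GET "http://archive.your-domain.com/api/documents/a1b2c3d4-e5f6-7890-abcd-ef1234567890/search_text/?q=退役" \
    -H "Authorization: Token your-admin-token"
```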
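When the search endpoint is called from a script rather than from curl, a non-ASCII keyword should be percent-encoded before it goes into the query string. A small sketch using only the standard library; `search_url` is a hypothetical helper and `BASE_URL` reuses the placeholder host from the examples above.

```python
from urllib.parse import urlencode

# Placeholder host taken from the curl examples
BASE_URL = "http://archive.your-domain.com/api/documents/"


def search_url(document_id, keyword):
    """Build the search_text URL with the keyword percent-encoded as UTF-8."""
    query = urlencode({'q': keyword})
    return f"{BASE_URL}{document_id}/search_text/?{query}"


print(search_url("a1b2c3d4-e5f6-7890-abcd-ef1234567890", "退役"))
# http://archive.your-domain.com/api/documents/a1b2c3d4-e5f6-7890-abcd-ef1234567890/search_text/?q=%E9%80%80%E5%BD%B9
```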
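Create the backup script /opt/backup_archive.sh:
```bash
#!/bin/bash
BACKUP_DIR="/backup/archive"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="military_prison_archive"

# Create the backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Back up the database
sudo -u postgres pg_dump "$DB_NAME" > "$BACKUP_DIR/$DATE/db_backup.sql"

# Back up the uploaded files
tar -czf "$BACKUP_DIR/$DATE/media_backup.tar.gz" /var/archive/media/

# Keep the last 7 days of backups; only top-level backup directories are
# considered, so $BACKUP_DIR itself is never removed
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +

echo "Backup completed at $(date)" >> /var/log/archive_backup.log
```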
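The retention step in the script relies on directory modification times. Since each backup directory is already named with a `YYYYmmdd_HHMMSS` timestamp, the same decision can be made from the names alone, which is easier to test and unaffected by later changes to a directory's mtime. The sketch below shows that alternative, a swapped-in technique rather than what the script does; `expired_backups` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def expired_backups(dir_names, now, keep_days=7):
    """Return backup directory names (YYYYmmdd_HHMMSS) older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in dir_names:
        try:
            stamp = datetime.strptime(name, "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # ignore anything that is not a timestamped backup
        if stamp < cutoff:
            expired.append(name)
    return expired


now = datetime(2023, 6, 15, 2, 0, 0)
names = ["20230601_020000", "20230610_020000", "lost+found", "20230608_015959"]
print(expired_backups(names, now))
```

Passing `now` in explicitly keeps the function deterministic; a real cleanup job would call it with `datetime.now()` and the output of `os.listdir` on the backup directory.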
Schedule the backup to run every day at 2:00 a.m.:
```bash
sudo crontab -e
# Add the following line
0 2 * * * /bin/bash /opt/backup_archive.sh
```
Create a periodic verification script