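Environmental qualification documents typically include EIA approvals, pollutant discharge permits, and acceptance reports. They come in many formats (PDF, scans, Word), have strictly managed validity periods, and are closely related to one another. A competent archive management system has to solve three pain points: digitizing paper archives, storing them in a structured form, and sending automatic expiry reminders.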
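A microservice architecture is recommended, splitting the system into a user service, an archive service, an OCR service, a workflow service, and a notification service. The archive service is the core: it handles metadata management and physical storage for every file. Use PostgreSQL 14+ as the database, rely on its JSONB columns to store the dynamic attributes of qualification documents flexibly, and enable its full-text search support.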
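To make the idea of dynamic attributes concrete, here is a minimal sketch of how type-specific fields might be merged into a single JSON document before being written to a JSONB column. The function name `build_file_metadata` and every attribute key (`pollutant_types`, `annual_limit_tonnes`, `project_name`, `reviewer`) are illustrative assumptions, not part of any fixed schema.

```python
import json


def build_file_metadata(storage_path, original_filename, **dynamic_attrs):
    """Merge fixed file info with type-specific attributes into one JSON string."""
    metadata = {
        "storage_path": storage_path,
        "original_filename": original_filename,
    }
    # Skip attributes whose value is None so only known facts are stored.
    metadata.update({k: v for k, v in dynamic_attrs.items() if v is not None})
    # ensure_ascii=False keeps Chinese text readable in the stored JSON.
    return json.dumps(metadata, ensure_ascii=False, sort_keys=True)


# A discharge permit and an EIA approval carry different (hypothetical) attributes.
permit = build_file_metadata(
    "acme/DISCHARGE_PERMIT/0001.pdf",
    "permit.pdf",
    pollutant_types=["COD", "NH3-N"],
    annual_limit_tonnes=12.5,
)
eia = build_file_metadata(
    "acme/EIA_APPROVAL/0002.pdf",
    "eia.pdf",
    project_name="Plant expansion",
    reviewer=None,
)
```

Because each document type can carry its own keys, adding a new qualification type under this approach would not require a schema migration.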
The steps below are carried out on Ubuntu 22.04 LTS and use Docker containers to keep the environment consistent.
First, install Docker and Docker Compose:
```
sudo apt update
sudo apt install -y docker.io docker-compose
sudo systemctl enable --now docker
```

Create the database directory and start PostgreSQL:
```
mkdir -p ~/archive-data/postgres
docker run -d \
  --name postgres-archive \
  -e POSTGRES_PASSWORD=YourStrongPassword123 \
  -e POSTGRES_DB=archive_db \
  -v ~/archive-data/postgres:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:14-alpine
```

Connect to the database to create the core table structure:
```
docker exec -it postgres-archive psql -U postgres -d archive_db
```

At the psql prompt, run the following SQL:
```
CREATE TABLE environmental_qualifications (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    qualification_number VARCHAR(100) NOT NULL UNIQUE,
    company_name VARCHAR(200) NOT NULL,
    qualification_type VARCHAR(50) NOT NULL CHECK (
        qualification_type IN ('EIA_APPROVAL', 'DISCHARGE_PERMIT', 'ACCEPTANCE_REPORT')
    ),
    issue_date DATE NOT NULL,
    expiry_date DATE NOT NULL,
    issuing_authority VARCHAR(150),
    file_metadata JSONB NOT NULL DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'VALID' CHECK (
        status IN ('VALID', 'EXPIRING', 'EXPIRED', 'REVOKED')
    ),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_qualification_expiry ON environmental_qualifications(expiry_date);
CREATE INDEX idx_qualification_search ON environmental_qualifications USING GIN(file_metadata);
```

Most environmental qualification documents are scans, so the key fields need to be extracted automatically. The open-source Tesseract 5 engine, combined with image preprocessing, improves the recognition rate.
Install the OCR dependencies:
```
sudo apt install -y tesseract-ocr tesseract-ocr-chi-sim
pip install pytesseract Pillow opencv-python
```

Create the OCR processing script ocr_processor.py:
```
import re

import cv2
import pytesseract


def process_qualification_document(image_path):
    # Image preprocessing: load, convert to grayscale, then binarize
    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None for unreadable or non-image files (PDFs included)
        raise ValueError(f"Cannot read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

    # OCR with Simplified Chinese and English language data
    text = pytesseract.image_to_string(
        thresh, lang='chi_sim+eng', config='--oem 3 --psm 6'
    )

    # Extract key fields with regular expressions
    # (both full-width and ASCII colons are accepted after each label)
    extracted = {}

    # Issuing authority
    authority_pattern = r'发证机关[::]\s*([^\n]+)'
    match = re.search(authority_pattern, text)
    if match:
        extracted['issuing_authority'] = match.group(1).strip()

    # Certificate number
    number_pattern = r'编号[::]\s*([A-Za-z0-9-]+)'
    match = re.search(number_pattern, text)
    if match:
        extracted['qualification_number'] = match.group(1).strip()

    # Expiry date, normalized to YYYY-MM-DD
    date_pattern = r'有效期至[::]\s*(\d{4})年(\d{1,2})月(\d{1,2})日'
    match = re.search(date_pattern, text)
    if match:
        year, month, day = match.groups()
        extracted['expiry_date'] = f"{year}-{month.zfill(2)}-{day.zfill(2)}"

    return extracted
```

Use MinIO as distributed object storage in place of a traditional file system.
Deploy MinIO with Docker Compose:
```
version: '3.8'
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: archiveadmin
      MINIO_ROOT_PASSWORD: YourMinioPassword123
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - ~/archive-data/minio:/data
```

Create the file upload client storage_client.py (it needs the minio Python package, installable with pip install minio):
```
import os
import uuid

from minio import Minio


class ArchiveStorage:
    def __init__(self):
        self.client = Minio(
            "localhost:9000",
            access_key="archiveadmin",
            secret_key="YourMinioPassword123",
            secure=False
        )
        # Make sure the bucket exists
        if not self.client.bucket_exists("qualifications"):
            self.client.make_bucket("qualifications")

    def upload_file(self, file_path, company_name, doc_type):
        # Keep the source file's extension; fall back to .pdf when it has none
        ext = os.path.splitext(file_path)[1] or ".pdf"
        object_name = f"{company_name}/{doc_type}/{uuid.uuid4()}{ext}"
        self.client.fput_object(
            "qualifications",
            object_name,
            file_path
        )
        return object_name
```
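Company names are free text and may contain slashes or whitespace, which would change the folder layout of the object key. The following is a minimal sketch of one way to build a safer key. The helper name `build_object_name`, the replacement character, and the `.bin` fallback extension are assumptions made for illustration, not part of the MinIO API.

```python
import re
import uuid
from pathlib import PurePosixPath


def build_object_name(company_name, doc_type, filename):
    """Return '<company>/<doc_type>/<uuid><ext>' with unsafe characters replaced."""

    def clean(part):
        # Collapse path separators, whitespace and control characters into '_'.
        cleaned = re.sub(r'[\\/\s\x00-\x1f]+', '_', part.strip())
        # Strip leading/trailing dots and underscores (blocks '..' segments).
        cleaned = cleaned.strip('._')
        return cleaned or 'unknown'

    # Keep the original extension in lowercase; '.bin' is an arbitrary fallback.
    ext = PurePosixPath(filename).suffix.lower() or '.bin'
    return f"{clean(company_name)}/{clean(doc_type)}/{uuid.uuid4()}{ext}"


example = build_object_name("A/B 公司", "DISCHARGE_PERMIT", "scan.PNG")
```

A key built this way always has exactly three path segments, so listing objects by company prefix stays predictable.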
Create a scheduled task that checks every day for qualifications that are about to expire.
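The daily check comes down to comparing each expiry date with today's date. As a rough sketch, the function below maps a record onto the VALID, EXPIRING, and EXPIRED values used by the table's status column. The function name `classify_status` and the 30-day default warning window are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def classify_status(expiry_date, today, warning_days=30):
    """Classify a qualification by how close its expiry date is."""
    if expiry_date < today:
        return "EXPIRED"
    # A document expiring today still counts as expiring, not expired.
    if expiry_date <= today + timedelta(days=warning_days):
        return "EXPIRING"
    return "VALID"


today = date(2024, 1, 1)
statuses = [
    classify_status(date(2023, 12, 31), today),  # already past
    classify_status(date(2024, 1, 31), today),   # exactly 30 days left
    classify_status(date(2024, 2, 1), today),    # 31 days left
]
```

Passing `today` in as an argument rather than calling `date.today()` inside the function keeps the logic easy to test with fixed dates.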
Set up the system cron job:
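```
# Edit the crontab
crontab -e

# Add the following line to run the check every day at 9:00 AM
0 9 * * * /usr/bin/python3 /opt/archive-system/check_expiry.py
```

Create the check script check_expiry.py (it needs the psycopg2 driver, for example via pip install psycopg2-binary):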
```
import smtplib
from datetime import date, timedelta
from email.mime.text import MIMEText

import psycopg2


def check_and_notify():
    conn = psycopg2.connect(
        host="localhost",
        database="archive_db",
        user="postgres",
        password="YourStrongPassword123"
    )
    cursor = conn.cursor()

    # Find valid qualifications that expire within the next 30 days
    today = date.today()
    cursor.execute("""
        SELECT qualification_number, company_name, expiry_date
        FROM environmental_qualifications
        WHERE expiry_date BETWEEN %s AND %s
          AND status = 'VALID'
    """, (today, today + timedelta(days=30)))
    expiring = cursor.fetchall()

    if expiring:
        # Build the reminder email
        lines = [f"{q[0]} - {q[1]} expires on {q[2]}" for q in expiring]
        msg = MIMEText(
            "The following environmental qualifications are about to expire:\n\n"
            + "\n".join(lines)
        )
        msg['Subject'] = 'Environmental qualification expiry reminder'
        msg['From'] = 'archive-system@yourcompany.com'
        msg['To'] = 'admin@yourcompany.com'

        # Send it over SMTP with STARTTLS
        with smtplib.SMTP('smtp.yourcompany.com', 587) as server:
            server.starttls()
            server.login('user', 'password')
            server.send_message(msg)

    cursor.close()
    conn.close()


if __name__ == "__main__":
    # Entry point for the cron job
    check_and_notify()
```

Use FastAPI to build the RESTful API:
```
pip install fastapi uvicorn python-multipart psycopg2-binary minio
```

Create the main application file main.py:
```
import os
import uuid
from datetime import datetime

import psycopg2
from fastapi import FastAPI, File, Form, UploadFile
from psycopg2.extras import Json

from ocr_processor import process_qualification_document
from storage_client import ArchiveStorage

app = FastAPI()
storage = ArchiveStorage()


@app.post("/api/qualifications/upload")
async def upload_qualification(
    company_name: str = Form(...),
    qualification_type: str = Form(...),
    issue_date: str = Form(...),
    file: UploadFile = File(...)
):
    # Save the upload to a temporary file, keeping its extension.
    # Note: the OCR step loads images with cv2.imread, so PDF uploads
    # have to be converted to images before OCR.
    suffix = os.path.splitext(file.filename or "")[1] or ".pdf"
    temp_path = f"/tmp/{uuid.uuid4()}{suffix}"
    with open(temp_path, "wb") as f:
        content = await file.read()
        f.write(content)

    # Extract key fields with OCR
    ocr_data = process_qualification_document(temp_path)

    # Store the file in MinIO
    object_name = storage.upload_file(temp_path, company_name, qualification_type)

    # Save the record to the database
    conn = psycopg2.connect(
        host="localhost",
        database="archive_db",
        user="postgres",
        password="YourStrongPassword123"
    )
    cursor = conn.cursor()
    cursor.execute("""
        INSERT INTO environmental_qualifications
            (qualification_number, company_name, qualification_type,
             issue_date, expiry_date, issuing_authority, file_metadata)
        VALUES (%s, %s, %s, %s, %s, %s, %s)
    """, (
        ocr_data.get('qualification_number', ''),
        company_name,
        qualification_type,
        datetime.strptime(issue_date, '%Y-%m-%d').date(),
        datetime.strptime(ocr_data.get('expiry_date', '2099-12-31'), '%Y-%m-%d').date(),
        ocr_data.get('issuing_authority', ''),
        # Json() adapts the dict so psycopg2 can write it to the JSONB column
        Json({'storage_path': object_name, 'original_filename': file.filename})
    ))
    conn.commit()
    cursor.close()
    conn.close()

    return {"message": "Upload succeeded", "object_name": object_name}
```

Use Nginx as the reverse proxy and configure an SSL certificate.
Create /etc/nginx/sites-available/archive.conf:
```
server {
    listen 80;
    server_name archive.yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name archive.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/archive.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/archive.yourdomain.com/privkey.pem;

    location /api/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location / {
        root /var/www/archive-frontend;
        index index.html;
        try_files $uri $uri/ /index.html;
    }
}
```

Create the systemd service file /etc/systemd/system/archive-api.service:
```
[Unit]
Description=Archive Management API
After=network.target

[Service]
User=www-data
WorkingDirectory=/opt/archive-system
ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8000
Restart=always

[Install]
WantedBy=multi-user.target
```

Start the service:
```
sudo systemctl daemon-reload
sudo systemctl enable archive-api
sudo systemctl start archive-api
```

Configure daily automatic backups of the database and the file storage.
Create the backup script /opt/backup/backup.sh:
```
#!/bin/bash
BACKUP_DIR="/opt/backup/data"
DATE=$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# Back up PostgreSQL
docker exec postgres-archive pg_dump -U postgres archive_db > \
    "$BACKUP_DIR/archive_db_$DATE.sql"

# Back up MinIO data with the mc client
# (assumes an mc alias named "local" already points at the MinIO server)
/usr/local/bin/mc mirror --overwrite \
    local/qualifications \
    "$BACKUP_DIR/minio_$DATE/"

# Compress the backup
tar -czf "$BACKUP_DIR/archive_backup_$DATE.tar.gz" \
    "$BACKUP_DIR/archive_db_$DATE.sql" \
    "$BACKUP_DIR/minio_$DATE/"

# Keep only the last 30 days of backups
find "$BACKUP_DIR" -name "archive_backup_*.tar.gz" -mtime +30 -delete
```

Make the backup script executable and add it to cron:
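```
chmod +x /opt/backup/backup.sh

# Run the backup every day at 2:00 AM (add this line with crontab -e)
0 2 * * * /opt/backup/backup.sh
```

At this point, a complete archive management system with electronic filing of environmental qualifications, OCR recognition, and automatic expiry reminders is deployed. The system checks qualification validity every day and sends reminders, all files are stored in distributed object storage, and documents can be searched and downloaded through the web interface. Multi-level approval workflows, mobile support, and other modules can be added later as needed.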