What this means
Enterprise knowledge lives across multiple platforms: SharePoint for document management and intranets, Confluence for team wikis and technical documentation, shared drives for working files and project archives. Connecting AI retrieval to these sources means ingesting their content into a vector store so it can be searched semantically at query time.
The process involves four components: a connector that authenticates to the source system and pulls content via API; a processing pipeline that extracts text, applies chunking and attaches metadata; a vector store that indexes the content for retrieval; and a permissions layer that ensures results are scoped to what the querying user is authorised to see.
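The processing step described above (chunking plus metadata attachment) can be illustrated with a minimal sketch. It assumes word-based windows with a fixed overlap; the `Chunk` class, the `chunk_document` function and the metadata field names are hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_document(text, source, doc_id, allowed_groups, max_words=200, overlap=40):
    """Split text into overlapping word windows and attach metadata to each."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(Chunk(
            text=" ".join(window),
            metadata={
                "source": source,              # e.g. "sharepoint" or "confluence"
                "doc_id": doc_id,              # identifier in the source system
                "chunk_index": len(chunks),    # position of this chunk in the document
                "allowed_groups": sorted(allowed_groups),  # used later for filtering
            },
        ))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```

Carrying `allowed_groups` on every chunk is one way to make the permissions layer enforceable at query time; production pipelines often chunk on headings or sentences rather than raw word counts.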
Why it matters for business
The business case is direct: employees spend significant time searching for information they know exists but cannot locate. An AI assistant that can search across SharePoint, Confluence and other internal sources in response to natural-language queries reduces this friction. For knowledge-intensive Australian organisations — professional services, financial services, healthcare, government — this translates to measurable time savings and reduced risk of staff operating on outdated information.
According to Anthropic's 2026 enterprise AI report, internal process automation and data retrieval are among the highest-impact non-coding AI use cases, cited by 48% and 60% of organisations respectively. Retrieval over internal knowledge sources is the foundational layer for both.
How it works technically
SharePoint: The Microsoft Graph API provides programmatic access to SharePoint document libraries, pages and lists. Connectors authenticate with OAuth 2.0 against Microsoft Entra ID (formerly Azure AD) and can enumerate site collections, libraries and files together with their associated permissions (the users and groups granted access to each item). Content is extracted as text, chunked and indexed. Permission metadata, meaning which Entra ID groups can access which files, is stored alongside the vectors and used to filter retrieval queries.
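As an illustration of turning source permissions into filterable metadata, the sketch below flattens a list of permission entries into a set of principal identifiers. The sample payload is hand-written and simplified; its field names (`roles`, `grantedToV2`, `grantedToIdentitiesV2`) are modelled on the Microsoft Graph permission resource as an assumption and should be checked against the current Graph documentation. The function name and the `kind:id` format are hypothetical.

```python
READ_ROLES = {"read", "write", "owner"}  # assumption: any of these implies read access


def principals_with_read_access(permissions):
    """Collect 'user:<id>' and 'group:<id>' strings from permission entries."""
    allowed = set()
    for perm in permissions:
        if not READ_ROLES & set(perm.get("roles", [])):
            continue  # entry grants no role we treat as readable
        identities = []
        if "grantedToV2" in perm:
            identities.append(perm["grantedToV2"])
        identities.extend(perm.get("grantedToIdentitiesV2", []))
        for identity in identities:
            for kind in ("user", "group"):
                entry = identity.get(kind)
                if entry and entry.get("id"):
                    allowed.add(f"{kind}:{entry['id']}")
    return allowed


# Hand-written, simplified example payload (not a real API response).
SAMPLE_PERMISSIONS = [
    {"roles": ["read"], "grantedToV2": {"group": {"id": "g-finance"}}},
    {"roles": ["write"], "grantedToIdentitiesV2": [{"user": {"id": "u-alice"}}]},
    {"roles": [], "grantedToV2": {"user": {"id": "u-nobody"}}},
]
```

The resulting set would be stored with each chunk from that file. Sharing links and SharePoint site groups are deliberately ignored here and would need their own handling.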
Confluence: Atlassian's REST API provides access to spaces, pages and attachments. Confluence layers page-level restrictions on top of space permissions, and a view restriction on a parent page also applies to its child pages, so the effective audience of each page has to be resolved during ingestion rather than read from a single field. SharePoint raises a similar issue wherever a library, folder or item breaks permission inheritance. Confluence content is typically structured as a hierarchy of spaces → pages → child pages, which informs the chunking strategy.
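Resolving inherited restrictions can be sketched as a walk up the page tree. The example assumes that a user must satisfy the view restriction on the page and on every restricted ancestor, and that principals (users or groups) have already been flattened to plain strings. The function and argument names are hypothetical; this is not a Confluence client.

```python
def effective_viewers(page_id, parents, view_restrictions, space_viewers):
    """Return the principals who can view a page after applying inherited restrictions.

    parents:           page id -> parent page id (None for a top-level page)
    view_restrictions: page id -> set of principals; pages absent here are unrestricted
    space_viewers:     principals with view permission on the whole space
    """
    allowed = set(space_viewers)
    seen = set()
    current = page_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected in page hierarchy at {current!r}")
        seen.add(current)
        restriction = view_restrictions.get(current)
        if restriction:
            allowed &= set(restriction)  # each restricted level narrows the audience
        current = parents.get(current)
    return allowed
```

For a child page restricted to `{"bob", "carol"}` under a parent restricted to `{"alice", "bob"}`, only `bob` remains, which is the set that would be stored with the child page's chunks.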
Shared drives and network files: Traditional file shares (CIFS/SMB, NFS) or cloud drives (Google Drive, OneDrive) require a crawling component that walks directory structures, extracts content from supported file types (DOCX, PDF, XLSX, PPTX, TXT) and maps NTFS or cloud ACLs to a permission model the retrieval layer can use. File format handling is a non-trivial engineering task — PDFs with scanned pages require OCR; complex XLSX files may not chunk sensibly into prose.
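One way to avoid treating every file as plain text is to route files to an extraction strategy before any parsing happens. The sketch below is a routing decision only: the strategy names, the function name and the characters-per-page threshold are assumptions for illustration, and the actual extractors (text parser, OCR engine, table handler) are left out.

```python
from pathlib import Path

PROSE_TYPES = {".docx", ".pptx", ".txt"}
TABULAR_TYPES = {".xlsx"}


def choose_extraction(path, pdf_text_chars_per_page=None, min_chars_per_page=50):
    """Pick an extraction strategy from the file extension.

    For PDFs, the caller supplies the average number of characters found in the
    text layer per page; a very low value suggests scanned pages that need OCR.
    """
    suffix = Path(path).suffix.lower()
    if suffix in PROSE_TYPES:
        return "text"
    if suffix in TABULAR_TYPES:
        return "table"  # handle rows and sheets explicitly instead of chunking as prose
    if suffix == ".pdf":
        if pdf_text_chars_per_page is None:
            return "inspect"  # text layer not measured yet
        return "text" if pdf_text_chars_per_page >= min_chars_per_page else "ocr"
    return "skip"  # unsupported type: log it rather than fail silently
```

Returning an explicit `"skip"` or `"inspect"` result makes unsupported or ambiguous files visible in ingestion logs.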
A common architectural pattern is a scheduled ingestion pipeline that runs on a defined cadence (for example daily) or in response to change events, pulling new or modified content from the source systems, processing it through the chunking and embedding pipeline, and upserting the results into the vector store. Change detection via APIs (delta queries in Microsoft Graph, Confluence change events) reduces the ingestion load for large corpora.
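Where a source offers only a listing with version markers rather than a change feed, the incremental step can be approximated by comparing what the source reports with what the index already holds. In this sketch the version marker is any comparable string, such as an ETag or last-modified timestamp; the function name and data shapes are hypothetical.

```python
def plan_sync(source_versions, indexed_versions):
    """Work out which documents to (re)ingest and which to remove from the index.

    source_versions:  doc id -> version marker currently reported by the source
    indexed_versions: doc id -> version marker recorded at the last ingestion
    """
    to_upsert = sorted(
        doc_id
        for doc_id, version in source_versions.items()
        if indexed_versions.get(doc_id) != version  # new or modified
    )
    to_delete = sorted(
        doc_id for doc_id in indexed_versions if doc_id not in source_versions
    )
    return to_upsert, to_delete
```

Only the documents in `to_upsert` go back through chunking and embedding, and those in `to_delete` are removed so that deleted source content stops appearing in results. A delta API returns the equivalent of these two lists directly.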
Practical implementation considerations
Permission handling is the most consequential engineering decision in a multi-source retrieval deployment. A retrieval system that ignores source permissions will surface documents to users who should not have access to them. Where those documents contain personal information, this can amount to unauthorised access or disclosure under Australia's Privacy Act 1988, and it may be reportable under the Notifiable Data Breaches scheme if serious harm is likely.
The implementation options are: (1) user-level permission filtering, where each retrieval query is constrained at query time to documents the querying user can access, using stored permission metadata; or (2) per-user vector indexes, where each user queries an index that contains only their permitted corpus. Option 1 scales better but depends on the stored permission metadata being refreshed whenever source permissions change; option 2 is simpler to reason about but storage-intensive, because shared documents are stored once per user.
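Option 1 can be sketched with a small in-memory index. The example filters candidates by group membership before ranking, so documents the user cannot access never enter the result set. The entry layout, the `search` function and the use of cosine similarity over plain lists are assumptions for illustration; a real vector store would apply the same filter through its own metadata-filtering feature.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, user_groups, top_k=3):
    """Rank only the entries that share at least one group with the user."""
    user_groups = set(user_groups)
    permitted = [e for e in index if user_groups & set(e["allowed_groups"])]
    ranked = sorted(permitted, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return [e["doc_id"] for e in ranked[:top_k]]


# Toy index with two-dimensional vectors, for illustration only.
INDEX = [
    {"doc_id": "hr-policy", "vector": [1.0, 0.0], "allowed_groups": ["hr"]},
    {"doc_id": "handbook", "vector": [0.9, 0.1], "allowed_groups": ["all-staff"]},
    {"doc_id": "roadmap", "vector": [0.0, 1.0], "allowed_groups": ["all-staff", "eng"]},
]
```

Filtering before ranking, rather than trimming results afterwards, means a restricted document can neither appear in the results nor take a top-k slot from a permitted one.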
Content deduplication is the second critical consideration. Documents in SharePoint and Confluence are frequently duplicated: the same policy may exist in three versions across four team sites. Without deduplication, the vector store contains redundant embeddings that produce repetitive or conflicting retrieved context. Edison AI's implementation team treats deduplication as a standard step in ingestion pipeline design.
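A minimal deduplication pass can be sketched with content hashing. This version only catches copies that are identical after whitespace and case normalisation, and keeps the most recently modified copy; genuinely different versions of a document need a near-duplicate technique such as shingling with MinHash, which is not shown. The document field names are hypothetical, and `modified` is assumed to be an ISO 8601 date string so that string comparison orders dates correctly.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def deduplicate(docs):
    """Keep one document per fingerprint, preferring the newest 'modified' value."""
    newest = {}
    for doc in docs:
        key = fingerprint(doc["text"])
        if key not in newest or doc["modified"] > newest[key]["modified"]:
            newest[key] = doc
    return list(newest.values())
```

If the copies carry different permissions, the pipeline also has to decide which permission set the surviving copy keeps. That decision is left out of this sketch.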
Connector maintenance is ongoing. As source system structures change — new sites, renamed spaces, restructured drives — connectors need to be updated. This operational overhead should be budgeted explicitly.
Common mistakes
- Deploying without permission filtering. This is the most serious failure mode and the most common shortcut taken in proof-of-concept builds that proceed to production.
- Ingesting all content without curation. The full corpus of a large SharePoint environment includes draft documents, superseded policies, personal working files and content from departed employees. Indiscriminate ingestion degrades retrieval quality and increases the surface area for permission errors.
- Relying on connectors without testing retrieval quality. A connector that successfully ingests documents does not guarantee that relevant documents are retrieved for real user queries. Evaluation on a representative query set is required.
- No change detection or re-indexing schedule. An initial ingestion that is never refreshed produces a stale knowledge base within weeks.
- Ignoring file type diversity. Treating all files as extractable text fails silently for scanned PDFs, image-heavy presentations and structured spreadsheets that do not reduce to natural language prose.
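The evaluation step mentioned in the list above can start very simply. The sketch below computes a hit rate at k: the share of test queries for which at least one known-relevant document appears in the top k results. The function name and data shapes are assumptions, and the relevance judgements would come from a hand-labelled set of representative queries.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries with at least one relevant document in the top k.

    results:  query -> ranked list of retrieved doc ids
    relevant: query -> set of doc ids judged relevant for that query
    """
    if not relevant:
        return 0.0
    hits = 0
    for query, expected in relevant.items():
        top = results.get(query, [])[:k]
        if set(expected) & set(top):
            hits += 1
    return hits / len(relevant)
```

Tracking this number across changes to chunking, curation or connectors gives an early signal when ingestion succeeds but retrieval quality drops.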
What leaders should do next
Define the scope of knowledge sources to connect before beginning any technical work. For each source, map the permission model and confirm that your retrieval architecture can enforce it. Commission a content audit of the highest-priority source to assess quality before ingestion. Build a change detection and re-indexing pipeline from the start — retrofitting it later is significantly more costly. Plan for ongoing connector maintenance as a recurring operational cost.
Edison AI builds bespoke AI systems — including retrieval over your own documents — for Australian businesses.