Vector Database Examples

References:

Building AI-powered apps on Google Cloud databases using pgvector, LLMs and LangChain, colab
Postgres pgvector Extension - Vector Database with PostgreSQL / Langchain Integration
Duncan Blythe: An overview of vector search libraries and databases (2023)
Hacker News thread: Vector search just got up to 10x faster, easier to set up, and vertically scalable
Dmitry Kan:
- Not All Vector Databases Are Made Equal (2023)
- Where Vector Search is Taking Us (Sept 2022)
Reddit: Open source vector databases? (2023)
Marqo
- Simple wiki demo - gives errors
- Create marqo venv, pip install marqo pytorch
https://healthsearch-frontend.onrender.com/
Tutorial: How to get GPT to “read” from PDFs
12 Vector Databases For 2023: A Review
LabelStudio
- Customized Layout Detection for Scientific PDFs with LayoutParser and Label Studio (2022)
Unstructured
- Install
  - pip install unstructured unstructured-inference
  - pip install git+https://github.com/facebookresearch/detectron2.git
  - sudo apt install tesseract-ocr
  - sudo apt install libtesseract-dev
- examples
- Matt Robinson: The Unstructured library now includes utilities to make loading Unstructured outputs into Weaviate quick and easy.
- https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate
- https://baincapitalventures.com/insight/how-unstructured-is-powering-the-llm-data-stack/
- Who here were able to use Unstructured for parsing scanned PDFs? #5969
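With the packages above installed, parsing a PDF might look like the sketch below. It assumes `partition_pdf` from `unstructured.partition.pdf` and that each returned element exposes `category` and `text`. The file name `example.pdf` and the helpers `group_by_category` and `parse_pdf` are hypothetical, and `strategy="hi_res"` is an assumption (believed to be the mode that relies on detectron2 and Tesseract).

```python
from collections import defaultdict


def group_by_category(elements):
    """Group element texts by category (e.g. Title, NarrativeText)."""
    grouped = defaultdict(list)
    for el in elements:
        # Fall back to the class name if an element has no `category` attribute.
        category = getattr(el, "category", type(el).__name__)
        grouped[category].append(str(getattr(el, "text", el)))
    return dict(grouped)


def parse_pdf(path="example.pdf"):
    """Partition a PDF into elements (needs the packages installed above)."""
    # Imported lazily so group_by_category works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    # strategy="hi_res" is an assumption; other strategies may suit other PDFs.
    return partition_pdf(filename=path, strategy="hi_res")
```

Calling `group_by_category(parse_pdf("example.pdf"))` is one way to see which element types a given PDF produces before deciding what to load into a vector database.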
Weaviate:
- Install
  - Requires unstructured
- Erika Cardenas: Vector Library versus Vector Database (2023)
- Erika Cardenas, Mohd Shukri Hasan: Ingesting PDFs into Weaviate (2023)
Weaviate chat:
- Andrei: Hi, I’m trying to use Weaviate to have chat and summarization with a large number of PDFs (hundreds). But I can’t seem to find a good blog post that explains the process end to end. Any suggestions on which blog post or tutorial to use as example? Been trying to use unstructured to parse the pdf, and insert it into Weaviate. That seems to work. However, the class properties of the pdf parsed by unstructured do not seem to be documented anywhere. There is one blog example, https://weaviate.io/blog/ingesting-pdfs-into-weaviate, but that only inserts document summaries. There is also a 2021 blog post, https://towardsdatascience.com/getting-started-with-weaviate-python-client-e85d14f19e4f, which shows how to insert news articles into Weaviate. However, the code in that blog post is not running with the latest Weaviate.
  - Erika: I recommend using the Unstructured data loader on LlamaHub: file-unstructured. You can then build your vector store with Weaviate, and then use the query engine. An example of this is here: episode1.ipynb. If you want to summarize the PDF docs, you should build out the index using the Tree Index.
Chroma
- Clone https://github.com/chroma-core/chroma
- In docker-compose.yaml
  - Change port 8000 to 8002, to avoid conflict with Airbyte port 8000
  - Add ALLOW_RESET=TRUE

Here is the docker-compose.yaml:

version: '3.9'

networks:
  net:
    driver: bridge

services:
  server:
    image: server
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - ./:/chroma
      - index_data:/index_data
    command: uvicorn chromadb.app:app --reload --workers 1 --host 0.0.0.0 --port 8002 --log-config log_config.yml
    environment:
      - IS_PERSISTENT=TRUE
      - ALLOW_RESET=TRUE
      - CHROMA_SERVER_AUTH_PROVIDER=${CHROMA_SERVER_AUTH_PROVIDER}
      - CHROMA_SERVER_AUTH_CREDENTIALS_FILE=${CHROMA_SERVER_AUTH_CREDENTIALS_FILE}
      - CHROMA_SERVER_AUTH_CREDENTIALS=${CHROMA_SERVER_AUTH_CREDENTIALS}
      - CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER=${CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER}
    ports:
      - 8002:8002
    networks:
      - net

volumes:
  index_data:
    driver: local
  backups:
    driver: local
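
Once the containers defined in this file are running, a quick way to confirm that the published port is reachable is a plain TCP check. This is a minimal sketch using only the standard library. The helper name `is_port_open` is hypothetical, and a successful connection only shows that something is listening on the port, not that Chroma itself is healthy.

```python
import socket


def is_port_open(host="localhost", port=8002, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print("port 8002 reachable:", is_port_open())
```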

Run docker-compose up -d --build to bring the stack up, and docker-compose down to bring it down. With the server running, you can connect to the remote Chroma DB as follows:

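# create the chroma client
import uuid

import chromadb
from chromadb.config import Settings
# import path depends on the LangChain version; newer releases move Chroma
# to langchain_community.vectorstores
from langchain.vectorstores import Chroma

# `docs` (a list of LangChain Document objects) and `embedding_function`
# are assumed to be defined earlier, e.g. by a document loader and an embeddings class

# the port must match the one published in docker-compose.yaml
client = chromadb.HttpClient(host='localhost', port=8002, settings=Settings(allow_reset=True))
client.reset()  # resets the database; needs ALLOW_RESET=TRUE on the server
collection = client.create_collection("my_collection")
for doc in docs:
    collection.add(
        ids=[str(uuid.uuid1())], metadatas=doc.metadata, documents=doc.page_content
    )

# tell LangChain to use our client and collection name
db4 = Chroma(client=client, collection_name="my_collection", embedding_function=embedding_function)
query = "What did the president say about Ketanji Brown Jackson"
docs = db4.similarity_search(query)
print(docs[0].page_content)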
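
Conceptually, similarity_search returns the stored documents whose embeddings lie closest to the embedding of the query. The toy sketch below illustrates that ranking idea in plain Python with made-up 3-dimensional vectors. It is not how Chroma or LangChain implement the search (real vector stores typically use an approximate nearest-neighbour index and may use a different distance metric), and the names `cosine_similarity`, `top_k`, and `toy_docs` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k=2):
    """Return the texts of the k items most similar to query_vec.

    items is a list of (text, vector) pairs.
    """
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up embeddings purely for illustration.
toy_docs = [
    ("judicial nomination", [0.9, 0.1, 0.0]),
    ("energy prices", [0.1, 0.9, 0.0]),
    ("supreme court", [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], toy_docs))
```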

Aug 31st Unstructured, Weaviate, Arize webinar