Vector Database Examples
References:
- Building AI-powered apps on Google Cloud databases using pgvector, LLMs and LangChain, colab
- Postgres pgvector Extension - Vector Database with PostgreSQL / Langchain Integration
- Duncan Blythe: An overview of vector search libraries and databases (2023)
- Hacker News thread: Vector search just got up to 10x faster, easier to set up, and vertically scalable
- Dmitry Kan:
- Not All Vector Databases Are Made Equal (2023)
- Where Vector Search is Taking Us (Sept 2022)
- Reddit: Open source vector databases? (2023)
- Marqo
- Simple wiki demo - gives errors
- Create a marqo venv, then install (the PyTorch package on PyPI is named torch):
pip install marqo torch
- https://healthsearch-frontend.onrender.com/
- Tutorial: How to get GPT to “read” from PDFs
- LabelStudio
- Unstructured
- Install
pip install unstructured unstructured-inference
pip install git+https://github.com/facebookresearch/detectron2.git
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
- examples
- Matt Robinson: The Unstructured library now includes utilities to make loading Unstructured outputs into Weaviate quick and easy.
- https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate
- https://baincapitalventures.com/insight/how-unstructured-is-powering-the-llm-data-stack/
- Who here were able to use Unstructured for parsing scanned PDFs? #5969
- Weaviate:
- Install: requires unstructured
- Erika Cardenas: Vector Library versus Vector Database (2023)
- Erika Cardenas, Mohd Shukri Hasan: Ingesting PDFs into Weaviate (2023)
- Weaviate chat:
- Andrei: Hi, I’m trying to use Weaviate to have chat and summarization with a large number of PDFs (hundreds). But I can’t seem to find a good blog post that explains the process end to end. Any suggestions on which blog post or tutorial to use as an example? Been trying to use unstructured to parse the PDF and insert it into Weaviate. That seems to work. However, the class properties of the PDF parsed by unstructured do not seem to be documented anywhere. There is one blog example, https://weaviate.io/blog/ingesting-pdfs-into-weaviate, but that only inserts document summaries. There is also a 2021 blog post, https://towardsdatascience.com/getting-started-with-weaviate-python-client-e85d14f19e4f, which shows how to insert news articles into Weaviate. However, the code in that blog post does not run with the latest Weaviate.
- Erika: I recommend using the Unstructured data loader on LlamaHub: file-unstructured. You can then build your vector store with Weaviate, and then use the query engine. An example of this is here: episode1.ipynb. If you want to summarize the PDF docs, you should build the index using the Tree Index.
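The workflow discussed in this thread (parse PDFs, split them into passages, store each passage with enough metadata to retrieve it later for chat or summarization) can be sketched without any library. The sketch below assumes parsed elements arrive as plain dicts with the hypothetical keys `text`, `filename`, and `page_number`. These keys are not a documented Unstructured schema, and the property names on the output objects are illustrative, not a Weaviate class definition.

```python
from typing import Dict, Iterable, List


def group_into_passages(elements: Iterable[Dict], max_chars: int = 500) -> List[Dict]:
    """Merge consecutive elements from the same file into passages.

    A passage grows until adding the next element would exceed max_chars.
    A single element longer than max_chars becomes its own passage.
    """
    passages: List[Dict] = []
    current = None
    for el in elements:
        text = el["text"].strip()
        if not text:
            continue  # skip empty elements
        same_file = current is not None and current["source"] == el["filename"]
        fits = same_file and len(current["content"]) + 1 + len(text) <= max_chars
        if fits:
            current["content"] += " " + text
            current["page_end"] = el["page_number"]
        else:
            if current is not None:
                passages.append(current)
            # Hypothetical property names for the stored object
            current = {
                "source": el["filename"],
                "page_start": el["page_number"],
                "page_end": el["page_number"],
                "content": text,
            }
    if current is not None:
        passages.append(current)
    return passages
```

Each returned dict could then be passed to whatever insert call the chosen vector database client provides, keeping `source` and the page range so that answers can point back to the original PDF.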
- Chroma
- Clone https://github.com/chroma-core/chroma
- In docker-compose.yaml:
- Change port 8000 to 8002, to avoid a conflict with Airbyte, which uses port 8000
- Add ALLOW_RESET=TRUE to the environment
Here is the resulting docker-compose.yaml:
version: '3.9'

networks:
  net:
    driver: bridge

services:
  server:
    image: server
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - ./:/chroma
      - index_data:/index_data
    command: uvicorn chromadb.app:app --reload --workers 1 --host 0.0.0.0 --port 8002 --log-config log_config.yml
    environment:
      - IS_PERSISTENT=TRUE
      - ALLOW_RESET=TRUE
      - CHROMA_SERVER_AUTH_PROVIDER=${CHROMA_SERVER_AUTH_PROVIDER}
      - CHROMA_SERVER_AUTH_CREDENTIALS_FILE=${CHROMA_SERVER_AUTH_CREDENTIALS_FILE}
      - CHROMA_SERVER_AUTH_CREDENTIALS=${CHROMA_SERVER_AUTH_CREDENTIALS}
      - CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER=${CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER}
    ports:
      - 8002:8002
    networks:
      - net

volumes:
  index_data:
    driver: local
  backups:
    driver: local
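The port change above exists to avoid a clash with another service already bound to 8000. One way to check which candidate port is free before editing the compose file is sketched below, using only the Python standard library. The helper names `port_in_use` and `first_free_port` are illustrative and not part of Chroma or Docker.

```python
import socket
from typing import Iterable, Optional


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception
        return s.connect_ex((host, port)) == 0


def first_free_port(candidates: Iterable[int], host: str = "127.0.0.1") -> Optional[int]:
    """Return the first candidate port with no listener, or None if all are taken."""
    for port in candidates:
        if not port_in_use(port, host):
            return port
    return None
```

For example, `first_free_port([8000, 8001, 8002])` returns the first of those ports with no listener on localhost, which can then be used in both the `command` and `ports` entries.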
Run docker-compose up -d --build to bring the stack up, and docker-compose down to bring it down. You can then connect to the remote Chroma DB as follows:
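# create the chroma client
import uuid

import chromadb
from chromadb.config import Settings
# Import path used by 2023-era LangChain; newer releases may expose Chroma from a different package.
from langchain.vectorstores import Chroma

# NOTE: `docs` (a list of LangChain Document objects) and `embedding_function`
# are assumed to be defined earlier, e.g. by a document loader and an embeddings class.
client = chromadb.HttpClient(host='localhost', port=8002, settings=Settings(allow_reset=True))
client.reset()  # resets the database (the server must run with ALLOW_RESET=TRUE)
collection = client.create_collection("my_collection")
for doc in docs:
    collection.add(
        ids=[str(uuid.uuid1())], metadatas=doc.metadata, documents=doc.page_content
    )

# tell LangChain to use our client and collection name
db4 = Chroma(client=client, collection_name="my_collection", embedding_function=embedding_function)
query = "What did the president say about Ketanji Brown Jackson"
docs = db4.similarity_search(query)
print(docs[0].page_content)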
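Conceptually, `similarity_search` embeds the query and returns the stored documents whose vectors are closest to it. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity in plain Python. It is not how Chroma or LangChain implement search (real systems use learned embeddings and approximate nearest-neighbor indexes), and the function names, texts, and vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best (brute force)."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up "embeddings": the first axis loosely stands for legal topics
store = [
    ("doc about courts", [0.9, 0.1, 0.0]),
    ("doc about cooking", [0.0, 0.2, 0.9]),
    ("doc about judges", [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Brute-force scoring like this is fine for a handful of vectors; a vector database earns its keep once the collection is too large to scan on every query.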
- Aug 31st Unstructured, Weaviate, Arize webinar