This integration transfers data from Apify Actors to a PostgreSQL database (with the PGVector extension).
The Apify PGVector integration transfers selected data from Apify Actors to PostgreSQL (with PGVector extension). It processes the data, optionally splits it into chunks, computes embeddings, and saves them to PostgreSQL.
This integration supports incremental updates, updating only the data that has changed. This approach reduces unnecessary embedding computation and storage operations, making it suitable for search and retrieval-augmented generation (RAG) use cases.
💡 Note: This Actor is meant to be used together with other Actors' integration sections. For instance, if you are using the Website Content Crawler, you can activate PGVector integration to save web data as vectors to PostgreSQL.
The Apify PGVector integration computes text embeddings and stores them in PostgreSQL. It uses LangChain to compute embeddings and interact with PGVector.
It performs the following steps:
- Optionally splits text data into chunks using langchain's RecursiveCharacterTextSplitter (enable/disable using performChunking and specify chunkSize, chunkOverlap)
- Updates data in the database according to the selected strategy (dataUpdatesStrategy)
- Computes embeddings, e.g., using OpenAI or Cohere (specify embeddings and embeddingsConfig)
To utilize this integration, ensure you have:
- A PostgreSQL database with the PGVector extension. You need to know postgresSqlConnectionStr and postgresCollectionName.
The configuration consists of three parts: PGVector, embeddings provider, and data.
Ensure that the vector size of your embeddings aligns with the configuration of your PostgreSQL.
For instance, if you're using the text-embedding-3-small model from OpenAI, it generates vectors of size 1536.
This means your PostgreSQL vector should also be configured to accommodate vectors of the same size, 1536 in this case.
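The size check described above can be sketched in a few lines of Python. This is only an illustration: the function name check_vector_size and the lookup table are hypothetical, and only the 1536 figure for text-embedding-3-small comes from the text above. The other two dimensions are the providers' documented defaults as far as I know and should be verified before relying on them.

```python
# Output dimensions for a few embedding models. Only the first entry is taken
# from this README; the other two are assumed provider defaults.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-multilingual-v3.0": 1024,
}


def check_vector_size(model: str, column_size: int) -> None:
    """Raise ValueError if the model's vector size differs from the column size."""
    expected = EMBEDDING_DIMENSIONS.get(model)
    if expected is None:
        raise ValueError(f"Unknown model {model!r}; add its dimension to the table")
    if expected != column_size:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the database is configured for {column_size}"
        )


# Matching sizes pass silently; a mismatch raises ValueError.
check_vector_size("text-embedding-3-small", 1536)
```

Running such a check before the first write surfaces a mismatch early, before any embeddings have been computed.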
For detailed input information refer to the Input page.
{
  "postgresSqlConnectionStr": "postgresql://postgres:password@localhost:5432/apify",
  "postgresCollectionName": "apify-collection"
}
{
  "embeddingsProvider": "OpenAIEmbeddings",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {"model": "text-embedding-3-large"}
}
Data is transferred in the form of a dataset from Website Content Crawler, which provides a dataset with the following output fields (truncated for brevity):
{
  "url": "https://www.apify.com",
  "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
  "metadata": {"title": "Apify"}
}
This dataset is then processed by the PGVector integration.
In the integration settings, you need to specify which fields you want to save to PostgreSQL, e.g., ["text"], and which of them should be used as metadata, e.g., {"title": "metadata.title"}.
Without any other configuration, the data is saved to PostgreSQL as is.
{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"}
}
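To make the field selection concrete, here is a minimal Python sketch of how a dot-separated path such as metadata.title could be resolved against a dataset item. The helper names get_by_path and select_fields are hypothetical, and joining several text fields with a newline is an assumption; the integration's actual behaviour may differ.

```python
from typing import Any


def get_by_path(item: dict[str, Any], path: str) -> Any:
    """Follow a dot-separated path such as 'metadata.title' through nested dicts."""
    value: Any = item
    for key in path.split("."):
        value = value[key]
    return value


def select_fields(
    item: dict[str, Any],
    dataset_fields: list[str],
    metadata_fields: dict[str, str],
) -> tuple[str, dict[str, Any]]:
    """Return the text to embed and the metadata to store for one dataset item."""
    # Assumption: multiple text fields are joined with a newline.
    text = "\n".join(str(get_by_path(item, field)) for field in dataset_fields)
    metadata = {name: get_by_path(item, path) for name, path in metadata_fields.items()}
    return text, metadata


item = {
    "url": "https://www.apify.com",
    "text": "Apify is a platform that enables developers to build, run, and share automation tasks.",
    "metadata": {"title": "Apify"},
}
text, metadata = select_fields(item, ["text"], {"title": "metadata.title"})
```

With the sample item from above, text holds the page text and metadata holds the title taken from the nested metadata object.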
Assume that the text data from the Website Content Crawler is too long to embed in one piece.
It therefore needs to be divided into smaller pieces called chunks.
LangChain's RecursiveCharacterTextSplitter can split the text into chunks, which are then saved to the database.
The parameters chunkSize and chunkOverlap control the size of each chunk and how much adjacent chunks overlap.
The right settings depend on your use case; proper chunking helps optimize retrieval and ensures accurate responses.
{
  "datasetFields": ["text"],
  "metadataDatasetFields": {"title": "metadata.title"},
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0
}
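The effect of chunkSize and chunkOverlap can be illustrated with a simplified fixed-window splitter. Note that this is not LangChain's RecursiveCharacterTextSplitter, which additionally tries to break on separators such as paragraphs and sentences; the function split_fixed_window below is a hypothetical stand-in that only shows how the two numbers interact.

```python
def split_fixed_window(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


no_overlap = split_fixed_window("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = split_fixed_window("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Here no_overlap is ["abcd", "efgh", "ij"], while with_overlap is ["abcd", "cdef", "efgh", "ghij"]: a larger overlap repeats more context between neighbouring chunks at the cost of more embeddings to compute and store.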
To control how the integration updates data in the database, use the dataUpdatesStrategy parameter. This parameter allows you to choose between different update strategies based on your use case, such as adding new data, upserting records, or incrementally updating records based on changes (deltas). Below are the available strategies and explanations for when to use each:
- Add data (add)
- Upsert data (upsert): use the dataUpdatesPrimaryDatasetFields parameter to specify which fields are used to uniquely identify each dataset item.
- Delta updates (deltaUpdates): use the dataUpdatesPrimaryDatasetFields parameter to specify which fields are used to uniquely identify each dataset item.
To incrementally update data from the Website Content Crawler to the database, configure the integration to update only the changed or new data.
This is controlled by the dataUpdatesStrategy setting.
This way, the integration minimizes unnecessary updates and ensures that only new or modified data is processed.
A checksum is computed for each dataset item (together with all metadata) and stored in the database alongside the vectors.
When the data is re-crawled, the checksum is recomputed and compared with the stored checksum.
If the checksum is different, the old data (including vectors) is deleted and new data is saved.
Otherwise, only the last_seen_at metadata field is updated to indicate when the data was last seen.
To incrementally update the data, you need to be able to uniquely identify each dataset item.
The dataUpdatesPrimaryDatasetFields parameter specifies which fields are used to uniquely identify each dataset item and helps track content changes across different crawls.
For instance, when working with the Website Content Crawler, you can use the URL as a unique identifier.
{
  "dataUpdatesStrategy": "deltaUpdates",
  "dataUpdatesPrimaryDatasetFields": ["url"]
}
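The checksum comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the integration's implementation: an in-memory dict stands in for the database, SHA-256 over the JSON-serialized item is an assumed checksum, and the function names compute_checksum and apply_delta_update are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def compute_checksum(item: dict) -> str:
    """Hash the whole item (text and metadata) in a key-order-independent way."""
    payload = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_delta_update(store: dict, item: dict, primary_fields: list[str], now: datetime) -> str:
    """Insert, replace, or just touch an item; return which action was taken."""
    key = tuple(item[field] for field in primary_fields)  # e.g. ("https://...",)
    checksum = compute_checksum(item)
    existing = store.get(key)
    if existing is not None and existing["checksum"] == checksum:
        existing["last_seen_at"] = now  # unchanged: only refresh the timestamp
        return "unchanged"
    action = "added" if existing is None else "replaced"
    # New or changed: the old record (and its vectors) would be dropped and
    # the item re-embedded; here we simply overwrite the stored record.
    store[key] = {"checksum": checksum, "item": item, "last_seen_at": now}
    return action


store: dict = {}
first_crawl = datetime(2024, 1, 1, tzinfo=timezone.utc)
second_crawl = datetime(2024, 1, 8, tzinfo=timezone.utc)
page = {"url": "https://www.apify.com", "text": "Apify is a platform."}

first = apply_delta_update(store, page, ["url"], first_crawl)
second = apply_delta_update(store, dict(page), ["url"], second_crawl)
third = apply_delta_update(store, {**page, "text": "Updated text."}, ["url"], second_crawl)
```

In this run the first call adds the page, the second finds an identical checksum and only refreshes last_seen_at, and the third detects changed text and replaces the record. Embeddings would need to be recomputed only in the first and third cases.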
To fully maximize the potential of incremental data updates, it is recommended to start with an empty database. While it is possible to use this feature with an existing database, records that were not originally saved using a prefix or metadata will not be updated.
The integration can delete data from the database that hasn't been crawled for a specified period, which is useful when data becomes outdated, such as when a page is removed from a website.
The deletion feature can be enabled or disabled using the deleteExpiredObjects setting.
For each crawl, the last_seen_at metadata field is created or updated.
This field records the most recent time the data object was crawled.
The expiredObjectDeletionPeriodDays
setting is used to control number of days since the last crawl, after which the data object is considered expired.
If a database object has not been seen for more than the expiredObjectDeletionPeriodDays
, it will be deleted automatically.
The specific value of expiredObjectDeletionPeriodDays
depends on your use case.
expiredObjectDeletionPeriodDays
can be set to 7.To disable this feature, set deleteExpiredObjects
to false
.
{
  "deleteExpiredObjects": true,
  "expiredObjectDeletionPeriodDays": 30
}
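The expiry rule can be expressed as a simple cutoff comparison on last_seen_at. The sketch below uses a plain dict of timestamps in place of the database, and the function name find_expired is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_expired(last_seen: dict[str, datetime], period_days: int, now: datetime) -> list[str]:
    """Return the keys of objects not seen for more than period_days days."""
    cutoff = now - timedelta(days=period_days)
    return sorted(key for key, seen in last_seen.items() if seen < cutoff)


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
last_seen = {
    "https://example.com/fresh": datetime(2024, 1, 30, tzinfo=timezone.utc),
    "https://example.com/stale": datetime(2023, 12, 15, tzinfo=timezone.utc),
}
expired = find_expired(last_seen, period_days=30, now=now)
```

With a 30-day period the cutoff is 1 January 2024, so only the page last seen on 15 December 2023 is returned for deletion.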
💡 If you are using multiple Actors to update the same database, ensure that all Actors crawl the data at the same frequency. Otherwise, data crawled by one Actor might expire due to inconsistent crawling schedules.
This integration will save the selected fields from your Actor to PostgreSQL.
{
  "postgresSqlConnectionStr": "postgresql://postgres:password@localhost:5432/apify",
  "postgresCollectionName": "apify-collection",
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddingsConfig": {
    "model": "text-embedding-3-small"
  },
  "embeddingsProvider": "OpenAI",
  "datasetFields": [
    "text"
  ],
  "dataUpdatesStrategy": "deltaUpdates",
  "dataUpdatesPrimaryDatasetFields": ["url"],
  "expiredObjectDeletionPeriodDays": 7,
  "performChunking": true,
  "chunkSize": 2000,
  "chunkOverlap": 200
}
{
  "postgresSqlConnectionStr": "postgresql://postgres:password@localhost:5432/apify",
  "postgresCollectionName": "apify-collection"
}
{
  "embeddingsApiKey": "YOUR-OPENAI-API-KEY",
  "embeddings": "OpenAI",
  "embeddingsConfig": {"model": "text-embedding-3-large"}
}
{
  "embeddingsApiKey": "YOUR-COHERE-API-KEY",
  "embeddings": "Cohere",
  "embeddingsConfig": {"model": "embed-multilingual-v3.0"}
}
To start a local PostgreSQL database with PGVector using Docker, refer to the docker-compose.yaml file and run the following command:
docker-compose up
You can connect to the database using psql with the following command:
psql -h localhost -p 5432 -U postgres -d apify
Or you can use PGAdmin to connect to the database.
docker run -e PGADMIN_DEFAULT_EMAIL=*@apify.com -e PGADMIN_DEFAULT_PASSWORD=root -p 8000:80 dpage/pgadmin4
LangChain uses the concept of collections to store data.
Collections help to separate data for different projects or use cases.
For each collection, two tables are created: langchain_pg_embedding and langchain_pg_collection.
The langchain_pg_embedding table stores the embeddings, page_content, and associated metadata.
The langchain_pg_collection table stores the list of collections.
LangChain will automatically create these tables when the first embedding is saved to a collection.